|
Rank: Member
Groups: Registered
Joined: 10/22/2015 Posts: 12 Location: Adelaide
|
It seems the web crawler is confused by redirections?
I'm seeing multiple results similar to the following: http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/how-to-apply-for-positions which, if you try to view it, the server truncates to: http://testserver/about/staff
Note that we are using a CMS. Some pages are aliased (i.e. redirected) to others, e.g. /about/staff/contact redirects to /about/staff. But the "jobs" link on that page uses the URL /about/jobs.
Am I doing something wrong, or is this a bug?
|
|
Rank: Advanced Member
Groups: Administrators, Registered
Joined: 8/13/2004 Posts: 2,669 Location: Canada
|
Hi,

I believe when I've seen this in the past it's been due to how the link URLs were interpreted, and how 404s are handled. The crawler can occasionally generate incorrect URLs to try, and if the server/app returns a valid page for one of them instead of a 404, it thinks the generated URL was correct. Then, because that page content is the same as a previous page, it again generates an incorrect link, which the server treats as OK, and the whole thing happens again...

http://testserver/about/staff/contact/jobs/how-to-apply-for-positions
begets
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/how-to-apply-for-positions
and so on.

Which version are you using? Can you post/attach the browser view source for
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/
and
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/

Jim
-your feedback is helpful to other users, thank you!
|
|
Rank: Member
Groups: Registered
Joined: 10/22/2015 Posts: 12 Location: Adelaide
|
I'm using v6.0.
The way the CMS's aliasing works, any child of a redirection is ignored. So because we have an alias from /about/staff/contact to /about/staff, then /about/staff/contact/anything/anything/anything/anything is redirected to /about/staff. So, it will give a valid page. But the Jobs page is /about/jobs, not ../jobs, so I don't see how it gets that one confused.
Surely the crawler should interpret the URLs more like an actual browser would - following what the server says your current URL is, rather than what you "think" it should be.
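To make the aliasing rule above concrete, here is a minimal sketch of how such a prefix-based alias could behave. The `ALIASES` table and the `resolve_alias` function are hypothetical names made up for illustration; they are not the CMS's actual API, just one plausible reading of "any child of a redirection is ignored".

```python
# Hypothetical alias table: alias prefix -> redirect target.
# (Illustrative only; not taken from the real CMS.)
ALIASES = {"/about/staff/contact": "/about/staff"}


def resolve_alias(path: str) -> str:
    """Return the path the server would end up serving for `path`."""
    for alias, target in ALIASES.items():
        # The alias itself, or anything beneath it, collapses to the target.
        if path == alias or path.startswith(alias + "/"):
            return target
    return path


for p in [
    "/about/staff/contact",
    "/about/staff/contact/jobs/how-to-apply-for-positions",
    "/about/jobs",
]:
    print(p, "->", resolve_alias(p))
```

Under this reading, every URL below the alias yields a valid page, so a crawler never gets a 404 to tell it that a generated URL was bogus.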
|
|
Rank: Advanced Member
Groups: Administrators, Registered
Joined: 8/13/2004 Posts: 2,669 Location: Canada
|
Quote:
Surely the crawler should interpret the URLs more like an actual browser would - following what the server says your current URL is, rather than what you "think" it should be.

I agree, that's why it would be great to see the page, so I can figure out what's happening. E.g. if you visit /about/staff/contact, is there a link like href="jobs"...? And then on the jobs page is there a link like href="how-to-apply-for-positions"...?

If for some reason it didn't detect that it was redirected from /about/staff/contact to /about/staff, then any link like href="jobs" is going to lead to the creation of a URL such as /about/staff/contact/jobs instead of what I think you want, which is /about/staff/jobs.

Does that jibe with what you're seeing?

Jim
-your feedback is helpful to other users, thank you!
|
|
Rank: Member
Groups: Registered
Joined: 10/22/2015 Posts: 12 Location: Adelaide
|
Ah, yes, you're right. The links are href="jobs", href="jobs/how-to-apply-for-positions". So it's assuming a base directory of "/about/".
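For anyone following along, the effect of the base URL on a relative link like href="jobs" can be checked with Python's standard `urllib.parse.urljoin`. This is only a sketch of standard relative-URL resolution, not the crawler's actual code. The trailing slash on the requested URL and the missing one on the redirect target are assumptions; they change the result.

```python
from urllib.parse import urljoin

HREF = "jobs"  # relative link from the CMS menu, as quoted above

# Assumption: the crawler requested the alias with a trailing slash and did
# not notice the redirect, so it keeps the requested URL as the base.
requested_url = "http://testserver/about/staff/contact/"

# Assumption: the redirect lands here, with no trailing slash.
final_url = "http://testserver/about/staff"

bogus = urljoin(requested_url, HREF)
genuine = urljoin(final_url, HREF)

print(bogus)    # http://testserver/about/staff/contact/jobs
print(genuine)  # http://testserver/about/jobs
```

Resolving against the post-redirect URL gives the real Jobs page, while resolving against the originally requested URL gives a path under the alias, which the CMS then happily serves.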
Since I can't change that menu (it's generated by the CMS), I'll try limiting crawl depth to 2. The homepage has the full site menu, so this should be enough to fetch all genuine URLs without any bogus ones.
Thanks for your suggestions.
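As a rough illustration of why a depth limit helps, here is a small breadth-first crawl over a made-up link graph. The `LINKS` data and the `crawl` function are hypothetical, and how the real crawler counts depth (whether the start page is depth 0 or 1) is an assumption here, so the right limit in practice may differ by one.

```python
from collections import deque

# Hypothetical link graph: page -> links found on it (illustrative only).
LINKS = {
    "/": ["/about/staff", "/about/jobs", "/about/jobs/how-to-apply-for-positions"],
    "/about/staff": ["/about/staff/contact/jobs"],  # bogus generated URL
    "/about/staff/contact/jobs": ["/about/staff/contact/jobs/jobs"],  # and again
}


def crawl(start, max_depth):
    """Breadth-first crawl; the start page counts as depth 0 (an assumption)."""
    seen = {start: 0}  # url -> depth at which it was first found
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # do not follow links found on pages at the limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


print(sorted(crawl("/", 1)))  # only pages linked from the homepage
print(sorted(crawl("/", 3)))  # bogus URLs start to pile up
```

With the homepage at depth 0, a limit of 1 keeps only the genuine pages in this toy graph; each extra level of depth admits one more layer of bogus URLs, so the limit bounds the problem rather than fixing its cause.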
|
|