Index crawler giving wrong results - SearchUnit - Forum

David
#1 Posted : Thursday, October 22, 2015 5:30:40 AM
Rank: Member

Groups: Registered

Joined: 10/22/2015
Posts: 12
Location: Adelaide
It seems the web crawler is confused by redirections?

I'm seeing multiple results similar to the following:
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/how-to-apply-for-positions
If you try to view that URL, the server redirects it to:
http://testserver/about/staff

Note, we are using a CMS.
Some pages are aliased (i.e. redirected) to others, e.g. /about/staff/contact redirects to /about/staff.
But the "jobs" link on that page uses the URL /about/jobs.

Am I doing something wrong, or is this a bug?
Jim
#2 Posted : Thursday, October 22, 2015 4:48:20 PM
Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,667
Location: Canada
Hi, I believe that when I've seen this in the past it's been due to how the link URLs were interpreted and how 404s are handled. The crawler can occasionally generate incorrect URLs to try, and if the server/app returns a valid page for one of them instead of a 404, the crawler thinks the generated URL was correct. Then, because that page's content is the same as a previous page's, it generates another incorrect link, which the server again treats as OK, and the whole thing repeats...

http://testserver/about/staff/contact/jobs/how-to-apply-for-positions
begets
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/how-to-apply-for-positions
and so on
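
To illustrate the loop, here is a minimal sketch, not the crawler's actual code. It assumes the server answers every generated URL with the same page (never a 404) and that the crawler treats each URL as a directory when resolving the relative link. The function name `crawl_without_404` is made up for this example.

```python
from urllib.parse import urljoin

# Relative link found on the page (value taken from this thread).
RELATIVE_LINK = "jobs/how-to-apply-for-positions"

def crawl_without_404(start_url, rounds):
    """Follow the same relative link repeatedly, assuming the server
    returns the same valid page for every generated URL."""
    visited = [start_url]
    url = start_url
    for _ in range(rounds):
        # Assumption: the current URL is treated as a directory,
        # so the relative link is appended rather than replacing
        # the last path segment.
        url = urljoin(url.rstrip("/") + "/", RELATIVE_LINK)
        visited.append(url)
    return visited

urls = crawl_without_404("http://testserver/about/staff/contact", 2)
for u in urls:
    print(u)
```

Each round appends the relative link again, so the URLs keep growing until something (a 404 or a depth limit) stops the crawl.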

Which version are you using? Can you post/attach the browser view source for http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/
and http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/ ?

Jim
-your feedback is helpful to other users, thank you!


David
#3 Posted : Friday, October 23, 2015 12:12:56 AM
I'm using v6.0.

The way the CMS's aliasing works, any child of a redirection is ignored. So because we have an alias from /about/staff/contact to /about/staff, then /about/staff/contact/anything/anything/anything/anything is redirected to /about/staff. So, it will give a valid page.
But the Jobs page is /about/jobs, not ../jobs, so I don't see how it gets that one confused.
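
The aliasing rule described above can be modelled roughly as follows. This is only a sketch of the behaviour as described in this post, not the CMS's real implementation; `ALIASES` and `resolve_alias` are hypothetical names, and the table holds just the one alias mentioned here.

```python
# Hypothetical alias table with the single alias from this thread.
ALIASES = {"/about/staff/contact": "/about/staff"}

def resolve_alias(path):
    """Return the redirect target if `path` is an alias or any child
    of an alias; otherwise return `path` unchanged."""
    for alias, target in ALIASES.items():
        # Match the alias itself or anything below it, but not a
        # sibling that merely shares the prefix (e.g. ".../contacts").
        if path == alias or path.startswith(alias + "/"):
            return target
    return path

print(resolve_alias("/about/staff/contact/anything/anything"))
print(resolve_alias("/about/jobs"))
```

Under this rule every path below the alias gets a valid page, so a crawler never sees a 404 for the bogus URLs.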

Surely the crawler should interpret the URLs more like an actual browser would - following what the server says your current URL is, rather than what you "think" it should be.
Jim
#4 Posted : Friday, October 23, 2015 3:50:53 AM
Quote:
Surely the crawler should interpret the URLs more like an actual browser would - following what the server says your current URL is, rather than what you "think" it should be.


I agree, that's why it would be great to see the page, so I can figure out what's happening.

E.g., if you visit /about/staff/contact, is there a link like href="jobs"? And then on the jobs page, is there a link like href="how-to-apply-for-positions"?

If for some reason it didn't detect that it was redirected from /about/staff/contact to /about/staff, then any link like href="jobs" is going to lead to the creation of a URL such as /about/staff/contact/jobs, instead of what I think you want, which is /about/staff/jobs.

Does that jibe with what you're seeing?

Jim
-your feedback is helpful to other users, thank you!


David
#5 Posted : Friday, October 23, 2015 5:22:52 AM
Ah, yes, you're right.
The links are href="jobs", href="jobs/how-to-apply-for-positions".
So it's assuming a base directory of "/about/".
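
The effect of the base URL can be checked with Python's standard `urljoin`, which follows the usual relative-reference rules. This is a sketch only: the trailing slash on the requested URL is an assumption about how the crawler treated it, not something confirmed in this thread.

```python
from urllib.parse import urljoin

# URL the crawler asked for (trailing slash is an assumption).
requested = "http://testserver/about/staff/contact/"
# URL the server redirected to.
final = "http://testserver/about/staff"

href = "jobs"

# Resolving against the requested URL ignores the redirect.
wrong = urljoin(requested, href)
# Resolving against the final URL is what a browser would do.
right = urljoin(final, href)

print(wrong)  # http://testserver/about/staff/contact/jobs
print(right)  # http://testserver/about/jobs
```

Resolved against the post-redirect URL, href="jobs" lands on /about/jobs, which is the real Jobs page.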

Since I can't change that menu (it's generated by the CMS), I'll try limiting crawl depth to 2.
The homepage has the full site menu, so this should be enough to fetch all genuine URLs without any bogus ones.
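
For anyone wondering why a depth limit helps, here is a generic breadth-first sketch. It is not SearchUnit's implementation, and how SearchUnit counts depth may differ; `crawl` and `get_links` are hypothetical names, and the site map in the usage example is made up.

```python
from collections import deque

def crawl(start, get_links, max_depth):
    """Breadth-first crawl that does not follow links found on pages
    at `max_depth`. `get_links` is a hypothetical callable mapping a
    URL to the list of absolute URLs linked from that page."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # page is kept, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy site: the homepage links to the genuine pages; the bogus URL
# is only reachable through a deeper page.
site = {
    "/": ["/about/staff", "/about/jobs"],
    "/about/staff": ["/about/staff/contact/jobs"],
}
print(sorted(crawl("/", lambda u: site.get(u, []), 1)))
```

Because the genuine pages are all linked from the homepage, a shallow limit reaches them while the generated URLs, which only appear deeper, are never queued.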

Thanks for your suggestions.

Copyright © 2002- Keyoti Inc.