Index crawler giving wrong results - SearchUnit

David

#1 Posted : Thursday, October 22, 2015 5:30:40 AM

Rank: Member

Groups: Registered

Joined: 10/22/2015
Posts: 12
Location: Adelaide

It seems the web crawler is getting confused by redirects.

I'm seeing multiple results similar to the following:
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/how-to-apply-for-positions
which, if you try to view it, the server redirects to:
http://testserver/about/staff

Note: we are using a CMS.
Some pages are aliased (i.e. redirected) to others, e.g. /about/staff/contact redirects to /about/staff.
But the "jobs" link on that page uses the URL /about/jobs.

Am I doing something wrong, or is this a bug?

Jim

#2 Posted : Thursday, October 22, 2015 4:48:20 PM

Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,669
Location: Canada

Hi, I believe when I've seen this in the past it's been due to how the link URLs were interpreted and how 404s are handled. The crawler can occasionally generate incorrect URLs to try, and if the server/app returns a valid page for one of them instead of a 404, the crawler treats the generated URL as correct. Then, because that page's content is the same as a previous page's, it generates another incorrect link, which the server again treats as OK, and the whole thing repeats...

http://testserver/about/staff/contact/jobs/how-to-apply-for-positions
begets
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/how-to-apply-for-positions
and so on
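
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SearchUnit's actual code: it assumes the server answers every URL under the alias with the same page (so the same relative link keeps turning up), and that the crawler treats each page URL as a directory when resolving links. `RELATIVE_LINK` and `naive_crawl` are hypothetical names.

```python
from urllib.parse import urljoin

# Assumption: every "new" page is really the same page, so it always
# contains this same relative link.
RELATIVE_LINK = "jobs/how-to-apply-for-positions"


def naive_crawl(start_url, rounds):
    """Resolve the same relative link over and over, treating each URL as a directory."""
    found = []
    url = start_url
    for _ in range(rounds):
        # Forcing a trailing slash makes urljoin append the link to the
        # full path instead of replacing the last path segment.
        url = urljoin(url.rstrip("/") + "/", RELATIVE_LINK)
        found.append(url)
    return found


urls = naive_crawl("http://testserver/about/staff/contact", 3)
for u in urls:
    print(u)
```

Each round produces a longer URL than the last, and nothing stops the chain as long as the server keeps answering with a valid page.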

Which version are you using? Can you post/attach the browser view source for these two URLs?
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/
http://testserver/about/staff/contact/jobs/how-to-apply-for-positions/jobs/

Jim

-your feedback is helpful to other users, thank you!

David

#3 Posted : Friday, October 23, 2015 12:12:56 AM

I'm using v6.0.

The way the CMS's aliasing works, any child of a redirection is ignored. So because we have an alias from /about/staff/contact to /about/staff, then /about/staff/contact/anything/anything/anything/anything is redirected to /about/staff. So, it will give a valid page.
But the Jobs page is /about/jobs, not ../jobs, so I don't see how it gets that one confused.

Surely the crawler should interpret the URLs more like an actual browser would - following what the server says your current URL is, rather than what you "think" it should be.
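
As a rough sketch of "following what the server says your current URL is": relative links get resolved against the URL the server finally served, not the one that was requested. The alias rule below is a hypothetical stand-in for the CMS redirect, and the relative hrefs are assumed for illustration; in a real fetch the final URL would come from the HTTP response (for example, `geturl()` on the object returned by `urllib.request.urlopen`).

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical model of the CMS alias: this path and anything below it
# is redirected to ALIAS_TARGET.
ALIAS_PREFIX = "/about/staff/contact"
ALIAS_TARGET = "/about/staff"


def final_url(requested):
    """Stand-in for the HTTP redirect: return the URL the server ends up serving."""
    parts = urlsplit(requested)
    if parts.path == ALIAS_PREFIX or parts.path.startswith(ALIAS_PREFIX + "/"):
        return f"{parts.scheme}://{parts.netloc}{ALIAS_TARGET}"
    return requested


def resolve_links(requested, hrefs):
    """Resolve relative hrefs against the post-redirect URL, as a browser would."""
    base = final_url(requested)
    return [urljoin(base, href) for href in hrefs]


links = resolve_links(
    "http://testserver/about/staff/contact/jobs/how-to-apply-for-positions",
    ["jobs", "jobs/how-to-apply-for-positions"],  # assumed relative links
)
for link in links:
    print(link)
```

With the redirect taken into account, both links land under /about/jobs no matter how long the requested URL was.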

Jim

#4 Posted : Friday, October 23, 2015 3:50:53 AM

Quote:

Surely the crawler should interpret the URLs more like an actual browser would - following what the server says your current URL is, rather than what you "think" it should be.

I agree, that's why it would be great to see the page, so I can figure out what's happening.

E.g., if you visit /about/staff/contact, is there a link like href="jobs"? And then on the jobs page, is there a link like href="how-to-apply-for-positions"?

If for some reason it didn't detect that it was redirected from /about/staff/contact to /about/staff, then any link like href="jobs" is going to lead to the creation of a URL like /about/staff/contact/jobs, instead of what I think you want, which is /about/staff/jobs.
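
For reference, here is how standard relative-URL resolution treats href="jobs" against the different possible base URLs, using Python's `urllib.parse.urljoin`. This only illustrates the general rule; how the crawler itself picks its base URL is not confirmed here.

```python
from urllib.parse import urljoin

HREF = "jobs"

# The same relative link resolved against four candidate base URLs.
bases = [
    "http://testserver/about/staff",
    "http://testserver/about/staff/",
    "http://testserver/about/staff/contact",
    "http://testserver/about/staff/contact/",
]

resolved = {base: urljoin(base, HREF) for base in bases}
for base, url in resolved.items():
    print(f"{base:42} -> {url}")
```

When the base ends in a slash, the link is appended to it, which gives the two URLs mentioned above. Without the trailing slash, the last path segment is replaced instead, so the result depends on whether the base is treated as a page or as a directory.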

Does that jibe with what you're seeing?

Jim

David

#5 Posted : Friday, October 23, 2015 5:22:52 AM

Ah, yes, you're right.
The links are href="jobs", href="jobs/how-to-apply-for-positions".
So it's assuming a base directory of "/about/".

Since I can't change that menu (it's generated by the CMS), I'll try limiting crawl depth to 2.
The homepage has the full site menu, so this should be enough to fetch all genuine URLs without any bogus ones.
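
To show what a crawl-depth limit does in general, here is a minimal breadth-first sketch in Python. It is not how SearchUnit implements its depth setting: the link graph is a made-up toy site, and "depth" is assumed to mean the number of link hops from the homepage.

```python
from collections import deque

# Hypothetical toy site: the homepage carries the full menu, and the
# bogus URLs only appear a few hops further down.
LINKS = {
    "/": ["/about/staff", "/about/jobs", "/about/jobs/how-to-apply-for-positions"],
    "/about/staff": ["/about/staff/contact"],
    "/about/staff/contact": ["/about/staff/contact/jobs"],
    "/about/staff/contact/jobs": ["/about/staff/contact/jobs/jobs"],
}


def crawl(start, get_links, max_depth):
    """Breadth-first crawl that never follows links from pages at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # page is recorded, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


pages = crawl("/", lambda url: LINKS.get(url, []), max_depth=2)
for page in sorted(pages):
    print(page)
```

In this toy graph a limit of 2 reaches every menu page and stops before the chain of generated URLs begins. Whether the same holds on a real site depends on how deep the genuine pages sit.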

Thanks for your suggestions.
