Duplicate documents - SearchUnit

DMacy

#1 Posted : Monday, June 19, 2017 11:14:35 PM

Rank: Advanced Member

Groups: Registered

Joined: 9/1/2010
Posts: 136

We are indexing approximately 80,000 documents, and we know that the collection contains some duplicates. By duplicate, we don't usually mean two documents whose contents are completely identical; one document might have very minor edits, or even just punctuation differences, compared with another. Are there any features in SearchUnit that would help identify two documents that we would potentially call duplicates?

Jim

#2 Posted : Tuesday, June 20, 2017 1:39:57 AM

Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,669
Location: Canada

I can suggest looking at https://keyoti.com/produ...c-7b6a-5fdc20bd48d3.htm

For example:

Code:

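' Assumes license and configuration are already set up for your index
Dim sa = New SearchAgent(license, configuration)
' Fetch the indexed plain text for one document URL
Dim text = sa.GetDocumentText("http://host/somedoc.doc")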

Sorry my VB.NET is rusty, but you get the idea.

You can get a list of indexed URLs from DocumentIndex.GetIndexedDocuments.

https://keyoti.com/produ...b-7812-76514d42a813.htm

GetDocumentText will give you the raw plain text that has been indexed. I don't really know how you would determine similarity from there, but this might be a start: https://www.reddit.com/r...da_document_similarity/
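
As one possible starting point for the similarity step, here is a minimal sketch in Python. It is not a SearchUnit feature: it assumes you have already exported the indexed plain text into a dictionary keyed by URL. The function names (normalize, shingles, jaccard, near_duplicates), the 3-word shingle size, and the 0.8 threshold are all illustrative choices. It strips punctuation and case, breaks each document into overlapping word sequences ("shingles"), and reports pairs whose shingle sets overlap heavily (Jaccard similarity).

```python
import re
from itertools import combinations


def normalize(text):
    """Lowercase the text, drop punctuation, and split it into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def shingles(words, k=3):
    """Return the set of overlapping k-word sequences in a word list."""
    if len(words) < k:
        # Short documents become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=3):
    """Return (url1, url2, score) for every pair scoring >= threshold.

    docs is a hypothetical {url: plain_text} mapping built from the
    exported index text.
    """
    sets = {url: shingles(normalize(text), k) for url, text in docs.items()}
    pairs = []
    for (u1, s1), (u2, s2) in combinations(sets.items(), 2):
        score = jaccard(s1, s2)
        if score >= threshold:
            pairs.append((u1, u2, score))
    return pairs
```

Because punctuation is removed before comparison, two documents that differ only in punctuation score 1.0. Note that this compares every pair, which for 80,000 documents is roughly 3.2 billion comparisons; at that scale you would probably want an approximate technique such as MinHash with locality-sensitive hashing to narrow down candidate pairs first.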

Jim
