Duplicate documents - SearchUnit - Forum

DMacy
#1 Posted : Monday, June 19, 2017 11:14:35 PM
We're indexing approximately 80,000 documents, and we know that the collection contains some duplicates. By "duplicate" we usually don't mean two documents whose contents are completely identical. More often, one document has had some very minor edits, or even just punctuation changes, compared with another. Are there any features in SearchUnit that would help identify pairs of documents that we would potentially call duplicates?
Jim
#2 Posted : Tuesday, June 20, 2017 1:39:57 AM
I can suggest looking at https://keyoti.com/produ...c-7b6a-5fdc20bd48d3.htm

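E.g.
Code:

' Create a SearchAgent from your license and configuration
Dim sa = New SearchAgent(license, configuration)
' Fetch the plain text that was indexed for this URL
Dim text = sa.GetDocumentText("http://host/somedoc.doc")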


Sorry my VB.NET is rusty, but you get the idea.

You can get a list of indexed URLs from DocumentIndex.GetIndexedDocuments.

https://keyoti.com/produ...b-7812-76514d42a813.htm

GetDocumentText will give you the raw plain text that has been indexed. How you determine similarity from there I don't really know, but this might be a start: https://www.reddit.com/r...da_document_similarity/
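For what it's worth, one common starting point for "almost the same" text is word shingling with Jaccard similarity. Below is a minimal sketch of that idea, not anything built into SearchUnit. It is in Python purely for illustration and assumes you have already exported the plain text of each document (for example via GetDocumentText) into a dictionary keyed by URL. The function names, the 3-word shingle size, and the 0.8 threshold are all assumptions you would need to tune.

```python
import re
from itertools import combinations


def normalize(text):
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def shingles(words, k=3):
    """Return the set of overlapping k-word sequences in a word list."""
    if len(words) <= k:
        # Texts shorter than k words become a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(docs, threshold=0.8, k=3):
    """Compare every pair of documents and return (url1, url2, score)
    for pairs whose shingle sets are at least `threshold` similar.

    `docs` is a hypothetical {url: plain_text} dictionary.
    """
    sets = {url: shingles(normalize(text), k) for url, text in docs.items()}
    pairs = []
    for u, v in combinations(sorted(sets), 2):
        score = jaccard(sets[u], sets[v])
        if score >= threshold:
            pairs.append((u, v, score))
    return pairs


if __name__ == "__main__":
    sample = {
        "http://host/a.doc": "The quick brown fox, jumps over the lazy dog!",
        "http://host/b.doc": "the quick brown fox jumps over the lazy dog",
        "http://host/c.doc": "An entirely different piece of text about search indexes.",
    }
    for u, v, score in near_duplicates(sample):
        print(u, v, round(score, 2))
```

Because punctuation and case are removed before comparison, documents that differ only in punctuation score 1.0. Small edits lower the score in proportion to how many shingles they touch, so they matter less in long documents than in short ones. Note that comparing every pair does not scale well: 80,000 documents is roughly 3.2 billion pairs, so for a collection that size you would probably want something like MinHash with locality-sensitive hashing to narrow down candidate pairs first.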


Jim
-your feedback is helpful to other users, thank you!



Copyright © 2002- Keyoti Inc.