How can I programmatically detect invalid document URLs in the index and remove them?
https://keyoti.com/kb/Default.aspx?ToDo=view&questId=92&catId=66

The TestURI method in the Reader class provides a simple way to test a URI. It returns (rather than throws) any WebException that occurred while connecting to the URL, and returns null if the connection succeeded. The following code checks each document in the index and removes those with invalid URLs. Note that removing a document also requires removing all of its references (indexed words), so this process can take a non-trivial amount of time (i.e. it may not be appropriate to run synchronously in a web page request).
 
 
C#

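using System.Collections;   // ArrayList
using Keyoti.SearchEngine;
using Keyoti.SearchEngine.Documents;
using Keyoti.SearchEngine.Index;

.....

// Path to the index directory (adjust to your application).
Configuration.xmlLocation = Request.MapPath("IndexDirectory");

DocumentIndex docInd = new DocumentIndex();
docInd.Open();
try
{
    ArrayList docs = docInd.GetIndexedDocuments();

    foreach (Document d in docs)
    {
        // TestURI returns null on success, or the WebException that occurred.
        bool isValidURI = Keyoti.SearchEngine.Documents.Reader.TestURI(d.URI) == null;

        if (!isValidURI)
            docInd.RemoveDocument(d.URI.ToString());
    }
}
finally
{
    // Close the index even if a removal fails part-way through.
    docInd.Close();
}
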

VB.NET

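Imports System.Collections   ' ArrayList
Imports Keyoti.SearchEngine
Imports Keyoti.SearchEngine.Documents
Imports Keyoti.SearchEngine.Index

....

' Path to the index directory (adjust to your application).
Configuration.xmlLocation = Request.MapPath("IndexDirectory")

Dim docInd As New DocumentIndex()
docInd.Open()
Try
    Dim docs As ArrayList = docInd.GetIndexedDocuments()

    Dim d As Document
    For Each d In docs
        ' TestURI returns Nothing on success, or the WebException that occurred.
        Dim isValidURI As Boolean = Keyoti.SearchEngine.Documents.Reader.TestURI(d.URI) Is Nothing

        If Not isValidURI Then
            docInd.RemoveDocument(d.URI.ToString())
        End If
    Next d
Finally
    ' Close the index even if a removal fails part-way through.
    docInd.Close()
End Try
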

[The isValidURI variable is declared inside the loop simply to make the result of the TestURI call easier to read.]
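
Because TestURI returns the exception instead of throwing it, the caller can inspect it before deciding whether to remove a document. The helper below is only a sketch and is not part of SearchUnit: it assumes the returned object is a standard System.Net.WebException, and the names UriFailureClassifier and IsPermanentlyMissing are hypothetical. It treats only HTTP 404 and 410 responses as permanent, so a timeout or DNS failure during the check does not cause a document to be removed.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the SearchUnit API.
public static class UriFailureClassifier
{
    // 'error' is whatever the URI test returned; null means the URI was reachable.
    // Returns true only when the server replied with HTTP 404 (Not Found) or 410 (Gone).
    public static bool IsPermanentlyMissing(Exception error)
    {
        WebException webError = error as WebException;
        if (webError == null)
            return false; // no error, or an error type this sketch does not classify

        // ProtocolError means the server answered, but with an HTTP error status.
        // Other statuses (timeout, name resolution failure, ...) may be temporary.
        if (webError.Status != WebExceptionStatus.ProtocolError)
            return false;

        HttpWebResponse response = webError.Response as HttpWebResponse;
        if (response == null)
            return false;

        return response.StatusCode == HttpStatusCode.NotFound
            || response.StatusCode == HttpStatusCode.Gone;
    }
}
```

To use it in the loop, the null check could be replaced with UriFailureClassifier.IsPermanentlyMissing(Keyoti.SearchEngine.Documents.Reader.TestURI(d.URI)), removing the document only when it returns true. This is stricter than the original code, which removes a document on any connection error, so whether it suits your index depends on how reliable the target servers are.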

