SearchUnit Documentation: How do I...: ...programmatically import

Importing An Entire Source

'Importing' a website, file-system folder, database, or DataSet means that the indexer scans the source for all available documents, pages, or data and indexes everything that matches the import criteria. Reimporting causes the indexer to rescan the source for changes where possible; otherwise it reindexes everything. To import programmatically, use the appropriate Import method in DocumentIndex:


DocumentIndex documentIndex = new DocumentIndex(configuration);
//import a website
documentIndex.ImportWebsite( startURL );
//or like this
documentIndex.Import(new WebsiteBasedIndexableSourceRecord( startURL, pathMatchesToBeIgnored, pathMatchesToBeIncluded));

//or import a file system folder
documentIndex.ImportFileSystemFolder(localFolderPath, virtualPath, targetMatchList, ignoreMatchList, recurseSubFolders);

//or import a database
documentIndex.ImportDatabase(sourceType, connectionString, sqlQuery, uniqueColumnName, resultUrlFormat);

//or import a DataSet (from an assembly)
documentIndex.ImportCustomDataSet(assemblyFilePath, fullClassName, uniqueColumnName, resultUrlFormat);

documentIndex.Close();

VB.NET


Dim documentIndex As New DocumentIndex(configuration)
'import a website
documentIndex.ImportWebsite( startURL )
'or like this
documentIndex.Import(New WebsiteBasedIndexableSourceRecord(startURL, pathMatchesToBeIgnored, pathMatchesToBeIncluded))

'or import a file system folder
documentIndex.ImportFileSystemFolder(localFolderPath, virtualPath, targetMatchList, ignoreMatchList, recurseSubFolders)
'or import a database
documentIndex.ImportDatabase(sourceType, connectionString, sqlQuery, uniqueColumnName, resultUrlFormat)
'or import a DataSet (from an assembly)
documentIndex.ImportCustomDataSet(assemblyFilePath, fullClassName, uniqueColumnName, resultUrlFormat)
documentIndex.Close()

Adding One Document

Instead of importing an entire source, you can add documents or data to the index incrementally. This is ideal for updating the index as documents are created or uploaded.

DocumentIndex documentIndex = new DocumentIndex(configuration);
try{
	documentIndex.AddDocument(new Document("http://some/URL/document", configuration));
} finally {
	documentIndex.Close();
}

VB.NET


Dim documentIndex As DocumentIndex = New DocumentIndex(configuration)
Try
	documentIndex.AddDocument(New Document("http://some/URL/document", configuration))
Finally 
	documentIndex.Close()
End Try

Note that AddDocument may take a non-trivial amount of time to complete. The actual time depends on many factors, including machine load, document size and type, index size, and whether the index is due for optimization. It is therefore not advisable to call it directly in web applications, because the web page doing the indexing will not return to the user until AddDocument has finished.

Asynchronous Adding (.NET 2 up)

Adding to the index asynchronously allows your code to return immediately (e.g. so that a web application's document upload page returns immediately), while the document is queued to be added to the index as soon as possible in the background. To do this, use the AsynchronousQueue class (in namespace Keyoti.SearchEngine.Index), which queues AddDocument operations and calls them in their original order. AsynchronousQueue uses its own instance of DocumentIndex, and creates and closes that instance as necessary. It is therefore important not to have another instance of DocumentIndex open on the same index directory while there are items in the queue.


//...this code could be called in a button event handler in a web page for example

EventHandler finished = delegate(object sender, EventArgs e)
{
	//at this point the index directory is unlocked and there are no more items pending adding to the index.
};

AsynchronousQueue.QueueForIndexing(new Document("http://someURL/somepage.aspx", Configuration), finished);
AsynchronousQueue.QueueForIndexing(new Document("http://someURL/somepage2.aspx", Configuration), finished);

VB.NET


Private Sub MyFunc()
	'...this code could be called in a button event handler in a web page for example
	Dim finished As EventHandler = AddressOf Me.OnFinished
	AsynchronousQueue.QueueForIndexing(New Document("http://someURL/somepage.aspx", Configuration), finished)
	AsynchronousQueue.QueueForIndexing(New Document("http://someURL/somepage2.aspx", Configuration), finished)
End Sub

Private Sub OnFinished(ByVal sender As Object, ByVal e As EventArgs)
	'at this point the index directory is unlocked and there are no more items pending adding to the index.
End Sub

Removing One Document

Use the RemoveDocument method in DocumentIndex to remove a document from the index. It's important that the document URL matches exactly with the URL already in the index. Please pay attention to trailing slashes (e.g. http://localhost/) and ensure any spaces are encoded as %20.
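The exact-match requirement above can be easier to meet if URLs are normalized the same way every time before they are passed to RemoveDocument. The following is a minimal sketch only: IndexUrl and Normalize are hypothetical names, not part of SearchUnit, and it assumes that the canonical form produced by .NET's System.Uri (escaped, with a trailing slash on a bare host) is the form stored in your index. Check that assumption against a URL actually in your index.

```csharp
using System;

static class IndexUrl
{
    // Hypothetical helper (not part of SearchUnit): returns the canonical,
    // escaped absolute form of a URL as produced by System.Uri.
    public static string Normalize(string url)
    {
        return new Uri(url).AbsoluteUri;
    }
}

static class Program
{
    static void Main()
    {
        // A bare host gains its trailing slash.
        Console.WriteLine(IndexUrl.Normalize("http://localhost"));
        // prints "http://localhost/"

        // Spaces in the path are encoded as %20.
        Console.WriteLine(IndexUrl.Normalize("http://localhost/my docs/a b.pdf"));
        // prints "http://localhost/my%20docs/a%20b.pdf"

        // Already-encoded input is left as it is (no double encoding).
        Console.WriteLine(IndexUrl.Normalize("http://localhost/a%20b.pdf"));
        // prints "http://localhost/a%20b.pdf"
    }
}
```

The normalized string could then be used in the same way as the database example on this page, e.g. documentIndex.RemoveDocument(new Document(IndexUrl.Normalize(url), Configuration)). Note that System.Uri only adds a trailing slash to a bare host; it will not add one to a folder path such as http://localhost/docs.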

Asynchronous Remove (.NET 2 up)

Removing from the index asynchronously allows your code to return immediately (e.g. so that a web application's delete document page returns immediately), while the document is queued to be removed from the index as soon as possible in the background. To do this, use the AsynchronousQueue class (in namespace Keyoti.SearchEngine.Index), which queues RemoveDocument operations and calls them in their original order. AsynchronousQueue uses its own instance of DocumentIndex, and creates and closes that instance as necessary. It is therefore important not to have another instance of DocumentIndex open on the same index directory while there are items in the queue.

This is the same queue that the asynchronous adding example uses, so add and remove operations can be mixed.


//...this code could be called in a button event handler in a web page for example

EventHandler finished = delegate(object sender, EventArgs e)
{
	//at this point the index directory is unlocked and there are no more items pending in the queue.
};

AsynchronousQueue.QueueForRemoval(new Document("http://someURL/somepage.aspx", Configuration), finished);
AsynchronousQueue.QueueForRemoval(new Document("http://someURL/somepage2.aspx", Configuration), finished);

VB.NET


Private Sub MyFunc()
	'...this code could be called in a button event handler in a web page for example
	Dim finished As EventHandler = AddressOf Me.OnFinished
	AsynchronousQueue.QueueForRemoval(New Document("http://someURL/somepage.aspx", Configuration), finished)
	AsynchronousQueue.QueueForRemoval(New Document("http://someURL/somepage2.aspx", Configuration), finished)
End Sub

Private Sub OnFinished(ByVal sender As Object, ByVal e As EventArgs)
	'at this point the index directory is unlocked and there are no more items pending in the queue.
End Sub

Removing a 'document' that originated in a DB

When a row is imported from a DB, we create our own URI for it. To delete that row/document, you need to recreate the URI.


IndexableSourceUri uri = new IndexableSourceUri(1, "d4", "col1");
//where 1 is the IndexableSource ID (see below)
//"d4" is the value in the unique field, that identifies the row to delete
//"col1" is the name of the unique field

documentIndex.RemoveDocument(new Document(uri.UriInstance.AbsoluteUri, Configuration));


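The IndexableSource ID can be read from the records returned by GetIndexableSourceRecords, for example for the first source:

ArrayList recs = documentIndex.GetIndexableSourceRecords();
//use the ID of the first source record in place of the literal 1 above
IndexableSourceUri firstSourceUri = new IndexableSourceUri((recs[0] as IndexableSourceRecord).ID, "d4", "col1");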

Adding Data Directly As Strings

It is possible to add 'documents' to the index that are defined by strings only. In other words, data can be indexed without having to reside in a document, page, database, etc.

To do this, use the PreloadedDocument class, a simple class to which you pass the 'URI' that will identify the indexed data/document, along with its title, text, and custom data, all as strings.


documentIndex.AddDocument(new PreloadedDocument(new Uri(uri), title, text, summary, null, null, null, customData, configuration));

VB.NET


documentIndex.AddDocument(new PreloadedDocument(new Uri(uri), title, text, summary, Nothing, Nothing, Nothing, customData, configuration))

Where:
- 'uri' is the real or fictitious Uri of the 'document'; it can point to an actual document or just serve as an arbitrary identifier for the indexed data
- 'title' is the string title of the document, searchable by the user
- 'text' is the text body, also searchable by the user
- 'summary' is used for the result summary if a 'static' summary type is selected in the configuration (otherwise the result summary is generated from the text content based on hits)
- The three null/Nothing arguments are, respectively, the content category list, the location category name, and the security group list (please see the API docs)
- 'customData' is any CustomData to be added to the document record
- 'configuration' is the usual configuration object, as was used to create DocumentIndex

Removing A PreloadedDocument

To remove a 'document' added with PreloadedDocument, use documentIndex.RemoveDocument, passing in the same Uri that the document was created with.
