Customizing/adding custom document parsers (Office PIA Word document parser example)
https://keyoti.com/kb/Default.aspx?ToDo=view&questId=99&catId=54

It is fairly simple to customize the DocumentIndex class (which runs index builds) by subclassing it, so that it uses custom code to parse certain document types.  This example shows how to use a custom MS Word document parser based on the Office Primary Interop Assemblies (automation).  Using this code in production would require Office to be installed on the machine that builds the indexes; more generally, though, the example describes the process for customizing any parser.
 
The attached project (which requires v1.3+) contains all the necessary code, along with instructions for obtaining the Office PIAs. It was prepared for Office 2003; if you need help with Office XP, please see the notes in the download or email support@keyoti.com.
 
The new Word document parser is contained in the class CustomParser. It is fairly standard automation code, plus our code for separating strings into words.
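
As a rough illustration of the word-separation step only, the sketch below splits extracted text into words on any character that is not a letter or digit. It is not the code from the attached project; the class name WordSplitter and the method SplitIntoWords are hypothetical, and the splitting rule is an assumption chosen for simplicity.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, NOT part of the Keyoti library or the attached project.
public static class WordSplitter
{
    // Splits text into words, treating every non-letter, non-digit
    // character (spaces, punctuation, line breaks) as a separator.
    public static List<string> SplitIntoWords(string text)
    {
        List<string> words = new List<string>();
        StringBuilder current = new StringBuilder();

        foreach (char c in text ?? string.Empty)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A separator ends the word collected so far.
                words.Add(current.ToString());
                current.Length = 0;
            }
        }

        // Flush the final word if the text did not end with a separator.
        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }

        return words;
    }
}
```

Calling WordSplitter.SplitIntoWords("Hello, world") yields the two words "Hello" and "world". Note that this simple rule also splits on apostrophes and hyphens, which a production parser may want to handle differently.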
 
To use this class, it is first necessary to subclass DocumentIndex:
 

public class CustomDocumentIndex : DocumentIndex {
    // Path of the temporary file that is passed on to the custom parser.
    string tempFilePath;

    public CustomDocumentIndex(string tempFilePath) { this.tempFilePath = tempFilePath; }

    // Called for each Document that the indexer creates during indexing.
    protected override Document CreateNewDocument(DocumentRecord documentRecord) {
        Document document = new Document(documentRecord);
        // Swap in our own parser provider.
        document.ParserProvider = new CustomParserProvider(tempFilePath);
        return document;
    }
}

 

 

This gives each Document that the indexer creates during indexing a new CustomParserProvider object. CustomParserProvider is where we override the Word document parser:

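public override Parser GetParser(string mimeType) {
    if (mimeType == null) return null;

    // Use the custom parser for Word documents only.
    if (mimeType == "application/msword")
        return new CustomParser(tempFilePath);

    // All other document types fall back to the base implementation.
    return base.GetParser(mimeType);
}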
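
For context, the override above sits inside a provider class that also needs to hold the temp file path. The following is a self-contained sketch of that dispatch pattern. The Parser and ParserProvider types here are minimal stand-ins written for this illustration, as are DefaultWordParser and DefaultHtmlParser; they are not the real Keyoti classes, and the actual base class name and default behaviour should be taken from the attached project.

```csharp
using System;

// Stand-in types (assumptions) so the sketch compiles without the library.
public abstract class Parser { }
public class DefaultWordParser : Parser { }
public class DefaultHtmlParser : Parser { }

public class ParserProvider
{
    // Hypothetical default behaviour: choose a built-in parser by MIME type.
    public virtual Parser GetParser(string mimeType)
    {
        if (mimeType == null) return null;
        if (mimeType == "application/msword") return new DefaultWordParser();
        if (mimeType == "text/html") return new DefaultHtmlParser();
        return null;
    }
}

// Stand-in for the custom Word parser; it only records the temp file path.
public class CustomParser : Parser
{
    public readonly string TempFilePath;
    public CustomParser(string tempFilePath) { TempFilePath = tempFilePath; }
}

public class CustomParserProvider : ParserProvider
{
    private readonly string tempFilePath;

    public CustomParserProvider(string tempFilePath) { this.tempFilePath = tempFilePath; }

    public override Parser GetParser(string mimeType)
    {
        if (mimeType == null) return null;

        // Only Word documents are routed to the custom parser.
        if (mimeType == "application/msword")
            return new CustomParser(tempFilePath);

        // Everything else is delegated to the base provider.
        return base.GetParser(mimeType);
    }
}

public static class Demo
{
    public static void Main()
    {
        ParserProvider provider = new CustomParserProvider("temp.doc");
        Console.WriteLine(provider.GetParser("application/msword").GetType().Name); // CustomParser
        Console.WriteLine(provider.GetParser("text/html").GetType().Name);          // DefaultHtmlParser
    }
}
```

Because only one MIME type is intercepted and everything else is delegated to the base class, the other document types keep whatever parsing the library provides by default.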

 

 

The build can then be performed programmatically using the CustomDocumentIndex class:

Keyoti.SearchEngine.Configuration.xmlLocation = Application.StartupPath + "\\..\\..\\IndexDirectory";

// Use our customized DocumentIndex subclass.
CustomDocumentIndex docIdx = new CustomDocumentIndex(Application.StartupPath + "\\..\\..\\TempDir\\temp.doc");

// Use as normal: add docs if necessary, then build.
docIdx.Open();
docIdx.Build();
docIdx.Close();

 

That's it: the index is now built and can be searched as usual by pointing the SearchResult.IndexDirectory property at this index directory.

