Customizing/adding custom document parsers (IFilter based PDF document parser example) (C#, VB.NET)
https://keyoti.com/kb/Default.aspx?ToDo=view&questId=120&catId=54


If you prefer not to use the default proprietary PDF parser included in the product, this article will show you an alternative.

It is fairly simple to customize the PDF parser (through the central event system) so that custom code is used to parse certain document types.  This example shows how to use a custom Adobe PDF document parser based on the free Adobe PDF IFilter.  Using this code in production requires the IFilter to be installed on the machine that builds the indexes; more generally, the example illustrates the process for parser customization.

 

How to use

To obtain the Acrobat IFilter, download it from:

http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542

(Note: a 32-bit version is already installed with Acrobat Reader.)

Once the IFilter is installed, only the following steps should be needed:

 

1. Download http://keyoti.com/downloads/SE_Acrobat_IFilter_Plugin.zip

 

2. Unzip, build the project and copy the resulting SE_Acrobat_IFilter_Plugin.dll to your Index Directory.  Note that there is a prebuilt DLL in the project directory; unless it matches your version of our DLLs you should not use it.  Use the one you have built in bin\debug or bin\release instead.

 

3. Open the configuration (i.e. run the index manager tool and click the 'configuration' button), then:

 

a) Set EventHandlerAssemblyPath to SE_Acrobat_IFilter_Plugin.dll.

 

b) Check IgnoreLastModifiedDate and uncheck UseFileSizeToIdentifyChange (this ensures that your PDFs are reindexed regardless of whether they have changed).

 

4. Try reindexing; it should work.  If it does not, enable logging in the configuration, index again, and send all .txt files created in your index directory to support (or look in them for clues).

 

 

About the code

 

The new PDF document parser is contained in the class CustomParser.  It is fairly standard IFilter code plus our code for separating strings into words.
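
To give a feel for the "separating strings into words" step, here is a minimal sketch of one way text returned by an IFilter could be split into words.  It is not the code used in CustomParser; the class name WordSplitter and the rule "a word is any run of letters or digits" are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the plugin or the product API.
public static class WordSplitter
{
    // Splits raw text into words, treating each run of letters or digits as one word.
    public static List<string> SplitIntoWords(string text)
    {
        var words = new List<string>();
        if (string.IsNullOrEmpty(text)) return words;

        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A separator ends the word collected so far.
                words.Add(current.ToString());
                current.Clear();
            }
        }

        // Flush the last word if the text did not end with a separator.
        if (current.Length > 0) words.Add(current.ToString());
        return words;
    }

    public static void Main()
    {
        // Prints each word on its own line: Adobe, PDF, parsed, via, IFilter
        foreach (string word in SplitIntoWords("Adobe PDF, parsed via IFilter."))
            Console.WriteLine(word);
    }
}
```

A real parser would probably also need rules for hyphens, apostrophes and language-specific characters, which this sketch deliberately leaves out.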

To use this class, a CustomParserProvider is needed to create our instance of the CustomParser when working with PDF files.  This is done in ExternalEventHandler:

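public void dispatcher_NeedObject(object sender, NeedObjectEventArgs e)
{
    // Only step in when the engine asks for a parser provider.
    if (e.RequiredObject is ParserProvider)
    {
        // 'conf' is not declared in this excerpt; presumably it is a configuration field of the enclosing class.
        Keyoti.SearchEngine.DataAccess.Log.WriteLogEntry("AcrobatIFILTER", "Requires parser provider", conf);

        // Replace the default provider with the custom one, passing a temp.pdf path inside the index directory.
        e.RequiredObject = new NewWordParser.CustomParserProvider(Path.Combine(e.Configuration.IndexDirectory, "temp.pdf"), e.Configuration);
    }
}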
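
The handler above follows a common "need object" pattern: the engine offers a default object through the event arguments, and a handler may replace it with its own.  The sketch below shows that pattern in isolation.  Every type in it (DemoDispatcher, DemoNeedObjectEventArgs, IDemoParser and so on) is a stand-in invented for this illustration and is not one of the product's classes, so treat it as a rough model of the mechanism rather than a description of the actual event system.

```csharp
using System;

// All types below are illustrative stand-ins, NOT the product's classes.
public interface IDemoParser { string Describe(); }

public class DefaultDemoParser : IDemoParser
{
    public string Describe() { return "default parser"; }
}

public class CustomDemoParser : IDemoParser
{
    public string Describe() { return "custom parser"; }
}

public class DemoNeedObjectEventArgs : EventArgs
{
    // Handlers may overwrite this with their own instance.
    public object RequiredObject { get; set; }

    public DemoNeedObjectEventArgs(object defaultObject)
    {
        RequiredObject = defaultObject;
    }
}

public class DemoDispatcher
{
    public event EventHandler<DemoNeedObjectEventArgs> NeedObject;

    // Offers the default parser to subscribers, then returns whatever is left in the args.
    public IDemoParser GetParser()
    {
        var args = new DemoNeedObjectEventArgs(new DefaultDemoParser());
        var handler = NeedObject;
        if (handler != null) handler(this, args);
        return (IDemoParser)args.RequiredObject;
    }
}

public static class Demo
{
    public static void Main()
    {
        var dispatcher = new DemoDispatcher();

        // No handler attached yet, so the default object is used.
        Console.WriteLine(dispatcher.GetParser().Describe());

        // The handler swaps in a custom object, but only for the type it cares about.
        dispatcher.NeedObject += (sender, e) =>
        {
            if (e.RequiredObject is IDemoParser)
                e.RequiredObject = new CustomDemoParser();
        };
        Console.WriteLine(dispatcher.GetParser().Describe());
    }
}
```

Checking the type of RequiredObject before replacing it, as the real handler does with ParserProvider, keeps the handler from interfering with requests for other kinds of object.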

If you have questions or problems, please email support@keyoti.com.

