Importing

The importing operation is the process of finding document URLs and adding them to the index. There are a few different types of 'Indexable Source' that the engine can import documents from:

[For programmatic importing, please see the example in the examples section.]

Website

Imports documents/pages from a web-site by crawling through the links found. This works well with most web-sites, and in particular is the only way to import sites that are dynamic or that use session state and/or HTTP GET parameters to load content. The important thing to understand about the crawler is that it sees your web-site like a browser without Javascript capabilities - it preserves session, cookies and GET parameters, but it cannot perform ASP.NET postbacks (when a postback is required to reach another page, consider embedding a hidden link to the page, eg. <!-- <a href="otherpage.aspx"></a> -->).

Please see the 'How to Index a Web-Site' example.

Since each page on the web-site must be read through an HTTP request, the crawl process can take a considerable amount of time for very large sites (the time depends on network speed, machine speed and number of pages, but can be on the order of hours) - see File-system Document Store for a faster alternative.

Parameters

Start URL
The page at which crawling should start. This page must include links to other pages.

Path matches to be ignored (More Options)
List of strings which specify path matches to be ignored - if a string in this list matches the URL about to be crawled, the URL will be ignored. Eg. if this list contains "/dir1/" then the URL "http://localhost/dir1/file.aspx" will not be crawled.

Path matches to be included (More Options)
List of strings which specify path matches to be exclusively included - a URL will only be crawled if a string in this list matches it. Eg. if this list contains "/dir1/" then the URL "http://localhost/dir1/file.aspx" will be crawled but "http://localhost/file.aspx" will not.

Website importing does not work well if documents have no links pointing to them; for example, if there is a repository of PDF and Word documents that aren't directly linked, the crawler will not find them. In this case it is advisable to supplement the import with the "File-system Document Store" (below).

File-system Document Store

Imports documents from a local file path by recursively reading the files under a folder. This is useful when a set of documents/files exist under a web-site but aren't linked to (i.e. crawling fails to find them).

The local folder path should be set to a path which corresponds to a virtual folder (eg. c:\inetpub\wwwroot\mydocs) and the virtual folder path should be set to the URL that the local path corresponds to (eg. http://servername/mydocs). In this way any documents found under c:\inetpub\wwwroot\mydocs are automatically mapped to http://servername/mydocs. The local folder path can be relative to the index directory.

Since this process uses the local file system, it is generally much faster than crawling a web-site.

Parameters

Local folder path
Is the file path of the folder (eg. c:\ or a UNC share) - may be relative to the index directory, or absolute

Virtual folder path
Is the URL of the folder - this is used for search results so that the user can access the documents

Target match list
Is a list of strings that specify the documents that will be imported - this can be set to ".doc" and ".pdf" for example to only import Word and PDF files. To specify strings that must match at the end of the filename, use a $ character, eg. ".aspx$" will match "default.aspx" but not "default.aspx.cs"

Recurse sub-folders
Specifies whether to look in subdirectories as well

No recurse folder match list
Is a list of strings that specify which subfolders will not be looked inside - eg. "windows" would prevent the "c:\windows" directory from being scanned for files

Database (MSSQL, OLE, Oracle)

Imports rows from a database. By specifying the DB connection string and a SQL query, the engine can import rows from the database. The SQL query should specify the columns that should be imported. The import will merge all columns into one text string which will be indexed. At present the individual fields are not separately searchable.

Parameters

Connection string
Specify the connection string normally used by your applications to connect to the database, eg.
Standard Security:
1. 'Data Source=Your_Server_Name;Initial Catalog=Your_Database_Name;User ID=Your_Username;Password=Your_Password;'
2. 'Server=Your_Server_Name;Database=Your_Database_Name;User ID=Your_Username;Password=Your_Password;Trusted_Connection=False'
Trusted connection:
1. 'Data Source=Your_Server_Name;Initial Catalog=Your_Database_Name;Integrated Security=SSPI;'
2. 'Server=Your_Server_Name;Database=Your_Database_Name;Trusted_Connection=True;'
If the DB is file based, a relative path can be specified (relative to the Index Directory).

SQL query
Specify the SELECT query that will provide data to the indexer. Eg. SELECT * FROM mytable
The Reader class will by default use the first column as the result title and the second column (if it exists) as the static description.
If the data set will be very large, then it is possible to specify a query that uses paging to retrieve the data. Any instances of {0}, {1} and/or {2} in the query will be replaced with the start row #, end row # and number of rows requested, respectively. In this way the engine will query for 500 rows at a time (the Configuration property, DbImportPageSize, specifies the number of rows that will be requested).

Eg. if the result set has a unique number-based field;
SELECT * FROM Table1 WHERE id>={0} AND id<{1}
will cause the request to be made first as SELECT * FROM Table1 WHERE id>=0 AND id<500, then with 500 and 1000, and so on until no more data is returned from the DB.

Eg. if the result set does NOT have a unique number-based field;
SELECT * FROM ( SELECT TOP {2} * FROM ( SELECT TOP {2} * FROM ( SELECT TOP {1} * FROM Table1 ORDER BY id ASC ) a ORDER BY id DESC ) b ) d ORDER BY id ASC
will achieve the same thing.
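
To make the paging behaviour concrete, the substitution loop the engine performs can be pictured roughly as follows (an illustrative C# sketch only, using standard ADO.NET and a page size of 500 - it is not the engine's actual code):

//C#
using System;
using System.Data;
using System.Data.SqlClient;

class PagingSketch
{
    static void Main()
    {
        // {0} = start row #, {1} = end row #, {2} = number of rows requested
        string queryTemplate = "SELECT * FROM Table1 WHERE id>={0} AND id<{1}";
        int pageSize = 500; // corresponds to the DbImportPageSize Configuration property
        int startRow = 0;

        using (SqlConnection conn = new SqlConnection(
            "Server=Your_Server_Name;Database=Your_Database_Name;Trusted_Connection=True;"))
        {
            conn.Open();
            while (true)
            {
                string sql = string.Format(queryTemplate, startRow, startRow + pageSize, pageSize);
                DataTable page = new DataTable();
                new SqlDataAdapter(sql, conn).Fill(page);
                if (page.Rows.Count == 0)
                    break; // no more data returned from the DB - stop paging
                // ...each row in 'page' would be indexed here...
                startRow += pageSize;
            }
        }
    }
}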

Unique field
Specify the name of the column that holds unique, record-identifying data (eg. the primary key). This column will be added to the result URL so that your record viewer can present the record.

Result URL format
This format string defines how result URLs are created - when the user searches, the result links are built from this format. {0} is replaced with the unique field value and {1} with the unique field name, allowing your record viewer to identify which record the user clicked on.
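
For example, with a hypothetical format of "viewrecord.aspx?{1}={0}" and a unique field named "id", the result link for the record whose id is 42 would be "viewrecord.aspx?id=42".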

Custom DataSet Provider

Imports rows from a DataSet object provided by a dynamically loaded 3rd party assembly. It is similar to database importing, except that the DataSet is returned by a user assembly. This is useful when importing data from any kind of unsupported source, or from a database where the data retrieval needs to be specially handled (i.e. not by the database importer). All data in the first DataTable in the DataSet is indexed, where each row corresponds to a searchable 'result'.

Parameters

Assembly path
You can create an assembly that will provide a DataSet for indexing. Specify the full file path of your assembly here. Eg. c:\myapp\my.dll

Full class name
Specify the full class name (including namespace) of the class that will provide the DataSet - eg. mynamespace.myclass

Unique field
Specify the name of the DataSet column that holds unique, record-identifying data (eg. the primary key). This column will be added to the result URL so that your record viewer can present the record.

Result URL format
This format string defines how result URLs are created - when the user searches, the result links are built from this format. {0} is replaced with the unique field value and {1} with the unique field name, allowing your record viewer to identify which record the user clicked on.

The class specified in "Full class name" must contain a method named GetDataSet with the signature

//C#
public DataSet GetDataSet(int firstRow, int numberOfRows);

'VB.NET
Public Function GetDataSet(ByVal firstRow As Integer, ByVal numberOfRows As Integer) As DataSet

This method must return a DataSet with the data from row number 'firstRow' up to a maximum of 'firstRow' + 'numberOfRows'. If no data is available then null/Nothing must be returned. The Configuration property, DbImportPageSize, specifies the 'numberOfRows' that will be requested.
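
As an illustration, a minimal provider class might look like the following C# sketch (the namespace, class, table and column names are invented for the example, and the data happens to come from SQL Server - any source that can fill a DataSet will do):

//C#
using System.Data;
using System.Data.SqlClient;

namespace mynamespace
{
    public class myclass
    {
        // Called repeatedly by the engine; 'firstRow' advances by
        // DbImportPageSize each call, until null is returned.
        public DataSet GetDataSet(int firstRow, int numberOfRows)
        {
            DataSet ds = new DataSet();
            using (SqlConnection conn = new SqlConnection(
                "Server=Your_Server_Name;Database=Your_Database_Name;Trusted_Connection=True;"))
            {
                string sql = string.Format(
                    "SELECT * FROM Table1 WHERE id>={0} AND id<{1}",
                    firstRow, firstRow + numberOfRows);
                new SqlDataAdapter(sql, conn).Fill(ds);
            }

            // Return null when there is no more data, so the engine stops paging.
            if (ds.Tables.Count == 0 || ds.Tables[0].Rows.Count == 0)
                return null;
            return ds;
        }
    }
}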

Tips

CAS (security permission) issues:

If accessing an external assembly (ie. CustomDataSetProvider or ExternalEventHandler) in anything below Full Trust, the external assembly must be located under the application directory.

External Assembly Location Tip:

When working with a plug-in DLL and the Admin Web App., place the plug-in DLL in the application BIN directory. This will make the DLL automatically updatable, without stopping the IIS service or killing the ASP.NET worker process.
If the plug-in is outside the application and its subdirectories, ensure the ASPNET (IIS 5) or NetworkService (IIS 6) user can access the DLL, eg. run from a command prompt

cacls "path to plug-in DLL" /E /G ASPNET:R

Relative Paths

Paths to files for Database, CustomDataSetProvider and EventHandlerAssemblyPath imports can be relative to the IndexDirectory, making projects easier to move around and share.

For example, the connection string for an Access database may be;

Provider=Microsoft.Jet.OLEDB.4.0; Data Source=C:\Inetpub\wwwroot\tempTestFiles\test.mdb;

If the Access database is in the parent of the index dir, you can instead use;

Provider=Microsoft.Jet.OLEDB.4.0; Data Source=..\test.mdb;