If, for example, a support forum on the site has numerous duplicate pages with different URLs (reply links, 'reply with quote' links), it is desirable to exclude the duplicates from the index at the crawling stage.
Suppose that the links are as follows:
http://localhost/forum.aspx?action=reply&mesgid=1
http://localhost/forum.aspx?action=view&mesgid=1
http://localhost/forum.aspx?action=reply&mesgid=2
http://localhost/forum.aspx?action=view&mesgid=2
and that the 'reply' pages contain the same text as the 'view' pages. To prevent the 'reply' pages from being added to the index at the crawl stage, we can add an entry to the 'Path matches to be ignored' collection: "forum.aspx?action=reply". Any URL containing that text will then be skipped. Because it is a collection, multiple URL segments can be ignored.
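For illustration, the pathMatchesToBeIgnoredList used in the code further below might be built like this (a minimal sketch; a List<string> of URL fragments is assumed here, so check the WebsiteBasedIndexableSourceRecord constructor for the exact collection type it expects):

// Assumes System.Collections.Generic is imported.
List<string> pathMatchesToBeIgnoredList = new List<string>();
// Any URL containing this text will be excluded at the crawl stage.
pathMatchesToBeIgnoredList.Add("forum.aspx?action=reply");
// Further fragments can be added to exclude other duplicate-producing URLs.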
Note: if an index has already been created, this setting has no effect on documents already in the index, so if necessary, delete the index and recrawl.
The crawler also respects the robots meta tag; a page containing the following tag will not be indexed, and its links will not be followed:
<meta name="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"/>
To ignore the robots meta tag, set RespectsRobotsMetaTags to false in the configuration.
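For example (a minimal sketch, assuming Configuration is the same configuration object passed to DocumentIndex below and that it exposes RespectsRobotsMetaTags as a boolean property):

// By default the crawler respects robots meta tags; setting this to false
// makes it index and follow pages regardless of the tag.
Configuration.RespectsRobotsMetaTags = false;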
The site is then crawled and imported into the index as follows:

DocumentIndex imp = new DocumentIndex(Configuration);
// Crawl from startUrlString, skipping any URL that matches an entry in the ignore list.
urlStrings = imp.Import(new WebsiteBasedIndexableSourceRecord(startUrlString, pathMatchesToBeIgnoredList, pathMatchesToBeIncludedList));
imp.Close();
(pathMatchesToBeIgnoredList and pathMatchesToBeIncludedList can be null/Nothing)