
Guidelines on maximum reasonable size for index? - SearchUnit - Forum

bill
#1 Posted : Thursday, March 19, 2015 2:25:45 PM
Rank: Advanced Member

Groups: Registered

Joined: 2/29/2008
Posts: 43
I use SearchUnit to index a support ticket management system and we're now considering indexing diagnostic and error logs sent by customers. As a test I indexed about 10% of the total content and ended up with index files totaling about 1 GB after optimization.

Obviously I can just go ahead and test this out at full scale myself, but I'm wondering whether there are any guidelines on how big the index can get before performance problems appear. Is it solely going to depend on available memory and processing power?

Any other considerations for indexing log files like this? There's going to be a lot in every log file that we don't care about, and a tiny bit of relevant information. Is this likely to overwhelm the index with noise?
Jim
#2 Posted : Thursday, March 19, 2015 4:27:20 PM
Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,667
Location: Canada
Yes, you're right about the 'noise' - in a log file the main source of it is dates and times, because every unique instance like 4/5/2015 or 03:20:02 goes into the lexicon as a separate word.

From the Help:

Quote:

For example, if you have a very large repository of observational data (e.g. text files filled with records from flight or weather data, etc., that contain many different 'word' strings) then indexing it may cause slowdowns for certain searches (wildcards) and any index change operations (adding/removing etc). Although Search can handle this type of data, any unnecessary indexing should be avoided if possible.
Hint - the Configuration.IndexNumbers property can also be set to false so that numbers are not indexed or searched.
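
If you decide that numbers aren't needed anywhere, that's just a property on the configuration object. A minimal sketch, assuming the usual Configuration set-up - only the IndexNumbers line comes from the Help above; the directory path and everything else here are illustrative placeholders, so adapt it to however you already construct yours:

Code:

// Sketch only - assumes the standard SearchUnit Configuration object.
// Only IndexNumbers = false comes from the Help; the rest is illustrative.
Configuration configuration = new Configuration();
configuration.IndexDirectory = @"C:\SearchIndex";  // placeholder - wherever your index lives
configuration.IndexNumbers = false;                // dates, times and other numbers won't be indexed
// ...pass 'configuration' to your indexing and searching objects as you normally do...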


If you want numbers to be indexed in some places and not others, I can show you how to write a plug-in that removes text you don't need indexed (and any other noise) - this would help minimize the index size.

To answer your main point: as you can see, it's hard to give specifics about maximum sizes because it depends on the content being indexed (number of unique words, ratio of file length to number of files, etc.) and on the types of searches being performed (wildcard searches are the most expensive, single keywords the least). There really isn't a substitute for trying it yourself, but my recommendation would definitely be to cull as much noise as you can, regardless of whether the engine can handle the data with the noise or not.

Like I say, if you think there is content you can programmatically remove from the log files, let me know and I'll help you write the plug-in. Another option could be to remove all success-type messages, perhaps?
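
To give you an idea of the kind of clean-up I mean, here's a rough sketch of the pre-processing logic in plain C# - this isn't the actual plug-in interface, just the text clean-up part, and the keywords and regexes are examples you'd tune to your log format:

Code:

using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LogNoiseFilter
{
    // Keep only warning/error lines and strip date/time tokens so they never
    // reach the lexicon. Purely illustrative - adjust the patterns to your logs.
    public static string Strip(string rawLog)
    {
        var kept = rawLog.Split('\n')
            .Where(line => line.IndexOf("ERROR", StringComparison.OrdinalIgnoreCase) >= 0
                        || line.IndexOf("WARN", StringComparison.OrdinalIgnoreCase) >= 0);

        string text = string.Join("\n", kept);
        text = Regex.Replace(text, @"\b\d{1,2}/\d{1,2}/\d{2,4}\b", " ");  // dates like 4/5/2015
        text = Regex.Replace(text, @"\b\d{1,2}:\d{2}(:\d{2})?\b", " ");   // times like 03:20:02
        return text;
    }
}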

Best
Jim
-your feedback is helpful to other users, thank you!


bill
#3 Posted : Thursday, March 19, 2015 4:53:12 PM
Rank: Advanced Member

Groups: Registered

Joined: 2/29/2008
Posts: 43
Thanks, Jim. You've confirmed my thought that it's a bad idea to index everything.

I'm already indexing the files programmatically using PreloadedDocument, so I'll give some thought to parsing the files and only adding the relevant bits. Or maybe set it up so we can flag "significant" files that we want added to the search.
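
Roughly what I have in mind - just a sketch, where IsFlaggedSignificant and the keywords are placeholders for whatever flagging scheme we settle on, and the returned text would go into the PreloadedDocument exactly the way I add content now:

Code:

using System;
using System.IO;
using System.Linq;

static class SignificantLogExtractor
{
    // Placeholder for however we end up flagging "significant" files (DB flag, naming convention, etc.)
    static bool IsFlaggedSignificant(string path) => path.EndsWith(".important.log");

    // Returns only the lines worth indexing, or null if the file shouldn't be indexed at all.
    public static string BuildIndexableText(string logFilePath)
    {
        if (!IsFlaggedSignificant(logFilePath)) return null;

        var relevant = File.ReadLines(logFilePath)
            .Where(line => line.Contains("ERROR") || line.Contains("Exception"));

        return string.Join(Environment.NewLine, relevant);
    }
}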

Or stick with the current approach, which is to remember to copy the relevant lines from the log and paste them into the case notes where they'll get indexed :)