Programmatic indexing of files - SearchUnit - Forum

bill
#1 Posted : Wednesday, May 21, 2014 1:22:21 PM
Rank: Advanced Member

Groups: Registered

Joined: 2/29/2008
Posts: 43
I am indexing a content management system that has content stored in a database, and file attachments stored on disk. For various reasons I need to do the indexing programmatically rather than having the indexer crawl the site.

I am using PreloadedDocument to add the database content and that works fine. Now I also want to index the attachments. What is the best way to do that? Do I need to open each file, determine its type, find the right parser to read its content, and add it using PreloadedDocument? Or is there a simpler way to let the indexer do that work for me?
Jim
#2 Posted : Wednesday, May 21, 2014 1:57:50 PM
Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,667
Location: Canada
Hi, yes, you do need to open each file and determine its MIME type. Then you can use our API to obtain the correct parser. So this code will work, provided you can figure out the MIME type.

Code:

ParserProvider parserProvider = new ParserProvider(configuration);
Parser parser = parserProvider.GetParser( fileMimeType );
//fileStream is an open Stream over the file; the Uri is only a placeholder
DocumentText dt = parser.Read( fileStream, new Uri("http://dummy"), null);

StringBuilder text = new StringBuilder();
foreach(Word w in dt.Words){
    text.Append(w.WordContent+" ");
}


Then put text.ToString() into a PreloadedDocument.

Actually, you can get the MIME type for a file extension from our API too:
Code:

string fileMimeType = configuration.FileTypesSettings[ fileExtension.ToUpper() ]; //where fileExtension is a string like "PDF" or "xls"

Best
Jim

-your feedback is helpful to other users, thank you!

bill
#3 Posted : Thursday, May 22, 2014 10:35:06 AM
Thanks. I'll give that a try.
darylteo
#4 Posted : Wednesday, September 24, 2014 3:26:12 AM
Rank: Member

Groups: Registered

Joined: 9/24/2014
Posts: 7
Sorry to hijack, but I am evaluating Keyoti Search for ASP.NET for my client and I have a similar use case. I have a single page whose details are stored in the DB and whose file attachments are stored on disk.

However, I need the entire thing to be treated as one document (so the DB meta plus the content of all file attachments must be indexed as a single document).

Based on the tip you provided above, I should be able to just add everything into a single text string, put it into a PreloadedDocument, then queue it for indexing.

Alternatively, I could let Keyoti index all the relevant files, then map the filenames of the attachments back to the report URLs, but this would need a DB lookup for every result. In this case, would Keyoti then be able to search on keywords present in both the DB meta and the file attachments?

Is this correct, or can you please recommend a better way? The documentation is very lacking and I am guessing at this point.

Thank You,
Daryl
Jim
#5 Posted : Wednesday, September 24, 2014 3:18:34 PM
Thanks for posting, Daryl. The PreloadedDocument approach is a good one - it seems less fragile to me, in that you don't have to worry about pointing attachments or meta back to the document.

I'd be happy to expand on the PreloadedDocument code I posted above. It would help if you could post some code showing how you would loop through the records in your DB and obtain the meta, content and attachments. A bit of context makes snippets more useful. Also, how do you determine the URL that you want the result to point to (is it in the DB)?

Best
Jim
darylteo
#6 Posted : Wednesday, September 24, 2014 11:10:58 PM
The data for each row will be in the following format:

ID, c1, c2, etc., path1,path2,path3

Each path will be a physical file path, with file extensions (e.g. pdf, doc, docx, xls, xlsx etc.)

It sounds like using PreloadedDocument would be pretty straightforward in this case. Would there be a need to delimit the different parts when they are put into the search index? For example, I see above that you simply concatenate all the words together. Would, say, the last word of the first document then be matched as a phrase with the first word of the second document?
darylteo
#7 Posted : Wednesday, September 24, 2014 11:12:45 PM
More accurately (since I can't edit my post)

"ID", "c1", "c2", ... , "path1,path2,path3"

The paths will simply be a comma-delimited set of paths. Easy to just split.
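
A minimal sketch of that split, assuming no path contains a comma itself. The class and method names (AttachmentPaths, Split, ExtensionKey) are hypothetical and not part of the Keyoti API. ExtensionKey produces the dot-less, upper-cased form ("PDF") that matches the file-extension examples given earlier in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of Keyoti SearchUnit.
public static class AttachmentPaths
{
    // Splits a comma-delimited column value into trimmed, non-empty paths.
    public static List<string> Split(string commaDelimited)
    {
        if (string.IsNullOrWhiteSpace(commaDelimited))
            return new List<string>();

        return commaDelimited
            .Split(',')
            .Select(p => p.Trim())
            .Where(p => p.Length > 0)
            .ToList();
    }

    // Path.GetExtension returns ".pdf" (dot included) or "" when there is
    // no extension; this strips the dot and upper-cases the rest.
    public static string ExtensionKey(string path)
    {
        return Path.GetExtension(path).TrimStart('.').ToUpperInvariant();
    }
}
```

Each path returned by Split can then be opened as a stream, and ExtensionKey gives the string to use when looking up the MIME type for that file.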
Jim
#8 Posted : Thursday, September 25, 2014 2:23:10 AM
Thanks. By the way, you can edit your post by clicking the pencil icon.

quote:
Would there be a need to delimit the different parts when they are put into the search index? For example: I see below that you simply concat all the words together. Would, say, the last word of the first document then be matched as a phrase with the first word of the 2nd document?



Yes, you're correct. You could do

text.Append(" * ");

between the docs to prevent the last word of one and the first word of the next from being considered a phrase.


So, I think what you'll need to do is this:

Code:

//get main content
StringBuilder text = new StringBuilder();
text.Append( metaDataAsString );
text.Append(" * ");
text.Append( mainContentAsString );
text.Append(" * ");

foreach(string attachmentPath in attachments){

    //Path.GetExtension returns ".pdf" (with the dot), so strip the dot to get "PDF"
    string fileExtension = Path.GetExtension(attachmentPath).TrimStart('.').ToUpper();
    string fileMimeType = configuration.FileTypesSettings[ fileExtension ];

    ParserProvider parserProvider = new ParserProvider(configuration);
    Parser parser = parserProvider.GetParser( fileMimeType );

    //assumes attachmentPath is a physical file path on disk; use any other Stream source if not
    using(Stream attachmentStream = File.OpenRead(attachmentPath)){
        DocumentText dt = parser.Read( attachmentStream, new Uri("http://dummy"), null);

        foreach(Word w in dt.Words){
            text.Append(w.WordContent+" ");
        }
    }
    //delimiter so words from different attachments are not matched as a phrase
    text.Append(" * ");
}

//index 'text'
string customData = null;
string uri = "http://....whatever URL you want for the result link....";
documentIndex.AddDocument(new PreloadedDocument(new Uri(uri), title, text.ToString(), summary, null, null, null, customData, configuration));



Info about programmatic indexing and PreloadedDocument.
http://keyoti.com/produc...ammatic%20Importing.htm

I hope that helps - let me know if anything is unclear, and how it goes please.
Jim
darylteo
#9 Posted : Thursday, September 25, 2014 11:05:50 PM
That looks good.

Using this method, if an attachment is removed, I assume I would then need to reindex the entire report, yes?
Jim
#10 Posted : Friday, September 26, 2014 12:36:03 PM
That's right, you would.
darylteo
#11 Posted : Wednesday, October 1, 2014 3:30:39 AM
Hi Jim,

I'm having some success with the method above.

There are a couple of places where I'm having difficulty:

- Some of the fields are rich text (i.e. HTML). I tried making a MemoryStream out of the data and passing it as text/html. This works, but I always get several slashes at the end. My testing seems to indicate that it isn't an HTML issue (it also happens when parsing plain text with text/plain). FileStreams don't seem to exhibit this behaviour. Any ideas what this could be? I even suspected it was an encoding issue, but I can't get it out properly even with System.Text.Encoding.UTF8.

- In your example, you simply use WordContent and append a space after it. However, my testing indicates that the parsers return whitespace as Words as well. This is despite there being a stoplist.txt that was generated by the launcher and copied across. Is this intended behaviour or is this a configuration thing? Should I be including a pattern test to match only [a-zA-Z0-9]+?

Code:

// Parses an HTML fragment and returns the words the parser found.
private static string HtmlToString(string data, Encoding encoding = null) {
    return DataToString(data, "text/html", encoding);
}

// Encodes the string (UTF-8 by default) and parses it as the given MIME type (text/plain by default).
private static string DataToString(string data, string mime = null, Encoding encoding = null) {
    encoding = encoding ?? System.Text.Encoding.UTF8;
    mime = mime ?? "text/plain";

    using(var stream = new MemoryStream(encoding.GetBytes(data))){
        return DataToString(stream, mime, encoding);
    }
}

// Runs the parser for the MIME type over the stream and concatenates the returned words.
// Note: 'encoding' is not used here; it is not passed on to the parser.
private static string DataToString(Stream stream, string mime, Encoding encoding) {
    var builder = new StringBuilder();

    var parserProvider = new ParserProvider(SearchEngineFunctions.Configuration);
    var parser = parserProvider.GetParser(mime);

    var dt = parser.Read(stream, new Uri("http://fake"), null);

    foreach(var word in dt.Words) {
        builder.Append(word.WordContent);
    }

    return builder.ToString();
}


After joining with other strings, and searching, ResultItem.Content gives me the following. Note the slashes at the end of Report Abstract (rich text), and the spaces between Report Abstract/Report Text even when I did not append any spaces.

Code:
Daryl's Report of Things * Daryl * Daryl * 2014-10-01 * 2014-10-02 * Yart * $1000 * Yart *
Report Abstract  / / *
Report Text Yart


Edit: Almost definitely an encoding issue :( The slashes only appear when something else is appended to the string.

Edit: Okay, I think my string has some Unicode spaces which aren't being treated as whitespace.
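
One way to rule that out is to normalize the text before it goes to the parser or the index. The sketch below is only an illustration: the thread does not say which characters were involved, so the handling of the zero-width space (U+200B) and the byte order mark (U+FEFF) is an assumption, and WhitespaceCleaner is a hypothetical name, not a Keyoti class. It replaces every run of Unicode whitespace (including the no-break space U+00A0) with a single ASCII space.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Keyoti SearchUnit.
public static class WhitespaceCleaner
{
    // Collapses each run of Unicode whitespace (plus U+200B and U+FEFF,
    // which char.IsWhiteSpace does not cover) into one ASCII space.
    public static string Normalize(string input)
    {
        if (string.IsNullOrEmpty(input))
            return string.Empty;

        var sb = new StringBuilder(input.Length);
        bool lastWasSpace = false;

        foreach (char c in input)
        {
            bool isSpace = char.IsWhiteSpace(c) || c == '\u200B' || c == '\uFEFF';
            if (isSpace)
            {
                if (!lastWasSpace)
                    sb.Append(' ');
                lastWasSpace = true;
            }
            else
            {
                sb.Append(c);
                lastWasSpace = false;
            }
        }

        // Drop any leading or trailing space left over from the collapse.
        return sb.ToString().Trim();
    }
}
```

Calling WhitespaceCleaner.Normalize on the rich-text field before building the MemoryStream would show whether the stray characters come from the data or from the parser.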

Regards,
Daryl
Jim
#12 Posted : Wednesday, October 1, 2014 2:38:36 PM
Daryl, I'm glad you were able to resolve the issues. Sorry, the trailing space in my code

text.Append(w.WordContent+" ");

was unnecessary; just text.Append(w.WordContent); is fine, as you noted.

Generally extraneous whitespace isn't a problem because it doesn't affect the search, and also when viewed with HTML, multiple whitespace collapses to single whitespace.
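
If tidier stored text is wanted anyway, the whitespace-only words can be skipped and the rest joined with single spaces. This is a rough sketch over plain strings; WordJoiner is a hypothetical name, and it assumes each WordContent value has first been collected into a list of strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Keyoti SearchUnit.
public static class WordJoiner
{
    // Skips null, empty and whitespace-only entries, trims the rest,
    // and joins them with exactly one space between words.
    public static string Join(IEnumerable<string> words)
    {
        return string.Join(" ",
            words.Where(w => !string.IsNullOrWhiteSpace(w))
                 .Select(w => w.Trim()));
    }
}
```

This avoids the pattern test on [a-zA-Z0-9]+ suggested earlier, which would also drop accented letters and other non-ASCII words.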

Jim
darylteo
#13 Posted : Tuesday, October 7, 2014 1:12:47 AM
Hi Jim,

could you attempt to parse this http://arrb.com.au/admin...ices_1341534810296.pdf?

From what I am doing, I get

- headers split into individual characters.
- multiple instances of ARRB parsed as ARB.

Is this merely an issue with the PDF parser?
Jim
#14 Posted : Tuesday, October 7, 2014 2:34:01 PM
Thanks for the link. I would say 95% of it is correct, but yes, the parser is where the spacing problems are arising. Unfortunately PDF was designed as a one-way output format, so extracting the text back out of it isn't always perfect. That said, you could try the Adobe IFilter if you want (let me know).

Here's the text that came back from the parser (the forum seems to be struggling with the bullets, but they were ok in the text out from the parser):

quote:

About ARRB:
ARRB Group Ltd (ARRB) provides information services to the road research, consulting and applies research outcomes to and transport industry. ARRB road and traffic information and develop equipment that collects software that assists with decision ARRB is the leading provider of making across road networks. road research and best practice workshops in Australia.

Services

Q u a lit y assu r a nce
As a centre for road infrastructure expertise, ARRB has been developing and offering services for over 30 years. These services have been fundamental in improving road quality in Australia.
ARRB is at the forefront of service provision and is accredited with ISO 901:208.

ARRB Group Ltd ABN 68 004 620 651 www.arrb.com.au

ARRB's two laboratories (in Melbourne and Perth) both comply with the requirements of ISO/IEC 17025:205, with ARRB maintaining the accreditation since May 20. In 208, ARRB established an FWD calibration facility operating under the FHWA FWD Calibration Protocol 207. ARRB has the ability to calibrate the full range of FWD equipment.

MATERIALS, PAVEMENT & CONCRETE DESIGN & TESTING | INFRASTRUCTURE AS ET MANAGEMENT | SAFE SYSTEMS | L A N D T R A N S P O RT RE S O U R C E S & I N F O RM A T I O N | ROA D S A F E T Y |

TRAF IC ENGINE RING |

EQUIPMENT MANUFACTURE & DATA COL ECTION SERVICES | T R A N S P O RT P O L I C Y , OP E R A T I O N S & E C O N O M I C S | KNOWLEDGE TRANSFER & CAPACITY BUILDING | HE A V Y V E H I C L E T E S T I N G & S I M U L A T I O N |

PARKING |

In 209, the entire scope of ARRB's Services department was externally audited to the requirements of ISO 901:208. ARRB's certification number is 14029 and covers the suply of equipment and associated services.

Head Office:
Australia, Victoria: 500 Burwood Highway Vermont South VIC 3133 Australia P: +61 3 9881 1555 F: +61 3 9887 8104 info@arrb.com.au Luxmoore Parking Consulting Ground Floor, 12 Wellington Parade East Melbourne, VIC 3002 Australia P: +61 3 9417 5277 F: +61 3 9416 2602 www.luxmooreparking.com.au

Regional Offices:
Western Australia: 191 Carr Place Leederville WA 6007 Australia P: +61 8 9227 3000 F: +61 8 9227 3030 arrb.wa@arrb.com.au Queensland: 123 Sandgate Road Albion QLD 4010 Australia P: +61 7 3260 3500 F: +61 7 3862 4699 arrb.qld@arrb.com.au South Australia: 121 King William Street Adelaide SA 5000 Australia P: +61 8 7200 2659 F: +61 8 8423 4500 arrb.sa@arrb.com.au New South Wales: 2-14 Mountain St Ultimo NSW 2007 Australia P: +61 2 9282 4444 F: +61 2 9280 4430 arrb.nsw@arrb.com.au

China: Floor 13, Zhen Xing Building 118 North Hu Bin Road Xiamen PRC P: +86 592 2135 552 F: +86 592 2136 663 www.arrb-china.com.cn

Trusted advisor to road authorities for technical input and solutions

June 2012

R o ad condition and inventory survey s

P a vem e nt streng t h testing

O t her services

Whether your need is for a large-scale network survey, a local road system survey or ride quality test, ARB's Services department offers quality data collection, using ARB's own Hawkeye platform.

ARB's pavement strength testing services are carried out using a Falling Weight Deflectometer (FWD), which is a non-destructive testing device that provides data on the bearing capacity of road and airport pavements.

In partnership with our colleagues and in-house development team, ARB's services department have developed complementary options to cater to the specific needs of our clients.

ARB has built up extensive network survey experience over its many years of operation. This experience translates into the provision of accurate, reliable and time data, in accordance with national and international standards.

Suitable for highways, local roads, railways and airport runways, FWD testing allows for more accurate and rapid measurement of pavement deflection under loads than traditional methods.

Ride Quality Testing
ARB offers independent assessments of the surface characteristics during the post construction of pavements. Our testing is undertaken in accordance with the prevailing jurisdiction's test method, e.g. VicRoads or RTA.

ARB maintains a fleet of over 10 dedicated survey vehicles, with various configurations to meet our clients' requirements, that can be driven anywhere in Australia for various types of data collection.

The data can assist in applications such as pavement overlay design, pavement condition surveys and in the development and operation of a Pavement Management System (PMS).

With trained survey operators located in Melbourne, New South Wales, Queensland and Perth, ARB Services has people on hand to provide quality data collection assistance for your next project.

A dynamic load is generated by the dropping of a mass, similar to that of a moving vehicle or aircraft wheel loads, from a pre-set height. The magnitude of the load and the pavement response are measured by a load cell and nine geophones.

ARB also routinely undertakes pre and post construction dilapidation surveys related to the creation of major infrastructure so as to provide an independent assessment of the impact on public roads.

ARB Services is backed by an extensive support team, with over 20 years experience and knowledge in the data collection field.

The FWD is equipped with DGPS, thus providing location information up to an accuracy of 1 metre and all data collected is in accordance with International Standard ASTM D 469- (203).

Applications
· Post maintenance / construction ride quality testing · Independent laser and imaging surveys for baseline data for infrastructure projects

Services offered

Features

· Automated pavement surface assessments

· Non-destructive testing of pavements with testing between 7 kN to > 240 kN

Asset Management

· Geometry and mapping surveys

· Nine configurable geophone arrangements to suit rigid pavement testing and flexible pavements

· Roadside inventory and asset management

· Rigid and articulated pavement testing plates with both 30 m and 450 m diameter to suit most applications

ARB has the capability to assist your organisation with effective management of your road and bridge assets.

· Road safety assessment · Airport runway maintenance

· Outputs maximum deflection, curvature, multiple temperature sensor measurement, surface modulus, spatial and linear referencing and other client requirements

· Advisory speed and travel times

· Data is collected on time and within budget

· Road construction quality testing

· Post survey analysis undertaken in EFROMD3, ELMOD, ROSY and Excel

Our unique road and bridge asset management services provide innovative customer focussed solutions making it possible for your network manager to effectively manage the network to achieve best value.

· Line reflectivity surveys

Features

ARB has a wealth of experience in providing asset management solutions, highlighted by the scope of our clients which range from local government through to statelevel, national and international organisations.

· Multiple vehicle configurations for customised surveying options · Outputs include longitudinal profile, transverse profile, roughness, rutting, texture, cross-fall, slope, grade and both asset and pavement images · All data parameters collected in a single pass

Having the right information is the first step in best practice asset infrastructure management. This information can be analysed to optimise road and bridge expenditure using long term, soundly based decision tools and expertise of ARB.

· Safe and · Efficient and timely collection of data in both urban and highway environments · Trained professionals available for customised visual rating and post survey processing (completed using Hawkeye Processing Toolkit)

SCALEABLE

SURVEY

SOLUTIONS

www.arrb.com.au

Trusted advisor to road authorities for technical input and solutions






Jim
darylteo
#15 Posted : Tuesday, October 7, 2014 10:31:59 PM
It's dropping letters from the name of the company =) hope it won't be a huge issue for them.

Copyright © 2002- Keyoti Inc.