|
Rank: Member
Groups: Registered
Joined: 3/12/2013 Posts: 20
|
Hey Jim,

I think I may have found a small bug. We are using some of the code found on here (altered) that allows us to index meta keywords and descriptions. When inserting the meta data, we have site-specific variables that allow the client to adjust the weighting of the meta data. Everything works perfectly as long as we leave BoostFactorTagName at the Keyoti default, "keyoti_search_weight_boost_factor". If I look at the index, meta words are boosted by the correct factor. Once we change the default (in configuration.xml) and insert the new tag at run-time, it no longer boosts the meta data (defaulting back to 10*1).

Please let me know how to resolve the issue!

Cheers,
-Dan

Code below altered as follows:
- Constructor passes in 2 new values
  - metaKeyWordWeight --> int for boost factor
  - metaDescriptionWeight --> int for boost factor
- Reference KeyotiVariables.BoostFactorTagName for the configuration's BoostFactorTagName

Code:

    public class ExtendedHtmlDocumentParser : Keyoti.SearchEngine.Documents.HtmlDocumentParser
    {
        // Format templates: {0} = boost tag name, {1} = boost factor.
        // Kept constant so that every instance formats its own markers
        // instead of overwriting a shared static string.
        private const string StartMarkFormat = " <!--{0}=\"{1}\"--> ";
        private const string EndMarkFormat = " <!--{0}=\"1\"--> ";

        private readonly string keywordStartMark;
        private readonly string descStartMark;
        private readonly string endMark;

        public ExtendedHtmlDocumentParser(Configuration c, int metaKeyWordWeight, int metaDescriptionWeight)
            : base(c)
        {
            keywordStartMark = string.Format(StartMarkFormat, KeyotiVariables.BoostFactorTagName, metaKeyWordWeight);
            descStartMark = string.Format(StartMarkFormat, KeyotiVariables.BoostFactorTagName, metaDescriptionWeight);
            endMark = string.Format(EndMarkFormat, KeyotiVariables.BoostFactorTagName);
        }

        public override DocumentText Read(System.IO.Stream stream, Uri uri, Encoding encoding)
        {
            // To read the meta tags, we need a copy of the document.
            MemoryStream peakStream = CopyStream(stream);
            StreamReader peakReader = new StreamReader(peakStream);
            string documentContent = peakReader.ReadToEnd();
            peakReader.Close();

            // Now read the meta tags
            Hashtable _metaTable = ReadMetaTags(documentContent);

            // Add meta contents to a new stream
            MemoryStream modifiedStream = new MemoryStream();
            StreamWriter modWriter = new StreamWriter(modifiedStream);
            modWriter.Write(documentContent + " ");
            if (_metaTable["keywords"] != null)
            {
                modWriter.Write(keywordStartMark);
                modWriter.Write(_metaTable["keywords"].ToString());
                modWriter.Write(endMark);
                Keyoti.SearchEngine.DataAccess.Log.WriteLogEntry("Plug-in", "Meta keywords:" + _metaTable["keywords"].ToString(), Configuration);
            }
            if (_metaTable["description"] != null)
            {
                modWriter.Write(descStartMark);
                modWriter.Write(_metaTable["description"].ToString());
                modWriter.Write(endMark);
                Keyoti.SearchEngine.DataAccess.Log.WriteLogEntry("Plug-in", "Meta description:" + _metaTable["description"].ToString(), Configuration);
            }

            // Reset the stream
            modWriter.Flush();
            modifiedStream.Position = 0;

            // And do the usual parsing based on the new stream
            return base.Read(modifiedStream, uri, encoding);
        }

        // Copies the incoming stream into memory and closes the original.
        MemoryStream CopyStream(Stream stream)
        {
            MemoryStream memStream = new MemoryStream(8192);
            int c;
            byte[] buf = new byte[4096];
            while ((c = stream.Read(buf, 0, buf.Length)) > 0)
            {
                memStream.Write(buf, 0, c);
            }

            memStream.Position = 0;
            stream.Close();

            return memStream;
        }

        // Returns a table of meta tag name (lowercased) -> content; the first occurrence of a name wins.
        Hashtable ReadMetaTags(string documentBody)
        {
            Hashtable _metaTable = new Hashtable();
            Regex metaReg = new Regex("<\\s*meta[^>]*name=[\"']?([^\"'\\s]*)[^>]*content=[\"']?([^\"']*)", RegexOptions.IgnoreCase);
            MatchCollection matches = metaReg.Matches(documentBody);
            if (matches != null)
            {
                foreach (Match m in matches)
                {
                    Group name = m.Groups[1];
                    Group content = m.Groups[2];
                    if (name != null && content != null && !_metaTable.ContainsKey(name.Value.ToLower()))
                        _metaTable.Add(name.Value.ToLower(), content.Value);
                }
            }
            return _metaTable;
        }
    }
|
|
Rank: Advanced Member
Groups: Administrators, Registered
Joined: 8/13/2004 Posts: 2,669 Location: Canada
|
Thanks Dan, a couple of thoughts:

1. KeyotiVariables.BoostFactorTagName: are you sure it's correct? FWIW, you can use c.BoostFactorTagName right there. Actually, this is a good point: it would be worth checking the value of those 2 vars at that point and making sure they are identical.
2. Have you checked the content of modifiedStream to be sure it's what it should be? I can see why it should be, but you know, worth checking.

My money would be on something related to #1, because there isn't any hard-coded use of the default value for BoostFactorTagName. Otherwise, if you could post some sample values, maybe it would help.

Jim
-your feedback is helpful to other users, thank you!
|
|
Rank: Member
Groups: Registered
Joined: 3/12/2013 Posts: 20
|
Hi Jim. Thanks for the quick reply. We programmatically create the config file (see below) and use KeyotiVariables.BoostFactorTagName in its creation, to ensure that there cannot be a typo. I have looked at the generated XML file and it is using the correct value.
I have stepped through the code and confirmed that everything seems correct. As I mentioned...it works fine when left at the default tag value.
It's worth noting that I am not using this as a plugin, but tying the events at run-time.
Could it have something to do with the lifecycle instantiation of your objects? I have a potentially similar bug with the search results control: if I set SearchResult1.IndexDirectory = "some new index" in code and also have the property set in the HTML to a non-existent index, it will not return results. The solution was to delete the IndexDirectory property on the HTML page.
    if (generateConfig)
    {
        cm.CreateConfigurationXmlWithDefaultSettings(indexDirectoryPath);
        cm.RetrieveConfiguration(conf);
        conf.MaximumCrawlDepth = MaximumCrawlDepth;
        conf.Logging = Logging;
        conf.CrawlSubdomains = false;
        conf.RespectsRobotsMetaTags = false;
        conf.RespectsRobotsTXT = false;
        conf.BoostFactorTagName = KeyotiVariables.BoostFactorTagName;
        conf.IgnoreBlockBeginPattern = KeyotiVariables.IgnoreBlockBeginPattern;
        conf.IgnoreBlockEndPattern = KeyotiVariables.IgnoreBlockEndPattern;
        conf.CentralEventDispatcher.Action += CentralEventDispatcher_Action;
        conf.CentralEventDispatcher.NeedObject += CentralEventDispatcher_NeedObject;
        cm.SaveSettings(conf);
    }
Could you take another look to see what may be the issue? Thanks!
|
|
Rank: Advanced Member
Groups: Administrators, Registered
Joined: 8/13/2004 Posts: 2,669 Location: Canada
|
If you put a breakpoint at

    return base.Read(modifiedStream, uri, encoding);

and look at "Configuration.BoostFactorTagName", is it what you're expecting? "Configuration" is a property of the base class.
|
|
Rank: Member
Groups: Registered
Joined: 3/12/2013 Posts: 20
|
Hi Jim. Yes, it looks correct. I have a screenshot showing the code execution and the watch variables, but I don't see where to upload attachments.
Here are the values:

    Name                               Value                                              Type
    Configuration.BoostFactorTagName   "MO_search_weight_boost_factor"                    string
    keywordStartMark                   " <!--MO_search_weight_boost_factor=\"20\"--> "    string
    descStartMark                      " <!--MO_search_weight_boost_factor=\"15\"--> "    string
    endMark                            " <!--MO_search_weight_boost_factor=\"1\"--> "     string
As I mentioned, it does work if I change the variable in our system back to the Keyoti default.
|
|
Rank: Advanced Member
Groups: Administrators, Registered
Joined: 8/13/2004 Posts: 2,669 Location: Canada
|
Ah, seeing "MO" led me to the problem: it is a bug relating to case. If you use mo_search_weight_boost_factor, it will work fine, I think. Is it OK to use lowercase only? We'll fix this in the next version, of course, and make it properly case-insensitive.

Best
Jim
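
A minimal sketch of the lowercase workaround described above, assuming the only change needed is to lowercase the tag name before it is written into the boost comments. `BoostMarkers`, `Start`, and `End` are hypothetical names for illustration and are not part of the Keyoti API. The value assigned to `BoostFactorTagName` in the configuration would presumably need to be the same lowercase string.

```csharp
using System;

// Hypothetical helper (not part of the Keyoti API): builds the boost
// comment markers with the tag name forced to lowercase.
public static class BoostMarkers
{
    // Opening marker: {0} = tag name, {1} = boost factor.
    public static string Start(string tagName, int weight)
    {
        return string.Format(" <!--{0}=\"{1}\"--> ", tagName.ToLowerInvariant(), weight);
    }

    // Closing marker: resets the boost factor to 1.
    public static string End(string tagName)
    {
        return string.Format(" <!--{0}=\"1\"--> ", tagName.ToLowerInvariant());
    }
}
```

With this helper, `BoostMarkers.Start("MO_search_weight_boost_factor", 20)` produces ` <!--mo_search_weight_boost_factor="20"--> `, so a mixed-case tag name can no longer reach the indexed content.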
|
|
Rank: Member
Groups: Registered
Joined: 3/12/2013 Posts: 20
|
Thanks Jim! I'll try lower casing the tag. Also, did you look at the issue with setting the SearchResult.IndexDirectory at runtime? When it's defined in the HTML, it seems like it does not switch to the new index correctly.
Happy Holidays. -Dan
|
|
Rank: Advanced Member
Groups: Administrators, Registered
Joined: 8/13/2004 Posts: 2,669 Location: Canada
|
Hi Dan,

I don't believe there is a specification (I can't find one) for whether ASPX property settings should take precedence over code-behind or vice versa. I think it's up in the air and dependent on how ASP.NET processes the page/controls tree. For example, CreateChildControls can be called at any time during a control's lifecycle, and that would affect how properties are handled: if the code-behind setting was made during an event before CreateChildControls, it will be overwritten by the ASPX, and vice versa if it was set after CreateChildControls.

So, to be certain, you should either not set it in the ASPX or set it in code-behind at a later event; LoadComplete is probably safe.

http://msdn.microsoft.com/en-us/library/ms178472(v=vs.90).aspx

Season's greetings!
Jim
|
|