
Bug with changing the BoostFactorTagName - SearchUnit - Forum

DanatWork
#1 Posted : Sunday, December 8, 2013 6:11:42 PM
Rank: Member

Groups: Registered

Joined: 3/12/2013
Posts: 20
Hey Jim,
I think I may have found a small bug.
We are using an altered version of some code found on here that lets us index meta keywords and descriptions. When inserting the meta data, we have site-specific variables that allow the client to adjust its weighting.

Everything works perfectly as long as we leave the BoostFactorTagName at the Keyoti default, "keyoti_search_weight_boost_factor". If I look at the index, meta words are boosted by the correct factor.

Once we change the default (in the configuration.xml) and insert the new tag at run-time, it no longer boosts the meta data (defaulting back to 10*1).

Please let me know how to resolve the issue!

Cheers,
-Dan

Code below altered as follows:
- Constructor passes in 2 new values
- metaKeyWordWeight --> int for boost factor
- metaDescriptionWeight --> int for boost factor
- KeyotiVariables.BoostFactorTagName supplies the tag name, which should match the BoostFactorTagName in the configuration

Code:

using System;
using System.Collections;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
// Also needed: the Keyoti namespaces that declare Configuration and DocumentText.

public class ExtendedHtmlDocumentParser : Keyoti.SearchEngine.Documents.HtmlDocumentParser
{
    // The format templates stay constant and the formatted markers are per instance,
    // so a second parser built with different weights does not reuse the first parser's markers.
    private const string StartMarkTemplate = " <!--{0}=\"{1}\"--> ";
    private const string EndMarkTemplate = " <!--{0}=\"1\"--> ";
    private readonly string keywordStartMark;
    private readonly string descStartMark;
    private readonly string endMark;

    public ExtendedHtmlDocumentParser(Configuration c, int metaKeyWordWeight, int metaDescriptionWeight) : base(c)
    {
        keywordStartMark = string.Format(StartMarkTemplate, KeyotiVariables.BoostFactorTagName, metaKeyWordWeight);
        descStartMark = string.Format(StartMarkTemplate, KeyotiVariables.BoostFactorTagName, metaDescriptionWeight);
        endMark = string.Format(EndMarkTemplate, KeyotiVariables.BoostFactorTagName);
    }

    public override DocumentText Read(System.IO.Stream stream, Uri uri, Encoding encoding)
    {
        //To read the meta tags, we need a copy of the document.
        MemoryStream peakStream = CopyStream(stream);
        StreamReader peakReader = new StreamReader(peakStream);
        string documentContent = peakReader.ReadToEnd();
        peakReader.Close();

        //Now read the meta tags
        Hashtable _metaTable = ReadMetaTags(documentContent);

        //Add meta contents to a new stream
        MemoryStream modifiedStream = new MemoryStream();
        StreamWriter modWriter = new StreamWriter(modifiedStream);
        modWriter.Write(documentContent + " ");
        if (_metaTable["keywords"] != null)
        {
            modWriter.Write(keywordStartMark);
            modWriter.Write(_metaTable["keywords"].ToString());
            modWriter.Write(endMark);
            Keyoti.SearchEngine.DataAccess.Log.WriteLogEntry("Plug-in", "Meta keywords:" + _metaTable["keywords"].ToString(), Configuration);
        }
        if (_metaTable["description"] != null)
        {
            modWriter.Write(descStartMark);
            modWriter.Write(_metaTable["description"].ToString());
            modWriter.Write(endMark);
            Keyoti.SearchEngine.DataAccess.Log.WriteLogEntry("Plug-in", "Meta description:" + _metaTable["description"].ToString(), Configuration);
        }

        //Reset the stream
        modWriter.Flush();
        modifiedStream.Position = 0;

        //And do the usual parsing based on the new stream
        return base.Read(modifiedStream, uri, encoding);
    }

    MemoryStream CopyStream(Stream stream)
    {
        MemoryStream memStream = new MemoryStream(8192);
        int c;
        byte[] buf = new byte[4096];
        while ((c = stream.Read(buf, 0, buf.Length)) > 0)
        {
            memStream.Write(buf, 0, c);
        }

        memStream.Position = 0;
        stream.Close();

        return memStream;
    }

    Hashtable ReadMetaTags(string documentBody)
    {

        Hashtable _metaTable = new Hashtable();
        Regex metaReg = new Regex("<\\s*meta[^>]*name=[\"']?([^\"'\\s]*)[^>]*content=[\"']?([^\"']*)", RegexOptions.IgnoreCase);
        MatchCollection matches = metaReg.Matches(documentBody);
        if (matches != null)
        {
            foreach (Match m in matches)
            {
                Group name = m.Groups[1];
                Group content = m.Groups[2];
                if (name != null && content != null && !_metaTable.ContainsKey(name.Value.ToLower()))
                    _metaTable.Add(name.Value.ToLower(), content.Value);
            }
        }
        return _metaTable;
    }

}

Jim
#2 Posted : Monday, December 9, 2013 9:45:53 PM
Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,667
Location: Canada
Thanks Dan, couple of thoughts

1. KeyotiVariables.BoostFactorTagName, are you sure it's correct? FWIW you can use c.BoostFactorTagName right there. Actually, this is a good point, it would be worth you checking the value of those 2 vars at that point and making sure they are identical.

2. Have you checked the content of modifiedStream to be sure it's what it should be? I can see why it should be, but you know, worth checking.

My money would be on something related to #1, because there isn't any hard coded use of the default value for BoostFactorTagName.

Otherwise, if you could post some sample values, maybe it would help.
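
Both checks can be scripted rather than inspected by hand. Here is a minimal sketch, assuming hypothetical helper names (BoostTagDiagnostics, TagNamesMatch and DumpStream are illustrative and not part of SearchUnit). The first method compares the two tag names exactly, so a difference in case alone is reported. The second returns the text held in a MemoryStream without moving its position, so the stream can still be passed to base.Read afterwards.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical diagnostic helpers; none of these names come from SearchUnit.
public static class BoostTagDiagnostics
{
    // Ordinal comparison: the names must be identical character for character,
    // so "MO_x" and "mo_x" count as a mismatch.
    public static bool TagNamesMatch(string fromOwnSettings, string fromConfiguration)
    {
        return string.Equals(fromOwnSettings, fromConfiguration, StringComparison.Ordinal);
    }

    // MemoryStream.ToArray copies the whole buffer regardless of Position,
    // so the stream is left exactly where it was.
    public static string DumpStream(MemoryStream stream, Encoding encoding)
    {
        return encoding.GetString(stream.ToArray());
    }
}
```

In the parser posted above, TagNamesMatch(KeyotiVariables.BoostFactorTagName, c.BoostFactorTagName) could be checked in the constructor, and DumpStream(modifiedStream, Encoding.UTF8) logged just before the call to base.Read. UTF-8 is used for the dump because a StreamWriter created without an explicit encoding writes UTF-8.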

Jim

-your feedback is helpful to other users, thank you!



DanatWork
#3 Posted : Tuesday, December 10, 2013 3:17:14 PM
Rank: Member

Groups: Registered

Joined: 3/12/2013
Posts: 20
Hi Jim.
Thanks for the quick reply. We create the config file programmatically (see below) and use KeyotiVariables.BoostFactorTagName when creating it, to ensure that there cannot be a typo.
I have looked at the xml file generated and it is using the correct value.

I have stepped through the code and confirmed that everything seems correct. As I mentioned...it works fine when left at the default tag value.

It's worth noting that I am not using this as a plugin, but tying the events at run-time.

Could it have something to do with the lifecycle instantiation of your objects? I have a potentially similar bug with the SearchResult control: if I set SearchResult1.IndexDirectory = "some new index" in code and the property is also set in the HTML to a non-existent index, it will not return results. The solution was to delete the IndexDirectory property from the HTML page.

if (generateConfig)
{
    cm.CreateConfigurationXmlWithDefaultSettings(indexDirectoryPath);
    cm.RetrieveConfiguration(conf);
    conf.MaximumCrawlDepth = MaximumCrawlDepth;
    conf.Logging = Logging;
    conf.CrawlSubdomains = false;
    conf.RespectsRobotsMetaTags = false;
    conf.RespectsRobotsTXT = false;
    conf.BoostFactorTagName = KeyotiVariables.BoostFactorTagName;
    conf.IgnoreBlockBeginPattern = KeyotiVariables.IgnoreBlockBeginPattern;
    conf.IgnoreBlockEndPattern = KeyotiVariables.IgnoreBlockEndPattern;
    conf.CentralEventDispatcher.Action += CentralEventDispatcher_Action;
    conf.CentralEventDispatcher.NeedObject += CentralEventDispatcher_NeedObject;
    cm.SaveSettings(conf);
}

Could you take another look to see what may be the issue?
Thanks!
Jim
#4 Posted : Tuesday, December 10, 2013 4:10:31 PM
Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,667
Location: Canada
If you put a break point at

return base.Read(modifiedStream, uri, encoding);

and look at "Configuration.BoostFactorTagName", is it what you're expecting? "Configuration" is a property of the base class.






DanatWork
#5 Posted : Tuesday, December 10, 2013 6:06:14 PM
Rank: Member

Groups: Registered

Joined: 3/12/2013
Posts: 20
Hi Jim. Yes, it looks correct.
I have a screenshot showing the code execution and the watch variables....but I don't see where to upload attachments.

Here are the values:
Name                              Value                                            Type
Configuration.BoostFactorTagName  "MO_search_weight_boost_factor"                  string
keywordStartMark                  " <!--MO_search_weight_boost_factor=\"20\"--> "  string
descStartMark                     " <!--MO_search_weight_boost_factor=\"15\"--> "  string
endMark                           " <!--MO_search_weight_boost_factor=\"1\"--> "   string

As I mentioned, it does work if I change the variable in our system back to the Keyoti default.


Jim
#6 Posted : Tuesday, December 10, 2013 8:38:42 PM
Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,667
Location: Canada
Ah, seeing "MO" led me to the problem - it is a bug relating to case. If you use mo_search_weight_boost_factor it will work fine I think.

Is it OK to use lowercase only? We'll fix it in the next version of course and make it properly case insensitive.
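
Until that fix ships, the workaround can be applied where the markers are built. Here is a minimal sketch, assuming a hypothetical helper (BoostMarkers.Build is illustrative, not a SearchUnit API) that lowercases the tag name before it goes into the comment marker. It only helps if the same lowercase value is also what gets stored in conf.BoostFactorTagName, so that the configuration and the markers agree.

```csharp
using System;

// Hypothetical helper; the template is the one used by the parser posted earlier in this thread.
public static class BoostMarkers
{
    private const string Template = " <!--{0}=\"{1}\"--> ";

    // Lowercases the tag name so the emitted marker never contains uppercase letters.
    public static string Build(string tagName, int weight)
    {
        return string.Format(Template, tagName.ToLowerInvariant(), weight);
    }
}
```

With the values from post #5, BoostMarkers.Build("MO_search_weight_boost_factor", 20) produces a marker that starts with <!--mo_search_weight_boost_factor="20".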

Best
Jim



DanatWork
#7 Posted : Thursday, December 12, 2013 10:07:34 AM
Rank: Member

Groups: Registered

Joined: 3/12/2013
Posts: 20
Thanks Jim! I'll try lower casing the tag.
Also, did you look at the issue with setting the SearchResult.IndexDirectory at runtime? When it's defined in the HTML, it seems like it does not switch to the new index correctly.

Happy Holidays.
-Dan
Jim
#8 Posted : Thursday, December 12, 2013 10:16:48 PM
Rank: Advanced Member

Groups: Administrators, Registered

Joined: 8/13/2004
Posts: 2,667
Location: Canada
Hi Dan, I don't believe there is a specification (I can't find one) for whether ASPX property settings should take precedence over codebehind or vice-versa. I think it's up in the air and dependent on how ASP.NET processes the page/controls tree.

For example, CreateChildControls can be called at any time during a control's lifecycle, and that would affect how properties are handled (if the codebehind setting was set during an event before CreateChildControls then it will be overwritten by the ASPX, and vice-versa if it was set after CreateChildControls).

So, to be certain you should either not set it in ASPX or set it in code-behind at a later event; LoadComplete is probably safe.

http://msdn.microsoft.com/en-us/library/ms178472(v=vs.90).aspx
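
The ordering effect described above can be shown with a toy model. This is a sketch only: FakeSearchResult and ApplyMarkup are hypothetical stand-ins, not the real SearchResult control or any ASP.NET API, and ApplyMarkup simply models the moment at which the value declared in the ASPX gets applied. Whichever assignment runs last wins.

```csharp
using System;

// Stand-in for a server control; NOT the real SearchResult class.
public class FakeSearchResult
{
    private readonly string markupIndexDirectory;

    public string IndexDirectory { get; set; }

    public FakeSearchResult(string markupIndexDirectory)
    {
        this.markupIndexDirectory = markupIndexDirectory;
    }

    // Models the point in the lifecycle where the ASPX-declared value is applied.
    public void ApplyMarkup()
    {
        if (markupIndexDirectory != null)
            IndexDirectory = markupIndexDirectory;
    }
}

public static class LifecycleDemo
{
    // Code-behind assigns first, markup is applied afterwards: the markup value wins.
    public static string SetBeforeMarkup()
    {
        var control = new FakeSearchResult("~/OldIndex");
        control.IndexDirectory = "~/NewIndex";
        control.ApplyMarkup();
        return control.IndexDirectory;
    }

    // Markup is applied first, code-behind assigns afterwards: the code-behind value wins.
    public static string SetAfterMarkup()
    {
        var control = new FakeSearchResult("~/OldIndex");
        control.ApplyMarkup();
        control.IndexDirectory = "~/NewIndex";
        return control.IndexDirectory;
    }
}
```

In this model SetBeforeMarkup returns "~/OldIndex" and SetAfterMarkup returns "~/NewIndex". On a real page, the late assignment would go in a handler for the page's LoadComplete event, which is the stage suggested above as probably safe.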

Season's greetings!
Jim




Copyright © 2002- Keyoti Inc.