Custom Lucene Scoring


Monday, December 13th, 2010 - By Seth Rosen

When you have a lot of data, finding what you are looking for can be a challenge. Fortunately, Google and other search engines help us make sense of the vast amount of data on the web. But what about your own data? Lucene is a great tool for indexing and searching large amounts of information quickly, and it applies a great deal of intelligence to decide which Documents matter most for a given query. On the surface Lucene is easy to set up and immediately useful, but taking a peek under the hood gives real insight into its most powerful features.

I have covered the basics of indexing and searching in Lucene before. For those of you interested in its internal workings, there is no better place to start than the scoring system: the critical calculation that determines which results your searches return, and in what order. Depending on the type of data you are indexing and the purpose of your application, you may want to implement a custom method of scoring. For instance, when searching used-car listings you may want to put more weight on make and model than, say, color.

Scoring Variables

Lucene’s default scoring system works very well for most cases. It combines seven variables to determine the final ranking of each document (list adapted from lucenetutorial.com):

  • tf = term frequency in document = measure of how often a term appears in the document
  • idf = inverse document frequency = measure of how often the term appears across the index
  • coord = number of terms in the query that were found in the document
  • lengthNorm = measure of the importance of a term according to the total number of terms in the field
  • queryNorm = normalization factor so that queries can be compared
  • boost (index) = boost of the field at index-time
  • boost (query) = boost of the field at query-time
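
Taken together, the Similarity documentation combines these factors roughly as follows (a paraphrase of the Javadoc formula, so treat it as a sketch rather than the exact implementation; norm(t,d) folds the lengthNorm and the index-time boosts into a single stored value):

score(q,d) = coord(q,d) · queryNorm(q) · Σ over terms t in q of [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]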

The full details of the Similarity algorithm can be found in Lucene’s Javadoc and tutorial pages. For the moment I will focus on the simplest method for adjusting scoring: “Boost”.

Boost

Boost increases the weight of a specific field or document so that relevant queries rank it higher. This is a simple way to adjust Lucene’s default behavior; the only tricky part is deciding when and where to apply it.

If you are boosting an entire document, you must set its boost before the document is indexed, with document.setBoost(3.5f). This gives every field in the document 3.5 times its normal weight when scores are computed, making it far more likely to surface in search results.
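
Here is a minimal index-time sketch, assuming the Lucene 3.x API this post targets (writer is an already-open IndexWriter, and the field names and values are purely illustrative):

Document doc = new Document();
doc.add(new Field("make", "chevy", Field.Store.YES, Field.Index.ANALYZED));
doc.setBoost(3.5f);      // must be called before the document is indexed
writer.addDocument(doc);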

If you are only interested in increasing the chances of a single field matching a query, you have two options: set the field’s boost before it is added to the document, or add the boost when performing a query. To set the boost during indexing, call field.setBoost(1.4f) on the field; this works in much the same way as the document boosting shown above. Remember that a field’s boost is compounded with any boost given to the document it is added to.
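
A sketch of the field-level equivalent, under the same Lucene 3.x assumptions:

Field make = new Field("make", "chevy", Field.Store.YES, Field.Index.ANALYZED);
make.setBoost(1.4f);     // compounds with any boost set on the enclosing document
doc.add(make);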

Boosts added during indexing are more efficient, but if you want to give a specific query clause more importance at search time, consider query boosting instead. For example, in a query for “big red chevy truck” you might boost “chevy”, since the make may be the most important term:

Query makeQuery = parser.parse("chevy");
makeQuery.setBoost(2.0f);                         // setBoost() takes a float
Query descQuery = parser.parse("big red truck");

BooleanQuery query = new BooleanQuery();          // combine the boosted clause with the rest
query.add(makeQuery, BooleanClause.Occur.SHOULD);
query.add(descQuery, BooleanClause.Occur.SHOULD);

Extending the Similarity Class

More ambitious users may want to implement their own version of Lucene’s Similarity class. This allows fine-grained control over the algorithm that calculates the scores of your documents: you can neutralize any of the variables listed above or add custom calculations to suit your needs. For example:


import org.apache.lucene.search.DefaultSimilarity;

public class IsolationSimilarity extends DefaultSimilarity {

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f;   // ignore how rare the term is across the index
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1.0f;   // ignore how many of the query's terms matched
    }

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;   // ignore how many terms the field contains
    }
}


This code shows how, by overriding specific Similarity methods, you can neutralize the calculations those methods perform. Here, by returning 1.0f we ignore inverse document frequency (idf), term coordination (coord), and field length normalization (lengthNorm). In other words, we no longer care how often a term appears across the index, how many of the query’s terms were found in the document, or how many terms are in the field. With those factors out of the picture, documents are ranked solely on how often the searched-for term appears in them, plus any boost on the document or field.
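
To put the custom class to work, you register it on both the indexing and the searching side. A short sketch, again assuming the Lucene 3.x API (searcher is an open IndexSearcher):

Similarity custom = new IsolationSimilarity();
Similarity.setDefault(custom);    // picked up by IndexWriter at index time
searcher.setSimilarity(custom);   // used by this IndexSearcher at query time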



Thoughts

Experimenting with Lucene’s scoring system may seem intimidating at first, but with a little patience you can tailor it to serve your application far better. Just as every data set and index is unique, the tool used to search it should be tuned to match.

Please let us know if there are any other Lucene topics you would like us to cover.

 
