Getting Started with Lucene: Searching your Index


Wednesday, December 8th, 2010 - By Seth Rosen

As humans, we are constantly bombarded with data throughout our lives. Thanks to the superb filtering and attention abilities of our brains, we are able to make sense of all this information. Java programs and web apps can rely on Lucene for a similar ability. Using Lucene, apps can collect information at will, add it to an index, and quickly and efficiently retrieve whatever information is needed at the moment.

In my last post we learned the basics of creating and modifying a Lucene Index. Now I’ll give you some tips on how to query your index and avoid some of the pitfalls and stumbling blocks I’ve come across.

Searcher

The first step in searching an index is to open it with an IndexReader and an IndexSearcher. The IndexReader opens the index in read-only mode and the IndexSearcher runs queries against it, allowing you to safely search the index.


IndexReader reader = IndexReader.open(FSDirectory.open(new File(index)), true); // only searching, so read-only=true
Searcher searcher = new IndexSearcher(reader);

If you are not constantly updating your index, it is a good idea to open the IndexSearcher only once. This will save you processing time and allow you to start a new Query at any time with the same IndexSearcher. On the other hand, if your index is constantly being updated, you may want to close and re-open your IndexSearcher periodically in order to capture a new snapshot of your index.
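One way to structure the "open once, refresh occasionally" pattern is to keep the current searcher behind a single shared reference and swap it when a newer snapshot is ready. The sketch below is deliberately library-independent: SnapshotHolder and its methods are hypothetical names, and the type parameter stands in for whatever searcher object your application uses. With Lucene you would build the replacement from a freshly opened reader and close the old one only after in-flight searches have finished.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, not part of Lucene: holds the searcher snapshot
// that all threads should currently use.
final class SnapshotHolder<S> {
    private final AtomicReference<S> current;

    SnapshotHolder(S initial) {
        current = new AtomicReference<S>(initial);
    }

    // Every search starts by fetching the current snapshot.
    S get() {
        return current.get();
    }

    // Install a newer snapshot and return the old one so the caller can
    // close it once no search is still using it.
    S swap(S newer) {
        return current.getAndSet(newer);
    }
}

class SnapshotHolderDemo {
    public static void main(String[] args) {
        // Strings stand in for real searcher objects here.
        SnapshotHolder<String> holder = new SnapshotHolder<String>("snapshot-1");
        System.out.println("searching with " + holder.get());
        String old = holder.swap("snapshot-2");
        System.out.println("retired " + old + ", now using " + holder.get());
    }
}
```

Returning the old snapshot from swap leaves the decision about when to close it with the caller, which matters if other threads may still be in the middle of a search.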

Queries and QueryParsers

To run a search on your index, you need a QueryParser and a Query that can be understood by the IndexSearcher. To create a QueryParser you must instantiate it using the same Analyzer that the documents in your index were created with.


Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field, analyzer);

In this example the StandardAnalyzer was used to add documents to the index so it is being used again here to retrieve them. The parameter ‘field’ can be a string representing a field name. It specifies the default field that will be used when the query doesn’t explicitly specify a field. QueryParser can also be extended to parse queries in a custom manner.

Just as you can open an IndexSearcher once and keep it open to reduce overhead, you can also instantiate a QueryParser with a default field and an Analyzer and just keep using that same one. But QueryParser isn’t thread-safe, so you would need a separate QueryParser for each thread using the index.
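A common way to get one parser per thread without passing parsers around is a ThreadLocal. The following is a minimal sketch of that pattern using only the JDK: StubParser is a hypothetical stand-in for a non-thread-safe parser such as QueryParser, and the field name "contents" is an assumption for illustration.

```java
// Hypothetical stand-in for a parser that is not thread-safe.
final class StubParser {
    final String defaultField;

    StubParser(String defaultField) {
        this.defaultField = defaultField;
    }
}

final class PerThreadParser {
    // Each thread lazily creates its own parser on first use and then reuses it.
    private static final ThreadLocal<StubParser> PARSER =
            ThreadLocal.withInitial(() -> new StubParser("contents"));

    static StubParser get() {
        return PARSER.get();
    }

    // Returns the parser instance that a different thread sees.
    static StubParser getFromOtherThread() {
        final StubParser[] seen = new StubParser[1];
        Thread worker = new Thread(() -> seen[0] = get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("same thread, same parser: " + (get() == get()));
        System.out.println("other thread, same parser: " + (get() == getFromOtherThread()));
    }
}
```

In a real application the initializer would construct the QueryParser with your default field and Analyzer instead of the stub.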

Once we have a parser we can create the most important part, the query! The query is created by passing one or more search terms, in the form of a string, to the parser.


Query query = parser.parse(queryText);

In this example the search terms are represented by ‘queryText’. The parser will interpret any modifiers in this string and create a corresponding Query that the searcher can understand. For example, if our search string is “sandwich -ham” then the parser will return a Query that matches documents containing “sandwich” but NOT containing “ham.”
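To make the meaning of “sandwich -ham” concrete, here is a toy model of the matching rule in plain Java. It is not Lucene's QueryParser and ignores analysis, scoring, and the rest of the query syntax; ToyQuery and matches are hypothetical names. It assumes the parser's default behavior, where plain terms are optional but at least one must match, and terms prefixed with ‘-’ must not appear.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of the query semantics only; this is not Lucene's QueryParser.
final class ToyQuery {
    // True if the text contains at least one plain term and none of the
    // terms prefixed with '-'.
    static boolean matches(String queryText, String documentText) {
        Set<String> tokens = new HashSet<String>(
                Arrays.asList(documentText.toLowerCase().split("\\W+")));
        boolean anyWanted = false;
        for (String term : queryText.toLowerCase().trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;                   // nothing to match
            }
            if (term.startsWith("-")) {
                if (tokens.contains(term.substring(1))) {
                    return false;           // prohibited term present
                }
            } else if (tokens.contains(term)) {
                anyWanted = true;           // a wanted term is present
            }
        }
        return anyWanted;
    }

    public static void main(String[] args) {
        System.out.println(matches("sandwich -ham", "turkey sandwich on rye"));
        System.out.println(matches("sandwich -ham", "ham sandwich"));
    }
}
```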

Results

Once we have a query we can use the searcher to find corresponding results. How the results are returned depends on your version of Lucene. Originally the searcher would return a list of ‘Hits’; this evolved into a HitCollector and, as of Lucene 3.0, just a Collector. There are plenty of examples showing how to use older versions of Lucene to return and parse Hits, so I will focus on Lucene 3.0.

To search with Lucene 3.0 you can use an existing collector or create your own. The collector will determine how the results will be returned, sorted, and filtered. A popular existing Collector is the TopScoreDocCollector which returns the top x number of results.


TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] docs = collector.topDocs().scoreDocs;
for (int i = 0; i < docs.length; i++) {
  Document result = searcher.doc(docs[i].doc); // load the stored document for this hit
  System.out.println(result);
}

In this example we retrieve the top 10 results, loop through them, and print each one. Note that each result is a Document, so you can display its field names, its field values, or both.
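If you are curious how a "top x" collector can avoid holding every match in memory, the usual technique is a bounded min-heap: keep at most x hits and evict the weakest one whenever a better hit arrives. The sketch below shows that idea with the JDK's PriorityQueue. It is an illustration, not Lucene's TopScoreDocCollector, and Hit and TopNSketch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a scored hit: a document id plus its score.
final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

// Illustration only; not Lucene's TopScoreDocCollector.
final class TopNSketch {
    private static final Comparator<Hit> BY_SCORE = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
            return Float.compare(a.score, b.score);
        }
    };

    private final int n;
    // Min-heap: the weakest of the retained hits sits at the head.
    private final PriorityQueue<Hit> heap;

    TopNSketch(int n) {
        this.n = n;
        this.heap = new PriorityQueue<Hit>(Math.max(1, n), BY_SCORE);
    }

    // Called once per matching document.
    void collect(int doc, float score) {
        if (heap.size() < n) {
            heap.add(new Hit(doc, score));
        } else if (n > 0 && score > heap.peek().score) {
            heap.poll();                    // drop the current weakest hit
            heap.add(new Hit(doc, score));
        }
    }

    // Retained hits, best score first.
    List<Hit> topDocs() {
        List<Hit> out = new ArrayList<Hit>(heap);
        Collections.sort(out, Collections.reverseOrder(BY_SCORE));
        return out;
    }

    public static void main(String[] args) {
        TopNSketch top = new TopNSketch(2);
        float[] scores = {0.3f, 0.9f, 0.1f, 0.5f};
        for (int doc = 0; doc < scores.length; doc++) {
            top.collect(doc, scores[doc]);
        }
        for (Hit hit : top.topDocs()) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```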

Another method is to create your own Collector and display the data in whichever way you want. This can be slightly more intimidating but can give you much more flexibility.


Collector streamingHitCollector = new Collector() {
  private Scorer scorer;
  private int docBase;

  // simply print docId and score of every matching document
  public void collect(int doc) throws IOException {
    // parenthesize the sum so the ids are added, not concatenated as text
    System.out.println("doc=" + (doc + docBase) + " score=" + scorer.score());
  }

  public boolean acceptsDocsOutOfOrder() { return true; }

  // called for each index segment; docBase is that segment's id offset
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.docBase = docBase;
  }

  public void setScorer(Scorer scorer) throws IOException {
    this.scorer = scorer;
  }
};
searcher.search(query, streamingHitCollector);

Here you can see we have created a collector called ‘streamingHitCollector’ which implements some key methods. Notice that the ‘collect’ method prints out the document and score as it is ‘collected’ by the searcher.
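The docBase value deserves a short explanation. The searcher visits the index one segment at a time, and the ‘doc’ passed to ‘collect’ is numbered relative to the current segment, so the index-wide id is docBase plus doc. The plain-Java sketch below walks through that arithmetic. The segment sizes are made up, DocBaseSketch is a hypothetical name, and it assumes each segment's docBase is the total size of the segments before it.

```java
// Illustration of per-segment document numbering; the segment sizes are made up.
final class DocBaseSketch {
    // docBase of a segment = total number of documents in all earlier segments.
    static int[] docBases(int[] segmentSizes) {
        int[] bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
        return bases;
    }

    // Index-wide id of a document, given its segment's docBase and its local id.
    static int globalDoc(int docBase, int doc) {
        return docBase + doc;
    }

    public static void main(String[] args) {
        int[] bases = docBases(new int[] {3, 2, 4});
        for (int base : bases) {
            System.out.println("docBase=" + base);
        }
        // Local doc 1 in the second segment.
        System.out.println("doc=" + globalDoc(bases[1], 1));
    }
}
```

When building a message with string concatenation, wrap the sum in parentheses; otherwise Java appends the two numbers as text instead of adding them.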

Locking the Index

Whenever you open an index for writing (for example, to add documents), Lucene places a lock file on the index in order to prevent concurrent writes to the same files. The lock implementation releases the lock with file.delete(), however, which doesn’t guarantee that the lock file is actually cleaned up. If something prevents the lock file from being deleted (for example, if the app ends unexpectedly), a stale lock file can remain, causing errors the next time one tries to modify the index. To release a stale lock in Lucene 3.0, first make sure no other process is still writing to the index, then do the following:

directory.clearLock(IndexWriter.WRITE_LOCK_NAME);

In this example ‘directory’ is the Directory of the index that is locked and ‘WRITE_LOCK_NAME’ is the name of the lock (typically ‘write.lock’ when adding items to an index, which Lucene exposes as the constant IndexWriter.WRITE_LOCK_NAME).

Let us know if you have any other tips or have encountered any specific problems with Lucene.

 

2 Comments

  1. Moritz says:

    Actually the Version.LUCENE_CURRENT constant is deprecated at the current version of Lucene.

    • Seth Rosen says:

      Thanks for the note, Moritz. If you are using an older version of Lucene and are including this constant, make sure to check the documentation before you upgrade to the newest version.
