Wednesday, July 04, 2007

Lucene Support in Tally-Ho

I've used Lucene for a really long time now; the site has searched articles with it practically since Lucene's Java-based inception. My initial effort to add Lucene support to Tally-Ho failed miserably due to some unexpected behaviour from Lucene, which I resolved by zeroing in on the problem with more unit tests.

Since all Article manipulation for the site goes through the ArticleService, that makes it a very handy place to automatically index Articles as they are created, updated, and moved through the lifecycle (from submitted to approved, accepted, and so on).

Adding an article to the index is trivial. We create a Directory object that points to where we'd like Lucene to write its files, create an IndexWriter to write them there, create several Field objects representing the names and contents of the fields we'd like to search on, add those Fields to a Document, and add the Document to the IndexWriter. We can then query on any combination of these fields. Great.
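In Lucene 2.x terms, the whole path is only a few lines. Here's a minimal sketch, assuming an Article bean with getId/getTitle/getBody accessors; those accessors and the field names are illustrative, not Tally-Ho's actual ones:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public void indexArticle(String indexDirectory, Article article) throws IOException {
    Directory directory = FSDirectory.getDirectory(indexDirectory);
    // false = open an existing index rather than create a new one
    IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
    try {
        Document doc = new Document();
        // Store the primary key as a single untokenized term so we can
        // delete and update by Term later.
        doc.add(new Field("id", String.valueOf(article.getId()),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", article.getTitle(),
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("body", article.getBody(),
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    } finally {
        writer.close();
    }
}
```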

The problem came when it became necessary to *change* an article. Lucene handles this via the updateDocument method, or you can call deleteDocuments and addDocument yourself. The advantage of updateDocument is that it's atomic. But for me, neither strategy worked at first.
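With the id stored as a single term, the update is one call. A sketch, with writer and doc as in the indexing code above and the "id" field name again illustrative:

```java
// Atomic: deletes every document matching the Term, then adds the replacement.
Term idTerm = new Term("id", String.valueOf(article.getId()));
writer.updateDocument(idTerm, doc);

// The non-atomic, do-it-yourself equivalent:
// writer.deleteDocuments(idTerm);
// writer.addDocument(doc);
```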

First of all, even though Lucene said it was performing a delete based on a Term (which in our case contains the primary key of the Article), it didn't actually do it unless the Field referenced by the Term was stored as Field.Index.UN_TOKENIZED. If I stored it TOKENIZED, Lucene claimed to be deleting, but the deleted Document would still show up in search queries.

Secondly, when I tried to delete a document, it looked like I could never add another document with the same fields ever again.

The first case turned out to be caused by using the StopAnalyzer to tokenize the input. When you index a term as UN_TOKENIZED, Lucene skips the Analyzer when storing the term in the index. The StopAnalyzer tokenizes only sequences of letters; numbers are ignored. This differs from the StandardAnalyzer (which also uses stop words), which tokenizes numbers as well as letters. Since we delete based on the id term, which is numeric, Lucene never found the document it was supposed to delete: the term had been tokenized into nothing by the StopAnalyzer, so the old document was not found and consequently not deleted.
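To see why a letters-only tokenizer loses a numeric id, here's a minimal plain-Java sketch (not Lucene's actual tokenizer code, but the same behaviour for this input): tokenizing on runs of letters turns a purely numeric value into zero tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterTokenizerSketch {
    // Mimics StopAnalyzer's letter-only tokenization: emit maximal runs
    // of letters, discarding digits and punctuation entirely.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("article 42")); // prints [article]
        System.out.println(tokenize("12345"));      // prints [] -- the id vanishes
    }
}
```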

The second case turned out to be caused by a fault in my unit test. I was asserting that an updated article was in the index by searching on its id field, but I never asserted that it was there *before* the update by searching the same way. That made it appear that the article had disappeared from the index and stayed gone, while other unit tests passed (those tests also searched on other terms). Once I realized that the search on id was always failing, everything fell into place. Note also that an analyzer is applied to search queries as well, so even when I stored the id term as UN_TOKENIZED, the StopAnalyzer applied to the query would effectively eliminate the value of the search term (such that it could only ever find documents that had an empty id).
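The same distinction shows up on the query side. A sketch of the two paths, with the field name illustrative: QueryParser runs its Analyzer over the query text before building the query, while a hand-built TermQuery takes the term verbatim.

```java
// Parsed path: StopAnalyzer tokenizes the text, and a purely numeric
// value produces no tokens, leaving an effectively empty query.
Query parsed = new QueryParser("id", new StopAnalyzer()).parse("12345");

// Direct path: the term is used exactly as given, so it can match a
// field that was indexed UN_TOKENIZED.
Query direct = new TermQuery(new Term("id", "12345"));
```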

Lucene 2.2 has a great feature that lets you find documents in its index that are similar to a given document. The given document doesn't even need to be in the index, but it's very easy to do if it is. Since Tally-Ho automatically includes articles in the index as soon as they are created, this case applies. The code is very simple:

Directory directory = FSDirectory.getDirectory(indexDirectory);
IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(directory);

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"combined"});

TermQuery findArticle = new TermQuery(new Term("id", String.valueOf(id)));

Hits hits = searcher.search(findArticle);
int luceneDocumentId = hits.id(0);
Query query = mlt.like(luceneDocumentId);
hits = searcher.search(query);

I probably should be checking hits.length() after the first search to make sure the article for comparison is found (it should always be found, but sometimes strange things happen).

