Monday, October 29, 2007
The Most Convoluted Problem I Ever Solved
The background: I work on a project called Tally-Ho, a community management system that powers morons.org. It's a massive Java application using the Wicket framework which I run inside Tomcat 6.0.14.
One of the features of Tally-Ho is a "story lead" queue where people can submit interesting news articles that they find around the Internet. The URL field of the submission form has a fairly simplistic validator which attempts to load the URL supplied by the user, making sure the host resolves, the server can be connected to, and the request doesn't return a 404.
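For the curious, the check amounts to something like the sketch below. The class and method names are invented for illustration (the real thing is a validator attached to the Wicket form field), but the idea is the same:
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.URL;

// Illustrative sketch of the kind of check the story-lead form performs.
public class UrlChecker {
    public static boolean looksReachable(String spec) {
        try {
            URL url = new URL(spec);
            InetAddress.getByName(url.getHost());              // does the host resolve?
            HttpURLConnection conn =
                    (HttpURLConnection) url.openConnection();  // can we connect?
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            int code = conn.getResponseCode();
            conn.disconnect();
            return code != HttpURLConnection.HTTP_NOT_FOUND;   // reject 404s
        } catch (Exception e) {
            return false;                                      // malformed, unresolvable or unreachable
        }
    }
}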
The problem was that after a while, instead of resolving hosts, my log would fill up with UnknownHostExceptions. I could do an nslookup from the command line and see that these were legitimate hosts... some of them obviously so, like washingtonpost.com.
It looked like a classic negative caching problem, and at first it probably was, in part. In Java, address lookups, both positive and negative, get cached. My initial hypothesis was that some transient lookup failure was causing a negative result to be cached, and that the caching was lasting forever, despite my configuring a networkaddress.cache.negative.ttl value in $JAVA_HOME/jre/lib/security/java.security. This hypothesis seemed reasonable in part because I could see by snooping network traffic on port 53 that the JVM was not making any further DNS requests for the hosts in question. Also, restarting the JVM seemed to clear the problem every time, suggesting that once the host was in the nameserver's cache, everything was fine.
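For reference, those cache TTLs can also be set programmatically, which is roughly equivalent to editing java.security. The values below are just examples, not what I actually use:
// Equivalent to the networkaddress.* entries in java.security; values are in
// seconds and are examples only. These must run before InetAddress is first used.
java.security.Security.setProperty("networkaddress.cache.ttl", "300");         // successful lookups
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "10"); // failed lookups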
I began trying various things including using some old, deprecated Sun-specific caching options. That didn't work. I tried hacking at the InetAddress source code to completely remove its ability to cache. That seemed to work at first, but later the old behaviour somehow returned. Then I discovered using truss and ktrace that my JVM wasn't reading java.security at all, and -Djava.security.debug=properties didn't print anything. I rebuilt the JVM from the FreeBSD port after first removing the entire work directory and indicating that it should use the precompiled "Diablo" JVM to bootstrap the new JVM.
The rebuilt JVM seemed to read java.security, so I figured the problem was solved. Not so. It still happened after Tomcat ran for a while.
I wrote a simple command line tester which attempted a name lookup, waited for a keypress, and then tried the name lookup again. Then I'd restart named, firewall the name servers for that host, and run the test code. I could verify that it retried the host and did not negative-cache it forever when run from the command line. So something was different in what was happening inside Tomcat.
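The tester was nothing fancy; a minimal sketch of it looks something like this (not the exact code, but the same idea):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Look a host up, wait for a keypress, then look it up again to see whether
// a transient failure gets negative-cached forever.
public class LookupTest {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "washingtonpost.com";
        lookup(host);
        System.out.println("Press enter to retry...");
        new BufferedReader(new InputStreamReader(System.in)).readLine();
        lookup(host);
    }

    private static void lookup(String host) {
        try {
            System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println(host + " -> UnknownHostException");
        }
    }
}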
It was then that I noticed that Tomcat was running with its native libraries. I've seen strange things happen before whenever JNI was involved, so I poked around a bit and noticed with ldd that the native libraries had been built with gcc 4.2.1. Knowing that gcc 4.2.1 has serious problems, I rebuilt the native libraries and restarted Tomcat. I repeated the same steps I used in my command line test with my submission form via Tomcat, and saw that things seemed to be working now.
Hours went by, and the same damn exception flew up my log again. What the hell? It was then that I was breathing fire, my entire being converted into pure fury.
I decided to dive a level deeper, running ktrace against the running Tomcat process so I could see every system call it made. One red herring I dispensed with fairly quickly was that in the cases where hosts seemed to resolve properly, the JVM was reading /etc/resolv.conf, and in cases where they didn't it wasn't. But looking at the source code for ResolverConfigurationImpl, it was clear that this was probably due to its internal caching (hard coded to 5 minutes, mind you).
One thing in the kdump did catch my eye though:
31013 jsvc CALL socket(0x2,0x2,0)
31013 jsvc RET socket 1414/0x586
31013 jsvc CALL close(0x586)
That file handle sure seems awfully big for a servlet container with a maximum of 256 simultaneous connections. Somewhere along this time I had also noticed that everything was fine when Tomcat had been recently restarted, but went bad after a while. I had also noticed at some point that caching seemed to have nothing to do with it. Once the failure mode had been entered, it didn't matter what the address was-- the resolver would throw an UnknownHostException for every host, immediately, without ever attempting a lookup to begin with.
So now I had a new hypothesis. That file handle number was awfully high. I was able to develop a test case that demonstrated that name resolution failed as soon as 1024 file handles were in use:
import java.net.*;
import java.util.*;
import java.io.*;

public class Test3 {
    public static void main(String[] args) throws Exception {
        ArrayList files = new ArrayList(1024);
        System.out.println("Opening lots of files");
        for (int i = 0; i < 1024; i++) {
            files.add(new FileInputStream("/dev/null"));
        }
        System.out.println("Trying to resolve freebsd.org");
        InetAddress.getByName("freebsd.org"); // throws exception!
    }
}
(It's actually only necessary to open 1020 files; stdin, stdout and stderr bump that number up to 1023, and on the 1024th file handle, it breaks).
My friend Alfred recalled that this is a FreeBSD libc bug, which has been corrected since my fairly ancient compilation of FreeBSD 6.2. At some time in the distant past, some library calls would refuse to cooperate with file handles > 1023 because they couldn't be used with select(2). My test case runs to completion on his host, but always fails with an UnknownHostException on my host. (On Linux it dies and complains about 'too many open files'. Teehee.)
So why was Tomcat leaking all these file descriptors? My first suspicion was the NioConnector, since it's new and known to be a bit buggy. I reconfigured Tomcat to use the older HTTP/1.1 connector. I waited a while, and ran ktrace on the process. No good, it was still using hundreds more file descriptors than it should have.
I decided to run fstat on the Tomcat process, and saw that the leak wasn't sockets at all, but ordinary file descriptors. Fstat, despite what the manual page claims about it showing open files, only shows the inode numbers of open files. (What dork thought that would be useful?) I downloaded and compiled lsof, which actually does list the names of the files being held open by a process.
It was then that I saw the real root of all of this trouble: the Directory objects used by Lucene, the search engine used by Tally-Ho, were never being closed. Apparently each time the article search form was used, it left a number of files open. This was easy to fix by correcting a try/catch/finally block in the Java code to ensure that the Directory and IndexReader objects were always closed after use.
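The shape of the fix is roughly this (variable names like indexDirectory are illustrative, not the actual Tally-Ho code):
Directory directory = null;
IndexReader reader = null;
IndexSearcher searcher = null;
try {
    directory = FSDirectory.getDirectory(indexDirectory);
    reader = IndexReader.open(directory);
    searcher = new IndexSearcher(directory);
    // ... run the query and collect the results ...
} finally {
    // Close everything even if the search throws, so the file handles get released.
    if (searcher != null) { try { searcher.close(); } catch (IOException ignored) {} }
    if (reader != null) { try { reader.close(); } catch (IOException ignored) {} }
    if (directory != null) { try { directory.close(); } catch (IOException ignored) {} }
}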
So to make a long story short, because the Directory object in the search engine wasn't getting closed, the application was overusing file handles, which was tickling a bug in FreeBSD that prevented socket writes from working correctly, which prevented hostnames from resolving, only after Tomcat had been running for a sufficient time to exhaust 1023 file handles, and this was after correcting a problem with a JVM that didn't read the java.security network address cache settings and a native library that was compiled with a bad version of gcc.
Holy f-ing crap.
The key lessons to learn from all of this are lessons for any debugging experience:
1. Develop a hypothesis, but don't get attached to it. (Your initial hypothesis may be wrong.) Revise your hypothesis as you get closer to the answer.
2. Eliminate unnecessary variables. (Like getting rid of native libraries.)
3. Check and recheck your assumptions. (Is java.security even getting read?)
4. Eliminate red herrings. (Reading resolv.conf has nothing to do with it.)
5. Collect more information any way you can. (ktrace, debugging statements inserted into the API, etc)
6. Compare and contrast the information about what happens when things go right with what happens when things go wrong. What's different? What's the same? (What is up with those huge numbered file handles?)
7. Devise the simplest possible test case or isolate the code path that always replicates the problem.
8. Investigate what happened that got you to that code path.
And maybe the all-important last step: tell other people about your experience on your blog, so others can benefit from your nightmare.
Labels: convolution, dns, FreeBSD, java, leak, Lucene, sockets, software, unknownhostexception
Wednesday, July 04, 2007
Lucene Support in Tally-Ho
Since all Article manipulation for the site goes through the ArticleService, it's a very handy place to automatically index Articles as they are created and updated and as they move through the lifecycle (from submitted to approved, accepted, and so on).
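Conceptually the hook is nothing more than this (articleDao and searchIndexer are made-up names standing in for the real collaborators; the actual service does quite a bit more):
// Hypothetical shape of the hook inside ArticleService.
public Article save(Article article) {
    Article saved = articleDao.save(article);  // persist as usual
    searchIndexer.index(saved);                // keep the Lucene index in sync
    return saved;
}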
Adding an article to the index is trivial. We create a Directory object that points to where we'd like Lucene to write its files, create an IndexWriter to write them there, create several Field objects for the field names and contents we'd like to search on, add those fields to a Document, and add the Document to the IndexWriter. We can then query on any combination of these fields. Great.
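In Lucene 2.2 terms that sequence looks roughly like the sketch below; the field names, analyzer choice, and Article accessors are illustrative rather than the exact Tally-Ho code:
Directory directory = FSDirectory.getDirectory(indexDirectory);
boolean create = !IndexReader.indexExists(directory);  // build the index on first use
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), create);

Document doc = new Document();
// The primary key is indexed UN_TOKENIZED so it can be matched (and deleted) exactly.
doc.add(new Field("id", String.valueOf(article.getId()),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("title", article.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("body", article.getBody(), Field.Store.NO, Field.Index.TOKENIZED));

writer.addDocument(doc);
writer.close();
directory.close();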
A problem came up when it became necessary to *change* an article. Lucene does this via the updateDocument method, or you can call deleteDocument and addDocument yourself. The advantage of updateDocument is that it's atomic. But for me, neither strategy worked at first.
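For reference, an update along those lines looks something like this sketch, where rebuildDocumentFor stands in for whatever builds the replacement Document and directory and analyzer are assumed to be set up as above:
IndexWriter writer = new IndexWriter(directory, analyzer, false);  // append to the existing index
Term idTerm = new Term("id", String.valueOf(article.getId()));
// Atomically deletes any document matching the term and adds the replacement.
writer.updateDocument(idTerm, rebuildDocumentFor(article));
writer.close();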
First of all, even though Lucene said it was performing a delete based on a Term (which in our case contains the primary key of the Article), it didn't actually do it unless the Field referenced by the Term was indexed as Field.Index.UN_TOKENIZED. If I indexed it TOKENIZED, Lucene claimed to be deleting, but the supposedly deleted Document would still show up in search queries.
Secondly, when I tried to delete a document, it looked like I could never add another document with the same fields ever again.
The first case turned out to be caused by using the StopAnalyzer to tokenize the input. When you index a term as UN_TOKENIZED, Lucene skips the Analyzer when storing the term to the index. The StopAnalyzer tokenizes only sequences of letters; numbers are ignored. This differs from the StandardAnalyzer (which also uses stop words), which tokenizes numbers as well as letters. Since we delete based on the id term, which is numeric, Lucene was never finding the document it was supposed to delete, as the term had been tokenized into nothing by the StopAnalyzer... so the old document was not found and consequently not deleted.
The second case turned out to be caused by a fault in my unit test. I was asserting that an updated article was in the index by doing a search on its id field. But I didn't assert that it was there before the update by searching the same way. For this reason it appeared that the article disappeared from the index and stayed away, because other unit tests worked (but those other tests also searched on other terms). Once I realized that the search on id was always failing, everything began to fall into place. Note also that you specify an analyzer on search queries as well, so even when I stored the id term as UN_TOKENIZED, the StopAnalyzer applied to the query would effectively eliminate the value of the search term (such that it could only ever find documents that had an empty id).
Lucene 2.2 has a great feature that lets you find documents in its index that are similar to a given document. The given document doesn't even need to be in the index, but it's very easy to do if it is. Since Tally-Ho automatically includes articles in the index as soon as they are created, this case applies. The code is very simple:
Directory directory = FSDirectory.getDirectory(indexDirectory);
IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(directory);
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"combined"});
// First, find the Lucene document for the article we want matches for...
TermQuery findArticle = new TermQuery(new Term("id", String.valueOf(id)));
Hits hits = searcher.search(findArticle);
int luceneDocumentId = hits.id(0);
// ...then ask MoreLikeThis to build a query for similar documents.
org.apache.lucene.search.Query query = mlt.like(luceneDocumentId);
hits = searcher.search(query);
I probably should be checking the result of the first call to searcher.search to make sure the article for comparison is found (it should always be found, but sometimes strange things happen).