Monday, October 29, 2007

The Most Convoluted Problem I Ever Solved

In the 17 or so years that I've been actively writing and debugging software, I have never come across a problem as convoluted as the one I finally resolved this evening. I literally fought with this problem for months before finally cracking it.

The background: I work on a project called Tally-Ho, a community management system that powers morons.org. It's a massive Java application built on the Wicket framework, which I run inside Tomcat 6.0.14.

One of the features of Tally Ho is a "story lead" queue where people can submit interesting news articles that they find around the Internet. There is a fairly simplistic URL validator on the URL field of the form which attempts to load the URL supplied by the user, making sure the host resolves, can be connected to, and doesn't return a 404.
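
For the curious, the check boils down to something like this. This is a simplified sketch rather than the actual Tally Ho code; the class name, method name and timeouts are made up for illustration:

import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.URL;

// Simplified sketch of the story lead URL check; not the real Tally Ho code.
public class LeadUrlValidator {

    public static boolean isReachable(String spec) {
        try {
            URL url = new URL(spec);

            // Make sure the host resolves at all; this throws
            // UnknownHostException if it doesn't.
            InetAddress.getByName(url.getHost());

            // Try to connect and make sure the server doesn't answer 404.
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            int code = conn.getResponseCode();
            conn.disconnect();
            return code != HttpURLConnection.HTTP_NOT_FOUND;
        } catch (Exception e) {
            // Unresolvable host, connection failure, malformed URL, etc.
            return false;
        }
    }
}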

The problem was that after a while, instead of resolving hosts, my log would fill up with UnknownHostExceptions. I could do an nslookup from the command line and see that these were legitimate hosts... some of them obviously so, like washingtonpost.com.

It looked like a classic negative caching problem, and at first it probably was, in part. In Java, address lookups, both positive and negative, get cached. My initial hypothesis was that some transient lookup failure was causing a negative result to be cached, and that it was being cached forever, despite my having configured a networkaddress.cache.negative.ttl value in $JAVA_HOME/jre/lib/security/java.security. This hypothesis seemed reasonable in part because I could see, by snooping network traffic on port 53, that the JVM was not making any further DNS requests for the hosts in question. Also, restarting the JVM seemed to clear the problem every time, suggesting that once the host was in the nameserver's cache, everything was fine.
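
For reference, the two settings in question are networkaddress.cache.ttl and networkaddress.cache.negative.ttl. They can also be set programmatically before the first lookup, which ought to be equivalent to editing java.security (assuming the JVM honours them at all, which turned out to be part of my problem). The 30 and 10 second values here are just examples:

import java.security.Security;

// These are the same two properties that live in
// $JAVA_HOME/jre/lib/security/java.security.
public class CacheTtlConfig {

    public static void configure() {
        // How long successful lookups are cached, in seconds.
        Security.setProperty("networkaddress.cache.ttl", "30");

        // How long failed lookups are cached, in seconds.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}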

I began trying various things, including some old, deprecated Sun-specific caching options. That didn't work. I tried hacking at the InetAddress source code to completely remove its ability to cache. That seemed to work at first, but later the old behaviour somehow returned. Then I discovered, using truss and ktrace, that my JVM wasn't reading java.security at all, and -Djava.security.debug=properties didn't print anything. I rebuilt the JVM from the FreeBSD port after first removing the entire work directory and indicating that it should use the precompiled "Diablo" JVM to bootstrap the new one.

The rebuilt JVM seemed to read java.security, so I figured the problem was solved. Not so. It still happened after Tomcat ran for a while.

I wrote a simple command line tester which attempted a name lookup, waited for a keypress, and then tried the name lookup again. Then I'd restart named, firewall the name servers for that host, and run the test code. I could verify that it retried the host and did not negative-cache it forever when run from the command line. So something was different in what was happening inside Tomcat.
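
The tester was more or less the following, reconstructed from memory, so the details may differ slightly from what I actually ran:

import java.net.InetAddress;
import java.net.UnknownHostException;

// Look a host up, pause so I can restart named or firewall the name servers,
// then look the same host up again to see whether a failure sticks forever.
public class LookupTest {

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "washingtonpost.com";

        lookup(host);
        System.out.println("Press Enter to retry the lookup...");
        System.in.read();
        lookup(host);
    }

    private static void lookup(String host) {
        try {
            System.out.println(host + " -> "
                    + InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println(host + " -> UnknownHostException");
        }
    }
}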

It was then that I noticed that Tomcat was running with its native libraries. I've seen strange things happen before whenever JNI was involved, so I poked around a bit and noticed with ldd that the native libraries had been built with gcc 4.2.1. Knowing that gcc 4.2.1 has serious problems, I rebuilt the native libraries and restarted Tomcat. I repeated the same steps I used in my command line test with my submission form via Tomcat, and saw that things seemed to be working now.

Hours went by, and the same damn exception flew up my log again. What the hell? It was then that I was breathing fire, my entire being converted into pure fury.

I decided to dive a level deeper, running ktrace against the running Tomcat process so I could see every system call it made. One red herring I dispensed with fairly quickly: in the cases where hosts seemed to resolve properly, the JVM was reading /etc/resolv.conf, and in the cases where they didn't, it wasn't. But looking at the source code for ResolverConfigurationImpl, it was clear that this was probably due to its internal caching (hard coded to 5 minutes, mind you).

One thing in the kdump output did catch my eye, though:

 31013 jsvc     CALL  socket(0x2,0x2,0)
 31013 jsvc     RET   socket 1414/0x586
 31013 jsvc     CALL  close(0x586)


That file handle sure seems awfully big for a servlet container with a maximum of 256 simultaneous connections. Somewhere around this time I had also noticed that everything was fine when Tomcat had been recently restarted, but went bad after a while. I had also noticed at some point that caching seemed to have nothing to do with it. Once the failure mode had been entered, it didn't matter what the address was; the resolver would throw an UnknownHostException for every host, immediately, without ever attempting a lookup to begin with.

So now I had a new hypothesis. That file handle number was awfully high. I was able to develop a test case that demonstrated that name resolution failed as soon as 1024 file handles were in use:

import java.net.*;
import java.util.*;
import java.io.*;

public class Test3 {
    public static void main(String[] args) throws Exception {

        // Keep references to the streams so the descriptors stay open.
        ArrayList<FileInputStream> files = new ArrayList<FileInputStream>(1024);

        System.out.println("Opening lots of files");
        for (int i = 0; i < 1024; i++) {
            files.add(new FileInputStream("/dev/null"));
        }

        System.out.println("Trying to resolve freebsd.org");
        InetAddress.getByName("freebsd.org"); // throws exception!
    }
}

(It's actually only necessary to open 1020 files; stdin, stdout and stderr bring the total to 1023 descriptors, and the 1024th file handle is the one that breaks.)

My friend Alfred recalled that this is a FreeBSD libc bug, which has been corrected since my fairly ancient compilation of FreeBSD 6.2. At some time in the distant past, some library calls would refuse to cooperate with file handles > 1023 because they couldn't be used with select(2). My test case runs to completion on his host, but always fails with an UnknownHostException on my host. (On Linux it dies and complains about 'too many open files'. Teehee.)

So why was Tomcat leaking all these file descriptors? My first suspicion was the NioConnector, since it's new and known to be a bit buggy. I reconfigured Tomcat to use the older HTTP/1.1 connector. I waited a while, and ran ktrace on the process. No good; it was still using hundreds more file descriptors than it should have.

I decided to run fstat on the Tomcat process, and saw that it wasn't sockets being leaked at all, but file descriptors pointing at regular files. Fstat, despite what the manual page claims about it showing open files, only shows the inode numbers of open files. (What dork thought that would be useful?) I downloaded and compiled lsof, which actually does list the files being held open by a process.

It was then that I saw the real root of all of this trouble: the Directory objects used by Lucene, the search engine behind Tally Ho, were not being closed. Apparently each time the article search form was used, it was leaving a number of files open. This was easy to fix by correcting a try/catch/finally block in the Java code to ensure that the Directory and IndexReader objects were always closed after use.
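
The fix was shaped something like this. It's a simplified sketch from memory using the Lucene 2.x API of the time, not the actual Tally Ho search code, which does quite a bit more:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch only; the point is simply that the Directory and IndexReader are
// closed in a finally block no matter what the search itself does.
public class ArticleSearch {

    public void search(String indexPath) throws IOException {
        Directory directory = null;
        IndexReader reader = null;
        try {
            directory = FSDirectory.getDirectory(indexPath);
            reader = IndexReader.open(directory);
            // ... run the actual query against the reader here ...
        } finally {
            // Close in reverse order of opening.  Without this, every use of
            // the search form left index files open until the process ran
            // out of file handles.
            if (reader != null) {
                reader.close();
            }
            if (directory != null) {
                directory.close();
            }
        }
    }
}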

So to make a long story short, because the Directory object in the search engine wasn't getting closed, the application was overusing file handles, which was tickling a bug in FreeBSD that prevented socket writes from working correctly, which prevented hostnames from resolving, only after Tomcat had been running for a sufficient time to exhaust 1023 file handles, and this was after correcting a problem with a JVM that didn't read the java.security network address cache settings and a native library that was compiled with a bad version of gcc.

Holy f-ing crap.

The key lessons to learn from all of this are lessons for any debugging experience:

1. Develop a hypothesis, but don't get attached to it. (Your initial hypothesis may be wrong.) Revise your hypothesis as you get closer to the answer.
2. Eliminate unnecessary variables. (Like getting rid of native libraries.)
3. Check and recheck your assumptions. (Is java.security even getting read?)
4. Eliminate red herrings. (Reading resolv.conf has nothing to do with it.)
5. Collect more information any way you can. (ktrace, debugging statements inserted into the API, etc.)
6. Compare and contrast the information about what happens when things go right with what happens when things go wrong. What's different? What's the same? (What is up with those huge numbered file handles?)
7. Devise the simplest possible test case or isolate the code path that always replicates the problem.
8. Investigate what happened that got you to that code path.

And maybe the all-important last step: tell other people about your experience on your blog, so others can benefit from your nightmare.



Comments:
You forgot one, which should probably be somewhere around #3 or #4.

Ask Alfred. :)

-Alfred
 
Wow. I would've jumped out of the window somewhere halfway through your story :)
 
Hi Nick,

Found your blog looking for your bar code page. Interesting story. I admire your tenacity, as this problem would have been the death of lesser minds.

Just wanted to let you know I can no longer access your barcode page. I found it useful and accurate and referred others to it when they asked me inane questions like "Why can't I put text in my bar code?"

I have used the info a few dozen times over the past six or so years. Since you don't always hear about it when you do something well, I thought I would tell you that you did something well. Probably not the only thing, but is a thing I am fairly qualified to comment on.

Good luck in your new adventure.
 
Had a little trouble with it post-perl-upgrade.

It's working now.
 




