When multiple components of a system are combined, you often run into unexpected side effects and failures whose cause seems completely unrelated to the symptom. I recently hit a situation in which a bit of configuration for jDeb (a Maven plug-in that assists in the creation of Debian .deb packages) caused an application to stall completely after about four minutes of run time.
Our app is a Java program that runs from a .jar file, and we run it in production wrapped inside YAJSW (Yet Another Java Service Wrapper). YAJSW allows us to do a few nice things like run as a daemon and run as a different, non-root user, along with some other nice features. We log using slf4j with log4j, and YAJSW also intercepts System.out and System.err to log those messages to its own log file. The whole thing gets wrapped up in a .deb package using jDeb, which lets us install smoothly, installing the daemon and updating the rc.d files.
After installation, our program started up fine and began doing its customary work. Success! Almost! After about four minutes, it stopped working, no longer making any of the API calls we expected, and generally behaving in an unresponsive manner.
The first thing I did was run jstack against the running Java process to see what all the threads were doing. Almost all the threads in the application were in a BLOCKED state on org.apache.log4j.Category.callAppenders. A quick Google search turned up a lot of complaints about similar behavior, including one comment that described exactly what we were seeing: one thread seemed to be stuck in java.io.FileOutputStream.writeBytes forever, and all the other threads were waiting for a lock so they could write to the appender too.
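The thread states that jstack prints are also available from inside the JVM through the standard ThreadMXBean. Here is a minimal sketch of that idea, not what we actually ran: the class name BlockedThreadReport and its methods are made up for illustration, and the demo just parks one thread on a lock held by another so there is something to report.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper: lists threads in the BLOCKED state, similar to
// scanning a jstack dump by eye.
class BlockedThreadReport {

    /** Returns one line per thread that is currently BLOCKED on a monitor. */
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName() + " blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    /** Creates a holder thread and a waiter thread, then reports while the waiter is stuck. */
    static List<String> demo() {
        Object lock = new Object();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await(); // keep the lock until the demo is done
                } catch (InterruptedException ignored) {
                    // fall through and release the lock
                }
            }
        }, "holder");
        Thread waiter = new Thread(() -> {
            synchronized (lock) { /* only needs to acquire the lock */ }
        }, "waiter");
        holder.setDaemon(true);
        waiter.setDaemon(true);
        try {
            holder.start();
            held.await();          // the holder now owns the lock
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);  // wait until the waiter is actually stuck
            }
            return blockedThreads();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();   // let both threads finish
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In a real stall the useful part is the "held by" column: if every blocked thread points at the same owner, that owner is the thread to look at next.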
So what could be causing a thread to block forever trying to perform a simple write? Cursory checks for excessive garbage collection or impending disk failure revealed no problems.
Things seemed to run okay when I ran the jar file from the command line as our effective user, just using sudo -u and java -jar, so the problem seemed in some way related to running our application from inside YAJSW.
A little more Google searching found a YAJSW thread which mentioned the need to create several memory-mapped files to which the wrapped process would write instead of writing to stdout and stderr; the wrapping process then 'gobbles' from these memory-mapped files for the purposes of logging.
Aha! Another thing I had noticed was that we didn't seem to get any console messages from our app in the YAJSW log like we would have expected, and a quick look revealed that in fact, no memory-mapped files for stdout and stderr were created in tmp/ for the wrapped app to write to.
So at this point it was starting to look like log4j was trying to write to its redirected System.err or System.out (for the console appender). Once some buffer filled, further writes blocked while the writing thread still held the appender lock, so every other thread that tried to log blocked too and could make no further progress.
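This failure mode is easy to reproduce in miniature. The sketch below is a stand-in under stated assumptions, not log4j or YAJSW code: LockingAppender is a made-up class that writes while holding a lock, and a PipedOutputStream that nobody reads from plays the role of the redirected stream with no consumer. The pipe size (256 bytes) and message size are arbitrary.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class StalledAppenderDemo {

    /** Stand-in for a log appender: each write happens while holding one lock. */
    static final class LockingAppender {
        private final OutputStream out;

        LockingAppender(OutputStream out) {
            this.out = out;
        }

        synchronized void append(byte[] message) throws IOException {
            out.write(message); // blocks once the pipe buffer is full
        }
    }

    /**
     * Starts the given number of logging threads against a pipe nobody reads
     * and returns how many of them are BLOCKED once logging has stalled,
     * or -1 if no stall was observed within ten seconds.
     */
    static int blockedAfterStall(int workers) {
        try {
            PipedOutputStream sink = new PipedOutputStream();
            // The read end exists but is never read, like a consumer that was never set up.
            PipedInputStream neverRead = new PipedInputStream(sink, 256);
            LockingAppender appender = new LockingAppender(sink);
            AtomicLong completed = new AtomicLong();
            byte[] message = new byte[32];

            List<Thread> threads = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                Thread t = new Thread(() -> {
                    try {
                        while (true) {
                            appender.append(message);
                            completed.incrementAndGet();
                        }
                    } catch (IOException e) {
                        // pipe broken; let the thread end
                    }
                }, "worker-" + i);
                t.setDaemon(true); // do not keep the JVM alive
                threads.add(t);
                t.start();
            }

            long deadline = System.currentTimeMillis() + 10_000;
            while (System.currentTimeMillis() < deadline) {
                long before = completed.get();
                Thread.sleep(200);
                int blocked = 0;
                for (Thread t : threads) {
                    if (t.getState() == Thread.State.BLOCKED) {
                        blocked++;
                    }
                }
                // Stalled: some writes succeeded, none since, everyone else waits for the lock.
                if (before > 0 && before == completed.get() && blocked == workers - 1) {
                    return blocked;
                }
            }
            return -1;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked threads: " + blockedAfterStall(4));
    }
}
```

One thread ends up stuck inside the write while holding the lock, and the remaining workers queue up behind it, which matches the shape of the jstack output: a single thread in the write call and everything else BLOCKED.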
Closer examination of the permissions on the tmp/ directory revealed that dpkg had created it with mode 600 instead of mode 700, and on a POSIX system a non-root user cannot create a new file in a directory without execute (search) permission on that directory. Because of inadequate permissions, the memory-mapped files for System.out and System.err redirection couldn't be created, a failure which YAJSW couldn't log, because the files were needed to set up logging.
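To make the 600 versus 700 difference concrete, here is a small sketch that decodes an octal mode and checks the two owner bits that matter for creating a file in a directory. DirModeCheck and its methods are hypothetical names; the check only looks at the owner bits, so it ignores ACLs, group and other permissions, and the fact that root bypasses these checks.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper for reasoning about directory modes such as "600" and "700".
class DirModeCheck {

    /** Converts a three-digit octal mode such as "700" into a permission set. */
    static Set<PosixFilePermission> fromOctal(String mode) {
        if (!mode.matches("[0-7]{3}")) {
            throw new IllegalArgumentException("expected three octal digits: " + mode);
        }
        StringBuilder symbolic = new StringBuilder();
        for (char digit : mode.toCharArray()) {
            int bits = digit - '0';
            symbolic.append((bits & 4) != 0 ? 'r' : '-');
            symbolic.append((bits & 2) != 0 ? 'w' : '-');
            symbolic.append((bits & 1) != 0 ? 'x' : '-');
        }
        return PosixFilePermissions.fromString(symbolic.toString());
    }

    /**
     * A non-root owner needs both write and execute (search) permission
     * on a directory to create a new entry in it.
     */
    static boolean ownerCanCreateFiles(String dirMode) {
        Set<PosixFilePermission> perms = fromOctal(dirMode);
        return perms.contains(PosixFilePermission.OWNER_WRITE)
                && perms.contains(PosixFilePermission.OWNER_EXECUTE);
    }

    public static void main(String[] args) {
        for (String mode : new String[] {"600", "700"}) {
            System.out.println(mode + " -> owner can create files: " + ownerCanCreateFiles(mode));
        }
    }
}
```

Mode 600 decodes to rw------- and fails the check because the execute bit is missing; mode 700 decodes to rwx------ and passes.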
The culprit turned out to be my own misunderstanding of how jDeb's permissions work when applying a template data type in a dataSet. The perm mapper allows you to specify a user, group, filemode and dirmode for use when creating directories and files, but when using the template data type, the mode that gets used for creating the paths is not the dirmode, but the filemode. Consequently, the filemode needed to be 700 rather than 600 for the mapper for the template paths.
The chain of failure ended up looking like this:
- I misunderstood how jDeb/Debian package permission mappings worked when creating new, empty directories in a .deb package. Intuitively it seemed like dirmode permissions would be used when creating an empty directory, but in fact filemode permissions get used.
- This caused dpkg to create the new tmp/ directory with mode 600, excluding execute permission on the directory.
- Because of the exclusion of the execute bit, the YAJSW wrapper could not create the memory-mapped files it needed for writing using its overridden System.out and System.err.
- Because the files didn't get created, the YAJSW wrapper process could not consume the output of System.out and System.err, and the wrapped process had nowhere to write once whatever internal buffering existed was full.
- Because the buffer filled up and writes blocked, one log4j appender blocked waiting to write while inside a synchronized block.
- Because the write inside the synchronized block was blocked forever, all of the other threads that wanted to log anything then blocked waiting to enter the synchronized block.
- The whole process stopped working because eventually everything was waiting on a lock or waiting for IO to complete.
And that is how a permission mistake in a configuration file can lead to an application locking up after about four minutes (the time it took to fill up some buffer somewhere).