Nick Johnson

Friday, November 23, 2007

A Major Milestone for Tally-Ho: Arbitrary HTML Pages

Tonight I created the first localized, arbitrary HTML page in Tally-Ho. I don't have all of the corner cases handled yet, but I was able to go to the creation page, choose a Locale, give an arbitrary "path", filename, title and content, save the page and then view the page with a nice URL that makes it look like a static page.

The whole mess integrates directly into the Tally-Ho BinaryResourceService locator/localizer system, just as the existing Wicket integration via the BinaryResourceStreamLocator does now, so it automatically takes advantage of localization and caching.

One of the more cumbersome challenges in this effort was getting a nice URL. Wicket has many URL coding strategies for Bookmarkable pages; I use IndexedParamUrlCodingStrategy quite a bit. But IndexedParamUrlCodingStrategy wasn't going to work in this case. It takes each element of a path and associates it with an index number. What I needed was the path itself as a parameter, not chopped up and indexed.

It turns out that writing one of these coding strategies from scratch for Wicket is difficult if you don't know what you're doing (like me), and the Javadoc for the various classes involved is sadly a bit lacking in direction. So I took a different approach: I stole a bunch of code from IndexedParamUrlCodingStrategy and modified it to fit my needs. Behold, UriPathUrlCodingStrategy (with comments snipped for space... they're in CVS, though, and rest assured they credit Igor for writing IndexedParamUrlCodingStrategy):


package net.spatula.tally_ho.wicket;

import java.io.UnsupportedEncodingException;
import java.util.Map;

import wicket.Application;
import wicket.PageMap;
import wicket.PageParameters;
import wicket.WicketRuntimeException;
import wicket.protocol.http.request.WebRequestCodingStrategy;
import wicket.request.target.coding.BookmarkablePageRequestTargetUrlCodingStrategy;
import wicket.settings.IRequestCycleSettings;
import wicket.util.string.AppendingStringBuffer;
import wicket.util.value.ValueMap;

public class UriPathUrlCodingStrategy extends BookmarkablePageRequestTargetUrlCodingStrategy {


    public UriPathUrlCodingStrategy(String mountPath, Class bookmarkablePageClass) {
        super(mountPath, bookmarkablePageClass, PageMap.DEFAULT_NAME);
    }

    public UriPathUrlCodingStrategy(String mountPath, Class bookmarkablePageClass, String pageMapName) {
        super(mountPath, bookmarkablePageClass, pageMapName);
    }

    protected void appendParameters(AppendingStringBuffer url, Map parameters) {
        if (parameters.containsKey("uri")) {
            String[] pathParts = ((String) parameters.get("uri")).split("/+");
            for (String string : pathParts) {
                if (string == null || "".equals(string)) {
                    continue;
                }
                try {
                    Application app = Application.get();
                    IRequestCycleSettings settings = app.getRequestCycleSettings();
                    url.append("/").append(java.net.URLEncoder.encode(string, settings.getResponseRequestEncoding()));
                } catch (UnsupportedEncodingException e) {
                    throw new WicketRuntimeException(e);
                }
            }
        }

        String pageMap = (String) parameters.get(WebRequestCodingStrategy.PAGEMAP);
        if (pageMap != null) {
            url.append("/").append(WebRequestCodingStrategy.PAGEMAP).append("/").append(urlEncode(pageMap));
        }

    }

    protected ValueMap decodeParameters(String urlFragment, Map urlParameters) {
        PageParameters params = new PageParameters();
        if (urlFragment == null) {
            return params;
        }
        if (urlFragment.startsWith("/")) {
            urlFragment = urlFragment.substring(1);
        }

        String[] parts = urlFragment.split("/");
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
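            // Only treat this part as the page map marker if a value follows it;
            // otherwise parts[i] below would run past the end of the array.
            if (WebRequestCodingStrategy.PAGEMAP.equals(parts[i]) && i + 1 < parts.length) {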
                i++;
                params.put(WebRequestCodingStrategy.PAGEMAP, parts[i]);
            } else {
                builder.append("/").append(parts[i]);
            }
        }
        params.put("uri", builder.toString());
        return params;
    }

}
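
To make the round trip concrete without a running Wicket application, here is a minimal standalone sketch of the same idea: the whole path travels as one "uri" parameter rather than being chopped into indexed pieces. Everything in it is an assumption for illustration: the class name, the "/pages" mount path and the UTF-8 encoding are made up, and unlike the strategy above it URL-decodes each segment itself when decoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class UriPathRoundTrip {

    // Encode: append each segment of the "uri" parameter to the mount path.
    static String encode(String mountPath, Map<String, String> parameters) {
        StringBuilder url = new StringBuilder(mountPath);
        String uri = parameters.get("uri");
        if (uri != null) {
            for (String part : uri.split("/+")) {
                if (part.isEmpty()) {
                    continue; // skip the empty piece produced by a leading slash
                }
                try {
                    url.append('/').append(URLEncoder.encode(part, "UTF-8"));
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return url.toString();
    }

    // Decode: everything after the mount path becomes a single "uri" parameter.
    static Map<String, String> decode(String mountPath, String url) {
        StringBuilder uri = new StringBuilder();
        for (String part : url.substring(mountPath.length()).split("/+")) {
            if (part.isEmpty()) {
                continue;
            }
            try {
                uri.append('/').append(URLDecoder.decode(part, "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e);
            }
        }
        Map<String, String> parameters = new LinkedHashMap<String, String>();
        parameters.put("uri", uri.toString());
        return parameters;
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new LinkedHashMap<String, String>();
        parameters.put("uri", "/docs/my page/index.html");
        String url = encode("/pages", parameters); // "/pages" is a hypothetical mount path
        System.out.println(url);
        System.out.println(decode("/pages", url).get("uri"));
    }
}
```

An indexed strategy would hand the page three separate parameters (0, 1 and 2) for that URL; here the page gets the path back in one piece.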

The next steps will be update capability for the HTML pages and then an attachment selector/uploader tool to handle the association of other resources to the HTML page.

This will be a giant leap forward for Tally-Ho and allow for the conversion of dozens of old morons.org pages to the new system.

Labels: software, tally-ho, wicket

# posted by Nick : 8:29 PM 1 Comments

Sunday, November 11, 2007

What's New in Tally-Ho?

And just where is that 1.0 release, anyway?

Well, I made good on my threat to rip out Toplink Essentials and replace it with OpenJPA. OpenJPA is a bit more pedantic about some things. For example, this code would run fine in Toplink but would throw an IllegalStateException in OpenJPA:


    entityManager.getTransaction().begin();
    entityManager.close();

While I was working on dropping in OpenJPA, I decided that I really wanted my tests to pass from within Maven, so I could be sure that run-time enhanced (woven) classes were all going to work nicely. I also wanted to make sure that none of the tests depended on any data to be in the database that wasn't put there by the SQL scripts to initialize the DB. So I modified my base test class to perform one-time database wiping/initialization prior to running any tests. This exposed a great many flaws in tests that I wrote in a fairly lazy fashion to assume that certain objects were already present.

After fixing all of that, I decided to let Eclipse clean up a lot of other code for me. Eclipse's Source > Clean Up feature is very powerful, allowing you to "final" gobs of things and add the default serial version ID to Serializable classes in one fell swoop.

Then I got to work on what the next major project / feature is for Tally-Ho: arbitrary HTML pages. For quite a while I agonized over how to manage associations between pages. In most of the world, if a page changes its name or location, links to it break. I also needed the capability to attach images, PDFs or other documents to arbitrary HTML. It turns out that the solution to both of these problems is the same. In a massive refactoring of BinaryResource, any BinaryResourceReference can now be attached to any other BinaryResourceReference. BinaryResource is gone, and instead the relationship is now: one BinaryResourceReference has many BinaryResourceReferenceLocales, each of which has one BinaryResourceContent. A BinaryResourceReference may also have many Attachments, each of which has a sequence number and a reference to the attached BinaryResourceReference. An HtmlPage is just a subclass of BinaryResourceReference with some bits added for the title, keywords, whether to include a message board, etc.

Attachments are numbered in sequence (1, 2, 3). Inside the HtmlPageService, references to attachments are converted by an AttachmentUrlProvider (an interface) and Velocity to their URLs. So if you want to refer to the URL for attachment #1, you use ${1} in the HTML. (Roughly... this bit isn't done yet.) It is up to the AttachmentUrlProvider to decide how to make the URL, given the scope, path and extension.
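
Since this bit isn't done yet, the following is only a rough sketch of how the ${1} substitution could behave. It swaps in a plain regular expression where the real HtmlPageService uses Velocity, and the interface shown here is a simplified, hypothetical stand-in: its single urlFor(int) method and the "/attachments/" URL layout are assumptions, not the actual AttachmentUrlProvider contract (which works from the scope, path and extension).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttachmentUrlSketch {

    /** Hypothetical, simplified stand-in for the real AttachmentUrlProvider. */
    interface AttachmentUrlProvider {
        String urlFor(int sequenceNumber);
    }

    // Matches placeholders such as ${1} or ${12}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\d+)\\}");

    /** Replaces every ${n} in the HTML with the URL the provider returns for attachment n. */
    static String resolveAttachments(String html, AttachmentUrlProvider provider) {
        Matcher matcher = PLACEHOLDER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            int sequence = Integer.parseInt(matcher.group(1));
            // quoteReplacement keeps any '$' or '\' in the URL from being treated specially
            matcher.appendReplacement(out, Matcher.quoteReplacement(provider.urlFor(sequence)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up URL scheme, purely for demonstration.
        AttachmentUrlProvider provider = n -> "/attachments/" + n + ".png";
        System.out.println(resolveAttachments("<img src=\"${1}\"> and <img src=\"${2}\">", provider));
    }
}
```

Because the HTML only ever mentions sequence numbers, an attached resource can be renamed or moved without touching the page content; only the provider has to know where it lives now.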

This refactoring is probably 75% complete. I'm too burnt out on code and too tired to work on it any more this weekend.

So to summarize what's changed:

1. OpenJPA replaced Toplink Essentials
2. Everything builds and tests in Maven (including compile time bytecode instrumentation)
3. Refactoring of binary resources
4. Initial HtmlPage work
5. Binary resource attachment support.

A couple other things I learned today about JPA:

1. If you JOIN multiple things, at least with OpenJPA you need to alias each thing you join, i.e., JOIN x.foo foo JOIN foo.bar bar. The parser will complain if you leave off that last "bar" in that example.
2. Your ability to lazy load ends once the EntityManager you used to load your object is closed. I knew this before and subsequently forgot, and then learned it again the hard way. Merging the entity with a new EntityManager doesn't work either. You need to keep the original one open until you're done navigating your object graph.

Labels: openjpa, software, tally-ho, toplink essentials, velocity

# posted by Nick : 6:09 PM 1 Comments

Sunday, November 04, 2007

Time for a Divorce

I've been using Toplink Essentials as the JPA provider for Tally Ho almost exclusively since the project began, except for a quick look at OpenJPA. This is in part because of my experience with the commercial Toplink product- I know that Oracle's Toplink is a mature product (having started out in the early 90's as a Smalltalk persistence provider), and I am comfortable working with it. Unfortunately, the open source Toplink Essentials product does not live up to the promise of Toplink. I've reached the point where I'm tired of coding around its bugs, and now that there are other, healthier projects out there, I shouldn't have to.

That got me to thinking: a lot of us developers use a lot of open source software. How do we choose which packages we want to use? Obviously whatever we choose has to be a good technical fit for our needs... what's the point if it doesn't do the job we're after? But now it occurs to me that open source software has to meet a particular social need as well. One way we can gauge a project's health is by how strong it is socially. How interested are people? Is the project active? Are people excited about the project? Excited enough to fix bugs?

It's a little hard to compare apples to apples in this case, but let's look at a couple things and try to relate them as best we can. Toplink Essentials is maintained as part of the Glassfish project.

In the last 30 days at the time of this writing, the folks working on it have fixed 10 bugs. In that same amount of time, 15 bugs were opened (or changed and left in an opened state). About 53 messages have been posted on the discussion forum. The oldest unresolved bug has been open for about a year and 10 months.

In that same amount of time, the OpenJPA folks have resolved 19 bugs while 20 have been opened. The mailing list has had about 215 posts. The oldest unresolved bug is a year and 4 months old (though it was touched 3 months ago). OpenJPA is using Jira which makes it a bit easier to produce meaningful metrics such that we can find that the average unresolved age of a bug in the last month is about 3 months, which has been fairly consistent.

(I gave up trying to compute the average unresolved age of bugs for Toplink Essentials. It's just too annoying to figure out if the bug tracking tool doesn't do it for you.)

It is probably the case that most open source projects (and probably closed ones too) have a few ancient bugs gathering dust. I think that it's more interesting to look at what a project has been doing recently, like in the last 30-180 days. Are they keeping up with their bug backlog? Is there an active community? Are you likely to get help if you ask for it? Of the bugs that come in, what percentage get fixed and what percentage get dumped in the attic?

And perhaps the most important criterion of all: are they fixing MY bug?

While I wasn't watching, OpenJPA reached a 1.0.0 release. It's available under the Apache 2 license from a Maven 2 repository. They fixed the bug I opened earlier this year (within a day even). It is full-featured and even has an extensive manual. Though, like Toplink, their ant task doesn't work very well.

I used to be concerned about the large number of dependencies that OpenJPA has, but now that the project is building with Maven 2, it's much less of a concern for me. It isn't necessary to go manually fetch anything to build the project, since Maven 2 takes care of all the direct and transitive dependencies. One thing I did have to manually tweak was to force inclusion of commons-collections 3.2 in my pom.xml, because something else in my project depends on an earlier version of commons-collections, and OpenJPA needs a later version.

So it's time to give Toplink one final heave-ho. My reasons for sticking with it have now been outweighed by my need for compile-time weaving that works and a project where problems are likely to be fixed within my lifetime. It's time for Toplink and me to start seeing other people.

New releases of Tally-Ho will be using OpenJPA as the persistence provider... just as soon as I get all the unit tests passing.

Labels: JPA, open JPA, open source, openjpa, software, tally-ho, toplink, toplink essentials, weaving

# posted by Nick : 7:45 AM 1 Comments

Saturday, November 03, 2007

How to do Static Weaving with Toplink Essentials from Maven 2

I've always wanted to make sure that the build process for Tally Ho is as seamless as possible, requiring minimal configuration by someone who downloads the monster. Currently, I do have one small step that people have to do- they must install the Toplink Essentials jar files in their local repository. This is necessary because of a licensing issue from Oracle apparently... you have to agree to a license before their jar will unpack. Fortunately, this is a fairly simple procedure, which I cover in README-MAVEN thusly:

toplink-essentials-V2_build_58: Get this from https://glassfish.dev.java.net/downloads/persistence/JavaPersistence.html. Download the jar, then run java -jar glassfish-persistence-installer*.jar, accept the license agreement, and the installer will create a glassfish-persistence directory. Change into that directory and run:

    mvn install:install-file -Dfile=toplink-essentials.jar \
        -DgroupId=toplink-essentials -DartifactId=toplink-essentials \
        -Dversion=V2_build_58 -Dpackaging=jar

So far in the project I haven't used static weaving, because it hasn't been all that vital. But an upcoming change in the way binary resources work requires it. (I need to be able to refer to a binary resource without loading all of that resource's data, since that could conceivably be a tremendous amount of data and a huge memory hog... this requires a lazy-loaded 1:1 relationship (Toplink Essentials does not support lazily loaded compositions)).

The Toplink folks do not provide a Maven 2 plugin for doing static weaving (and you must use static weaving if you're deploying to a J2SE servlet container due to classloader constraints on javaagent). They do, however, provide an ant task for static weaving, and Maven 2 can run ant tasks.

It took some time to figure out how to get the Maven dependency classpath into the Ant task so that the Ant task could find the class for static weaving, but after some digging I found the answer.

First, we define a build.xml for the ant task, to keep from severely uglifying pom.xml:

<project name="Weaver" default="weaving" basedir=".">

 <description>
  Run the ant task for performing static weaving on model classes.  This
  is meant to be run from m2 with the compile_classpath variable set.
 </description>
 
 <target name="define.task" description="New task definition for toplink static weaving">  
  <taskdef name="weave" classname="oracle.toplink.essentials.weaving.StaticWeaveAntTask">
   <classpath>
    <path path="${compile_classpath}" />
   </classpath>
  </taskdef>  
 </target>
 
 <target name="weaving" description="perform weaving" depends="define.task">
  <echo>Performing static weaving on model classes</echo>
  
  <weave source="target/classes" target="target/classes" persistenceinfo="src/main/resources">
   <classpath>
    <path path="${compile_classpath}"/>
   </classpath>
  </weave>
 </target>
 
</project>

The Eclipse ant task editor will of course complain that the taskdef class cannot be found, but that's okay because we don't intend to run this with ant. We're going to run it from Maven2, using the ant task runner.

One nice thing about Maven 2 is that they've included a phase just for post-processing of classes, which is the ideal place to hook into the compilation process. We add this to our pom.xml inside the <build><plugins>:

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-antrun-plugin</artifactId>
        <executions>
          <execution>
            <id>process-classes</id>
            <phase>process-classes</phase>
            <configuration>
              <tasks>
                <echo>Beginning process-classes phase...</echo>
                <property name="compile_classpath" refid="maven.compile.classpath"/>
                <ant antfile="${basedir}/build.xml">
                  <target name="weaving"/>
                </ant>
              </tasks>
            </configuration>
            <goals>
              <goal>run</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

The <property> element in the antrun configuration creates an ant property called compile_classpath from Maven's maven.compile.classpath reference, which includes all of the compile-time dependencies declared in the pom.xml. In this case, since the pom already declares the Toplink Essentials jar as a dependency, the compile classpath will contain the class needed to run the ant taskdef.

There are, of course, still bugs with static weaving. The weaver still breaks if there is a space in the path to your classes and it still incorrectly weaves classes with lazy 1:1 relationships, failing to add some required methods for 1:1 fields that aren't lazy loaded. As these bugs haven't been touched since March of this year, I don't hold out a lot of hope for seeing them fixed any time soon.

The workaround for the first problem is to move your Eclipse workspace (or other working directory) to a path with no spaces in the name. The second can be worked around by using property access on lazy-loaded fields, although that is lame and stupid.

What kills me about that latter bug is that in the comments, the person the bug is assigned to describes exactly what needs to be done to fix the bug and where in the code to fix it. This means he had to have been looking around in the code to find it. And once he found it, rather than just fixing the damn problem, then saving and committing, he talked about how to fix it on the bug report instead. And indeed, one can still go look at the code and see how it's still broken to this day, when it could have been resolved 8 months ago for less effort than it took to write about it. It's literally a one-line fix. Maybe even half-a-line, if you want to get technical. Select some text, hit backspace, ctrl-S, run unit tests, commit.

Incidentally, it also turns out that the Toplink ant task does absolutely nothing at present. If you find that troubling, you can swap out the <weave> task for a kludge like this:

  <java classname="oracle.toplink.essentials.weaving.StaticWeave" >
   <classpath>
    <path path="${compile_classpath}"/>
   </classpath>
   <arg value="-persistenceinfo" />
   <arg value="src/main/resources" />
   <arg value="-loglevel" />
   <arg value="finest" />
   <arg value="${target_directory}" />
   <arg value="${target_directory}" />
  </java>

At least this way the ugliness is encapsulated in an ant build.xml file, and some day when the ant task gets fixed, you can return to using it easily.

Labels: ant, maven, software, toplink, weaving

# posted by Nick : 11:27 AM 1 Comments

Monday, October 29, 2007

The Most Convoluted Problem I Ever Solved

In the 17 or so years that I've been actively writing and debugging software, I have never come across a problem as convoluted as the one I finally resolved this evening. I literally fought with this problem for months before finally cracking it.

The background: I work on a project called Tally-Ho, a community management system that powers morons.org. It's a massive Java application using the Wicket framework which I run inside Tomcat 6.0.14.

One of the features of Tally Ho is a "story lead" queue where people can submit interesting news articles that they find around the Internet. There is a fairly simplistic URL validator on the URL field of the form which attempts to load the URL supplied by the user, making sure the host resolves, can be connected to, and doesn't return a 404.

The problem was that after a while, instead of hosts resolving, my log would fill up with UnknownHostExceptions. I could do an nslookup from the command line and see that these were legitimate hosts... some of them obviously so, like washingtonpost.com.

It looked like a classic negative caching problem at first, and at first it probably was, in part. In Java, address lookups, both positive and negative, get cached. My initial hypothesis was that some transient lookup failure was causing a negative lookup to be cached, and that the caching was lasting forever, despite my configuring a networkaddress.cache.negative.ttl value in $JAVA_HOME/jre/lib/security/java.security. This hypothesis seemed reasonable in part because I could see by snooping network traffic on port 53 that the JVM was not making any further DNS requests for the hosts in question. Also, restarting the JVM seemed to clear the problem every time, suggesting that once the host was in the nameserver's cache, everything was fine.

I began trying various things including using some old, deprecated Sun-specific caching options. That didn't work. I tried hacking at the InetAddress source code to completely remove its ability to cache. That seemed to work at first, but later the old behaviour somehow returned. Then I discovered using truss and ktrace that my JVM wasn't reading java.security at all, and -Djava.security.debug=properties didn't print anything. I rebuilt the JVM from the FreeBSD port after first removing the entire work directory and indicating that it should use the precompiled "Diablo" JVM to bootstrap the new JVM.

The rebuilt JVM seemed to read java.security, so I figured the problem was solved. Not so. It still happened after Tomcat ran for a while.

I wrote a simple command line tester which attempted a name lookup, waited for a keypress, and then tried the name lookup again. Then I'd restart named, firewall the name servers for that host, and run the test code. I could verify that it retried the host and did not negative-cache it forever when run from the command line. So something was different in what was happening inside Tomcat.
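
A tester along those lines can be as small as the sketch below. This is a reconstruction rather than the original program: the class name, the default host and the prompt text are assumptions. The pause in the middle is where you restart named or firewall the name servers before the second attempt.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LookupTwice {

    // Attempts one lookup and reports the outcome as text instead of throwing.
    static String tryLookup(String host) {
        try {
            return "resolved " + host + " -> " + InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "FAILED " + host + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "freebsd.org";
        System.out.println(tryLookup(host));
        System.out.println("Change the DNS situation now, then press Enter...");
        System.in.read(); // wait for a keypress before retrying
        System.out.println(tryLookup(host));
    }
}
```

If the second line of output differs from the first, the JVM really did go back to the resolver instead of answering from its cache.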

It was then that I noticed that Tomcat was running with its native libraries. I've seen strange things happen before whenever JNI was involved, so I poked around a bit and noticed with ldd that the native libraries had been built with gcc 4.2.1. Knowing that gcc 4.2.1 has serious problems, I rebuilt the native libraries and restarted Tomcat. I repeated the same steps I used in my command line test with my submission form via Tomcat, and saw that things seemed to be working now.

Hours went by, and the same damn exception flew up my log again. What the hell? It was then that I was breathing fire, my entire being converted into pure fury.

I decided to dive a level deeper, running ktrace against the running Tomcat process so I could see every system call it made. One red herring I dispensed with fairly quickly was that in the cases where hosts seemed to resolve properly, the JVM was reading /etc/resolv.conf, and in cases where they didn't it wasn't. But looking at the source code for ResolverConfigurationImpl, it was clear that this was probably due to its internal caching (hard coded to 5 minutes, mind you).

One thing in the kdump did catch my eye though:

 31013 jsvc     CALL  socket(0x2,0x2,0)
 31013 jsvc     RET   socket 1414/0x586
 31013 jsvc     CALL  close(0x586)

That file handle sure seems awfully big for a servlet container with a maximum of 256 simultaneous connections. Somewhere along this time I had also noticed that everything was fine when Tomcat had been recently restarted, but went bad after a while. I had also noticed at some point that caching seemed to have nothing to do with it. Once the failure mode had been entered, it didn't matter what the address was-- the resolver would throw an UnknownHostException for every host, immediately, without ever attempting a lookup to begin with.

So now I had a new hypothesis. That file handle number was awfully high. I was able to develop a test case that demonstrated that name resolution failed as soon as 1024 file handles were in use:


import java.net.*;
import java.util.*;
import java.io.*;

public class Test3 {
        public static void main(String[] args) throws Exception {

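                // Keep references to the streams so they are not collected and closed
                ArrayList<FileInputStream> files = new ArrayList<FileInputStream>(1024);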

                System.out.println("Opening lots of files");
                for(int i=0; i < 1024; i++) {
                        files.add(new FileInputStream("/dev/null"));
                }

                System.out.println("Trying to resolve freebsd.org");
                InetAddress.getByName("freebsd.org");   // throws exception!

        }
}

(It's actually only necessary to open 1020 files; stdin, stdout and stderr bump that number up to 1023, and on the 1024th file handle, it breaks).

My friend Alfred recalled that this is a FreeBSD libc bug, which has been corrected since my fairly ancient compilation of FreeBSD 6.2. At some time in the distant past, some library calls would refuse to cooperate with file handles > 1023 because they couldn't be used with select(2). My test case runs to completion on his host, but always fails with an UnknownHostException on my host. (On Linux it dies and complains about 'too many open files'. Teehee.)

So why was Tomcat leaking all these file descriptors? My first suspicion was the NioConnector, since it's new and known to be a bit buggy. I reconfigured Tomcat to use the older HTTP/1.1 connector. I waited a while, and ran ktrace on the process. No good, it was still using hundreds more file descriptors than it should have.

I decided to run fstat on the Tomcat process, and saw that it wasn't leaked sockets at all, but leaked file descriptors. Fstat, despite what the manual page claims about it showing open files, only shows the inode numbers of open files. (What dork thought that would be useful?) I downloaded and compiled lsof, which actually does list the files being held open by a process.

It was then that I saw the real root of all of this trouble: the directory structure used by Lucene, the search engine used by Tally Ho, was not being closed. Apparently each time the article search form was used, it was leaving a number of files opened. This was easy to fix by correcting a try/catch/finally block in the Java code to ensure that the Directory and IndexReader objects were always closed after use.
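
The shape of that fix, as a minimal sketch: acquire inside the try, release in the finally, so the handles are given back whether the search succeeds or throws. TrackedResource is a hypothetical stand-in that merely counts open handles; it is not the Lucene API, and the "search" here does nothing useful.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

public class CloseAfterSearch {

    // Counts resources that are currently open, standing in for file descriptors.
    static final AtomicInteger openHandles = new AtomicInteger();

    /** Hypothetical stand-in for something like a Directory or IndexReader. */
    static class TrackedResource implements Closeable {
        TrackedResource() {
            openHandles.incrementAndGet();
        }

        @Override
        public void close() {
            openHandles.decrementAndGet();
        }
    }

    /** Pretend search: fails on an empty query, otherwise returns a fake hit count. */
    static int search(String query) {
        TrackedResource directory = null;
        TrackedResource reader = null;
        try {
            directory = new TrackedResource();
            reader = new TrackedResource();
            if (query.isEmpty()) {
                throw new IllegalStateException("empty query");
            }
            return query.length();
        } finally {
            // Runs on both the normal and the exceptional path.
            if (reader != null) {
                reader.close();
            }
            if (directory != null) {
                directory.close();
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2000; i++) {
            try {
                search(i % 2 == 0 ? "wicket" : "");
            } catch (IllegalStateException expected) {
                // half of the searches fail on purpose
            }
        }
        System.out.println("handles still open: " + openHandles.get());
    }
}
```

Moving the two close() calls out of the finally block and into the happy path is enough to make the counter climb with every failed search, which is the kind of slow leak that took months to surface here.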

So to make a long story short, because the Directory object in the search engine wasn't getting closed, the application was overusing file handles, which was tickling a bug in FreeBSD that prevented socket writes from working correctly, which prevented hostnames from resolving, only after Tomcat had been running for a sufficient time to exhaust 1023 file handles, and this was after correcting a problem with a JVM that didn't read the java.security network address cache settings and a native library that was compiled with a bad version of gcc.

Holy f-ing crap.

The key lessons to learn from all of this are lessons for any debugging experience:

1. Develop a hypothesis, but don't get attached to it. (Your initial hypothesis may be wrong.) Revise your hypothesis as you get closer to the answer.
2. Eliminate unnecessary variables. (Like getting rid of native libraries.)
3. Check and recheck your assumptions. (Is java.security even getting read?)
4. Eliminate red herrings. (Reading resolv.conf has nothing to do with it.)
5. Collect more information any way you can. (ktrace, debugging statements inserted into the API, etc.)
6. Compare and contrast the information about what happens when things go right with what happens when things go wrong. What's different? What's the same? (What is up with those huge numbered file handles?)
7. Devise the simplest possible test case or isolate the code path that always replicates the problem.
8. Investigate what happened that got you to that code path.

And maybe the all-important last step: tell other people about your experience on your blog, so others can benefit from your nightmare.

Labels: convolution, dns, FreeBSD, java, leak, Lucene, sockets, software, unknownhostexception

# posted by Nick : 10:06 PM 6 Comments

Tuesday, October 16, 2007

Tally-Ho Turns 1

Today marks one year since I did the first commits on Tally Ho, the software behemoth that runs morons.org. Well, at least part of it.

I guess I didn't figure on it taking well over a year to port the whole site over to a new architecture, one that would scale nicely and could be maintained without agony. Life has this tendency to get complicated.

When I first started thinking about doing the massive morons.org refactoring (before it even had a name), I didn't even have a boyfriend. I hadn't yet started taking classes in the Hendrickson Method of Orthopedic Massage. I certainly hadn't decided to aim myself in the direction of Chiropractic School and to begin walking.

Yeah, life gets complicated. And as we get older, our priorities change too. I think large, open-source projects may be better suited for kids in their 20's, who still have lots of energy, free time, and don't get laid. What a perfect combination for free software development!

Now that I'm on in years, I want other things out of life. I want to travel, to see new things. I want to climb Mount Shasta. I want to maintain a healthy relationship. I want to always be learning things, especially things unrelated to computer science, so I'm always challenged. I want to spend time in the gym, working on my strength, cardiovascular fitness and endurance. I want to watch many more years of Doctor Who. I want to try to compose some music, even if I don't share it with anyone.

Now don't take this to mean that Tally-Ho is dead or won't be completed. I just committed some code yesterday to bring back the Partners system. Instead, take this to mean that Tally Ho is taking longer than expected, because life gets complicated.

Happy first birthday, Tally Ho! Maybe we'll get to version 1.0 within another year!

Labels: life, software, tally-ho

# posted by Nick : 9:23 PM 0 Comments

Monday, September 17, 2007

Why Should a Heartbeat be Regular?

Consider a common polling loop:


while(true) {
    runSomeQuery();
    doSomeStuff();
    sleepBeforeNextIteration();
}

I've seen and used this pattern probably a thousand times, without really thinking about it. It just stood to reason that the sleeping bit should be some constant number of seconds, or maybe in a fancy case, a computed number of seconds taking into account the time that was spent running a query and doing stuff.

But why should a heartbeat be so regular? It may be worth taking some time to consider the characteristics of doSomeStuff(), especially in the case that doSomeStuff() might be altering the results of runSomeQuery().

I recently had a situation where this was true. The stuff-doing was creating a new row to be returned by the query. Yet even knowing that I had just placed something into my queue, it was going to still be some regular sleep duration before I queried again.

As it happens, where this loop exists I don't necessarily get to know whether I will be producing new rows or not or whether new rows have been produced (without going into the details, it's a highly decoupled architecture). But a fairly simplistic optimization can improve performance significantly without increasing the average number of queries per minute: instead of having one, regular sleep duration, I introduced two durations.

To see how this works, consider a job that creates a second job, polled by a regular heartbeat every 3 seconds:

time (s)	event
0	All is quiet
1	Job 1 is inserted
3	heartbeat Job 1 is pulled from the queue
3.01	Job 1 executes and places Job 2 in the queue
6	heartbeat Job 2 is pulled from the queue and executes

Now consider what happens with a rhythmic, yet non-regular heartbeat that alternates between a 4-second sleep and a 2-second sleep:

time (s)	event
0	All is quiet
1	Job 1 is inserted
4	heartbeat Job 1 is pulled from the queue
4.01	Job 1 executes and places Job 2 in the queue
6	heartbeat Job 2 is pulled from the queue and executes

As you can see, in this case, both methods perform exactly the same way. But consider what happens in the case where the job happens to get inserted just before the first query runs. With the regular heartbeat:

time (s)	event
0	All is quiet
3	Job 1 is inserted
3	heartbeat Job 1 is pulled from the queue
3.01	Job 1 executes and places Job 2 in the queue
6	heartbeat Job 2 is pulled from the queue and executes

Now consider what happens with a rhythmic, yet non-regular heartbeat:

time (s)	event
0	All is quiet
4	Job 1 is inserted
4	heartbeat Job 1 is pulled from the queue
4.01	Job 1 executes and places Job 2 in the queue
6	heartbeat Job 2 is pulled from the queue and executes

In the lucky case, a full second is chopped off the time it takes to get from insertion of the first job to execution of the second job.

Then there's the most unlucky case for the non-regular heartbeat:

time (s)	event
4	All is quiet
4.1	Job 1 is inserted
6	heartbeat Job 1 is pulled from the queue
6.01	Job 1 executes and places Job 2 in the queue
10	heartbeat Job 2 is pulled from the queue and executes

In this case, the first job happens to hit right after the query that followed the long sleep. But it still gets to executing the second job within 6 seconds. The regular heartbeat is guaranteed to complete within 6 seconds, but could go as quickly as just over 3 seconds in its most lucky case.

By alternating between a long sleep and a short sleep, we can shave off 1 second in the most lucky case, but in the least lucky case, we're no worse off than we would be with a regular heartbeat. In the real world, jobs will come in at any random time, so sometimes you'll get lucky and sometimes you won't... but when you do get lucky, the payout is considerable. This is especially true given that the cost is effectively nothing: the same number of queries will be performed over a minute (or indeed over 6 seconds in this case) and the cost of alternating the sleep times is negligible.
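A minimal sketch of the alternating sleep, assuming the 4-second and 2-second durations used in the tables above. The class name AlternatingHeartbeat and the pollOnce callback are invented for illustration; they stand in for the runSomeQuery() and doSomeStuff() calls in the loop at the top of this post.

```java
public class AlternatingHeartbeat {
    private final long longSleepMillis;
    private final long shortSleepMillis;
    private boolean longNext = true;

    public AlternatingHeartbeat(long longSleepMillis, long shortSleepMillis) {
        this.longSleepMillis = longSleepMillis;
        this.shortSleepMillis = shortSleepMillis;
    }

    /** Returns the next sleep duration: long, short, long, short, ... */
    public long nextSleepMillis() {
        long sleep = longNext ? longSleepMillis : shortSleepMillis;
        longNext = !longNext;
        return sleep;
    }

    /** Polls until the current thread is interrupted. */
    public void run(Runnable pollOnce) {
        while (!Thread.currentThread().isInterrupted()) {
            pollOnce.run(); // query the queue and process whatever came back
            try {
                Thread.sleep(nextSleepMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                return;
            }
        }
    }
}
```

With new AlternatingHeartbeat(4000, 2000), the average sleep is still 3 seconds, so the number of queries per minute matches a regular 3-second heartbeat.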

Remember, of course, that this won't help all situations. It only applied in my case because processing something returned by my query could itself cause the next run of the query to return something new.

There might be additional tweaks that could be made to this general idea; for example, speeding up the heartbeat when rows are found by the query, then slowing it back down when they aren't, perhaps bounded by some rules to ensure that over a given time interval, some maximum number of queries are performed.
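One way that tweak might look, as a rough sketch only: halve the sleep whenever the query finds rows and double it whenever it doesn't, clamped between a minimum and a maximum. The class name AdaptiveHeartbeat and the halving/doubling rule are assumptions made for illustration, not something I have measured. The minimum sleep is what bounds the number of queries over a given interval.

```java
public class AdaptiveHeartbeat {
    private final long minSleepMillis;
    private final long maxSleepMillis;
    private long currentSleepMillis;

    public AdaptiveHeartbeat(long minSleepMillis, long maxSleepMillis) {
        this.minSleepMillis = minSleepMillis;
        this.maxSleepMillis = maxSleepMillis;
        this.currentSleepMillis = maxSleepMillis; // start out slow
    }

    /**
     * Speeds up (halves the sleep) when the last query found rows and
     * slows down (doubles the sleep) when it did not, within the bounds.
     */
    public long nextSleepMillis(boolean foundRows) {
        if (foundRows) {
            currentSleepMillis = Math.max(minSleepMillis, currentSleepMillis / 2);
        } else {
            currentSleepMillis = Math.min(maxSleepMillis, currentSleepMillis * 2);
        }
        return currentSleepMillis;
    }
}
```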

The moral of the story is to know what your workload tends to look like and make reasonable and simple trade-offs that might improve your performance.

Labels: heartbeat, polling, queue, queueing, software

# posted by Nick : 6:24 PM 0 Comments

Tuesday, August 14, 2007

Keeping Simple Things Simple

You can always tell an overengineered API when you try to do something really simple and it takes you about 30 lines of code and 8 different classes.

For example: I want to read an image, make it smaller, and write it out as a JPG with a given quality. This is a fairly common and simple task. Here's the code:


        BufferedImage sourceImage;
        try {
            sourceImage = ImageIO.read(new ByteArrayInputStream(imageData));
        } catch (IOException e) {
            return new ServiceResult(ServiceResult.STATUS.FAIL_HARD,
                    "Unable to read image; image is unreadable or an unsupported type", e);
        }

        // figure out the target width and height
        
        BufferedImage newImage =  getScaledInstance(sourceImage, targetWidth, targetHeight, RenderingHints.VALUE_INTERPOLATION_BILINEAR, true);

        ByteArrayOutputStream output = new ByteArrayOutputStream();

        Iterator<ImageWriter> imageWritersByMIMEType = ImageIO.getImageWritersByMIMEType("image/jpeg");
        if (imageWritersByMIMEType.hasNext()) {
            ImageWriter writer = imageWritersByMIMEType.next();
            writer.setOutput(new MemoryCacheImageOutputStream(output));
            ImageWriteParam iwp = writer.getDefaultWriteParam();
            iwp.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            iwp.setCompressionQuality(COMPRESSION_QUALITY);
            
            
            IIOImage tmpImage = new IIOImage(newImage, null, null);
            try {
                writer.write(null, tmpImage, iwp);
            } catch (IOException e) {
                return new ServiceResult(ServiceResult.STATUS.FAIL_HARD, "Exception while creating jpeg content", e);
            }
        } else {
            return new ServiceResult(ServiceResult.STATUS.FAIL_HARD, "Couldn't find a jpeg encoder!");
        }

        return new ServiceResult(ServiceResult.STATUS.OK, "Created avatar successfully", output.toByteArray());

So we had to use: BufferedImage, ImageIO, ByteArrayInputStream, ByteArrayOutputStream, Graphics2D (inside the getScaledInstance method), ImageWriter, ImageWriteParam, MemoryCacheImageOutputStream and IIOImage. (Not counting any java.lang, java.util or exception stuff.)

Why can't ImageIO read a byte array? (I can almost forgive the design decision to work mainly with streams.) Why does an ImageWriter need to write to a MemoryCacheImageOutputStream (an ordinary OutputStream won't do)? What's with ImageWriter#setOutput taking an Object? Do we really expect to have multiple image writers for a given MIME type such that we need an iterator? Why can't an ImageWriter write a BufferedImage? Why is there a getDefaultWriteParam, if that's the only WriteParam there is to get?

javax.imageio has got to be the absolutely most retarded API I have seen to date, with the closest runner-up being jTidy.

Labels: api, javax.imageio, overengineering, software, stupid

# posted by Nick : 7:20 PM 0 Comments

Toplink Query Deficiencies

In any relational setting, it is wise to avoid this situation if you can:

TABLE_ONE
---------
object_id serial primary key not null
foo_id integer not null references TABLE_TWO(object_id)
bar_id integer not null references TABLE_TWO(object_id)

TABLE_TWO
---------
object_id serial primary key not null
some_field varchar(80) not null

There is no way to perform a single join from table_one to table_two to get a complete set of information, because of the references to multiple different rows in table_two. It's probably better to reorganize the relationship if you can.

The exception is if you don't necessarily need both fields. For example, if you can do without knowing anything about bar_id in most cases, you could always lazy load that field. That is, if lazy loading for 1:1 fields works in your JPA provider.

It still doesn't work with Toplink, and these bugs are STILL open:

https://glassfish.dev.java.net/issues/show_bug.cgi?id=2546
https://glassfish.dev.java.net/issues/show_bug.cgi?id=2554

The consequence is that I cannot static weave my model, so my only choice for now is to mark the fields as @Transient and ignore them. (I have a feature that tracks the changes made to every entry in a table with a changer that references an Account... for now, I just won't track the changer until I have time to come up with a better way of doing it or the glassfish folks fix their shit.)

Labels: glassfish, JPA, software, toplink

# posted by Nick : 7:09 PM 0 Comments

Wednesday, August 01, 2007

Another Dumbshit Toplink Error Message

When Toplink says "Trying to get value for instance variable [foo] of type [bar] from the object The specified object is not an instance of the class or interface declaring the underlying field" what it means is you specified the wrong mappedBy class in your many:many relationship, either explicitly or by inference from your generic type.

But it's much, much clearer to give their error message, isn't it.

Labels: software, toplink

# posted by Nick : 5:31 PM 1 Comments

Wednesday, July 04, 2007

Lucene Support in Tally-Ho

I've used Lucene for a really long time now. Practically since its Java-based inception, morons.org has searched articles using Lucene. My initial effort to add Lucene support to Tally-Ho failed miserably due to some unexpected behaviour from Lucene that I resolved by zeroing in on the problem with more unit tests.

Since all Article manipulation for the site goes through the ArticleService, that makes it a very handy place to automatically index Articles as they are created and updated and as they go through the lifecycle (from submitted, to approved, accepted, and so-on).

Adding an article to the index is trivial. We create a Directory object that points to where we'd like Lucene to write its files, create an IndexWriter to write them there, create several Field objects to represent the names of fields and their contents that we'd like to search on, add those fields to a Document, and add the Document to the IndexWriter. We can then query on any combination of these fields. Great.

My problem came when it became necessary to *change* an article. Lucene does this via the updateDocument method, or you can call deleteDocument and addDocument yourself. The advantage to updateDocument is that it's atomic. But for me, neither strategy worked at first.

First of all, even though Lucene said it was performing a delete based on a Term (which in our case contains the primary key of the Article), it didn't actually do it unless the Field referenced by the Term was stored as Field.Index.UN_TOKENIZED. If I stored it TOKENIZED, Lucene claimed to be deleting, but the deleted Document would still show up in search queries.

Secondly, when I tried to delete a document, it looked like I could never add another document with the same fields ever again.

The first case turned out to be caused by using the StopAnalyzer to tokenize the input. When you index a term as UN_TOKENIZED, Lucene skips the Analyzer when storing the term to the index. The StopAnalyzer tokenizes only sequences of letters. Numbers are ignored. This differs from the StandardAnalyzer, which also uses stop words but tokenizes numbers as well as letters. Since we delete based on the id term, which is numeric, Lucene was never finding the document it was supposed to delete, as the term had been tokenized into nothing by the StopAnalyzer... so the old document was not found and consequently not deleted.

The second case turned out to be caused by a fault in my unit test. I was doing an assertion that an updated article was in the database by doing a search on its id field. But I didn't assert that it was there before the update by searching the same way. For this reason it appeared that the article disappeared from the database and stayed away, because other unit tests worked (but those other tests also searched on other terms). Once I realized that the search on id was always failing, everything began to fall into place. Note also that you specify a tokenizer on search queries as well, so even when I stored the id term as UN_TOKENIZED, the StopAnalyzer applied to the query would effectively eliminate the value of the search term (such that it could only ever find documents that had an empty id).
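To see why a numeric id vanishes, here is a simplified stand-in for a letter-only tokenizer. It is not Lucene's code, and the class name LetterOnlyTokenizerDemo is invented for illustration. It only mimics the behaviour described above (lowercased runs of letters, everything else discarded) and leaves out stop-word removal entirely.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterOnlyTokenizerDemo {

    /** Splits text into lowercased runs of letters; digits and punctuation are dropped. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // a non-letter ends the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Article 42 rocks")); // prints [article, rocks]
        System.out.println(tokenize("12345"));            // prints []
    }
}
```

A numeric id produces no tokens at all, which is the situation on both the indexing side and the query side: there is nothing left to match on.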

Lucene 2.2 has a great feature that lets you find documents in its index that are similar to a given document. The given document doesn't even need to be in the index, but it's very easy to do if it is. Since Tally-Ho automatically includes articles in the index as soon as they are created, this case applies. The code is very simple:


    directory = FSDirectory.getDirectory(indexDirectory);
    reader = IndexReader.open(directory);
    searcher = new IndexSearcher(directory);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[]{"combined"});
            
    TermQuery findArticle = new TermQuery(new Term("id", String.valueOf(id)));
           
    Hits hits  = searcher.search(findArticle);
    int luceneDocumentId = hits.id(0);

    org.apache.lucene.search.Query query = mlt.like(luceneDocumentId);      
    hits = searcher.search(query);

I probably should be checking the first call to searcher.search to make sure the article for comparison is found (it should always be found, but sometimes strange things happen).

Labels: Lucene, software

# posted by Nick : 1:02 AM 0 Comments

Thursday, April 26, 2007

Missing Logging in Wicket

Last night, the first early release of Tally-Ho hit morons.org. As one might expect, a few small problems turned up at the last minute, and most of these have been worked through. One of them was a strange Internal Error message, but there was no exception in my log file. It was getting late, and I was getting tired, so I fired off a message to the Wicket-Users list to see if anybody had advice.

The problem turned out to be that although my development container is Tomcat, which uses log4j for its logging and consequently configures a log4j root logger and appender, my deployment container is Resin Opensource, which does not.

The answer was to create a log4j.properties file in src/main/resources (so it is automatically included in the .war by Maven 2) with this in it:


log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

log4j.category.wicket=INFO
log4j.category.resource=INFO
log4j.category.wicket.protocol.http.RequestLogger=INFO
log4j.category.wicket.protocol.http.WicketServlet=INFO

Now my logging goes to stdout and is happily recorded by Resin.

Now if I could just get somewhere with WICKET-506.

Labels: log4j, resin, software, wicket

# posted by Nick : 7:27 AM 0 Comments

Saturday, April 21, 2007

Toplink Essentials: Buggier than a Roach Motel in Pensacola

Working with Toplink Essentials via JPAQL is quite a bit different than working with the commercial version of Toplink using its Expression class. With the commercial Toplink software, you generally get associated 1:1 objects fetched for you (ie eagerly rather than lazily) when you issue a query. In JPAQL, you get exactly what you ask for, which means if you want to get the associated objects in one query, you must use the JPAQL JOIN FETCH operator.

In my case, I needed LEFT JOIN FETCH, which works like an outer (left) join. My query ends up looking like this:

Select x from Article x LEFT JOIN FETCH x.messageBoardRoot where x.createDate > ?1 and not(x.status = ?2) order by x.createDate desc

Sometimes Articles won't have a message board associated with them, though usually they will. For example, there's no point in putting a message board on an article that is in a Pending state, since nobody can see it anyway.

Without the LEFT JOIN FETCH, Toplink issues one query to get the Articles, and then one query for every associated object. So if you're requesting 10 articles, you're going to get 11 queries. With the LEFT JOIN FETCH, it is supposed to consolidate everything into just enough queries to get what you ask for, and in fact the query it issues is reasonable:

SELECT t0.object_id, t0.thumbs_down, t0.spam_abuse, t0.MAILED, t0.change_summary, t0.VISIBLE, t0.ADJECTIVE, t0.BODY, t0.md5, t0.VIEWS, t0.fuzzy_md5_1, t0.VERSION, t0.fuzzy_md5_2, t0.thumbs_up, t0.create_date, t0.TITLE, t0.SUMMARY, t0.STATUS, t0.section, t0.changer, t0.creator, t1.object_id, t1.post_count, t1.last_post, t1.posting_permitted, t1.source_id, t1.post_count_24hr FROM ARTICLE t0 LEFT OUTER JOIN article_message_root t1 ON (t1.source_id = t0.object_id) WHERE ((t0.create_date > ?) AND NOT ((t0.STATUS = ?))) ORDER BY t0.create_date DESC
bind => [2007-04-14 14:46:15.593, P]

Unfortunately, Toplink's behaviour upon handling the results of running this query is NOT reasonable:


java.lang.NullPointerException
 at oracle.toplink.essentials.mappings.ForeignReferenceMapping.buildClone(ForeignReferenceMapping.java:122)
 at oracle.toplink.essentials.internal.descriptors.ObjectBuilder.populateAttributesForClone(ObjectBuilder.java:2136)
 at oracle.toplink.essentials.internal.sessions.UnitOfWorkImpl.populateAndRegisterObject(UnitOfWorkImpl.java:2836)

I've filed this one as https://glassfish.dev.java.net/issues/show_bug.cgi?id=2881. If past behaviour is any indication, the Glassfish people will change the priority on the bug to a P4 and decide not to fix it until we're all very old, despite it being a significant breakage of the API. They even pull that crap when the one-liner fix is already given in the bug report, and it would take longer to reset the priority and update the bug than it would to actually fix the damn problem.

Labels: glassfish, software, toplink, toplink essentials

# posted by Nick : 3:07 PM 1 Comments

How to make Eclipse, Tomcat, Maven 2 and Wicket play nice

On the off chance that other people find this helpful, here's how I set up Tally-Ho to work in Eclipse with the Sysdeo Tomcat plugin and Maven 2.

First, obviously, you need to install your prerequisites. Download and install Tomcat. Install the Sysdeo Tomcat plugin. You also want the Maven 2 Eclipse plugin. Installation of these is outside the scope of this post. It is also outside the scope of this post to explain Maven, Tomcat, Servlets and so-on. Use Google.

Next, bootstrap your project. I found it easiest to change into my Eclipse workspace directory, use mvn to create my archetype for my project, and then run mvn eclipse:eclipse inside the project directory it created. Then go to File | Import in Eclipse and import the project. Finally, enable the Maven 2 plugin for your imported project from the project's context menu, Maven 2 | Enable.

I found that the only way to make working with Maven bearable was to follow its default layout. This means that web.xml is going in src/main/webapp/WEB-INF and that all of the library dependencies are defined in pom.xml and all of the libraries will download into the Maven 2 Dependencies collection the first time you run mvn on the project.

Edit pom.xml and make sure you have your dependencies defined how you want them and that your project name and version are what you'd like.

Now is a good time to run a build of the project just to set up all of the remaining directories, like target. I did this by configuring an m2 build from the External Tools menu using my project's location as the Base directory with the goal "install".

Now set up the Tomcat plugin. From Window | Preferences | Tomcat, configure the appropriate Tomcat version and Tomcat home. From the context menu of your project, select Properties and then Tomcat. Check "is a Tomcat project." Set the context name to "/" and check "Can update context definition" and "Mark this context as reloadable." Set the subdirectory to "/target/your_project_name-your.project.version". The project name and version here must match what you've defined in pom.xml.

Now set up Eclipse to build directly to the Maven output directory. This allows you to avoid running a mvn build every time you make a change to a class or resource file. You will still need to run a mvn build if you add or change dependencies or if you change web.xml, however. From the project's properties context menu, choose Java Build Path and set the default output folder to target/your_project_name-your.project.version/WEB-INF/classes.

Lastly, from your project's context menu, choose Tomcat Project | Update Context Definition.

So to recap, here are the steps:

1. Download and install Maven 2, Eclipse, Tomcat, the Maven 2 plugin for Eclipse and the Tomcat plugin for Eclipse.
2. Create a project using Maven. Consult Maven's documentation for more detail or use an existing project that already has a pom.xml.
3. Run mvn eclipse:eclipse to generate Eclipse's metadata files from the project's pom.xml.
4. Import the project into your Eclipse workspace.
5. Edit pom.xml to define the version number and your dependencies.
6. Put web.xml in src/main/webapp/WEB-INF
7. Create a m2 external build and run it to create your target directory structure.
8. Configure the Tomcat plugin to look in Maven's target directory for your webapp's directory structure.
9. Configure Eclipse to build directly to Maven's target directory structure.

To make the setup play nicely with Wicket, you only need to define Wicket as a dependency in pom.xml (step 5). This is a matter of adding:


    <dependency>
      <groupId>wicket</groupId>
      <artifactId>wicket</artifactId>
      <version>1.2.5</version>
    </dependency>

Labels: eclipse, maven, software, tomcat, wicket

# posted by Nick : 8:15 AM 4 Comments

Monday, April 09, 2007

Why is Maven Still Such a Horrific Pile of Garbage?

Maven is, hands-down, the absolute worst piece of crapware I have had the misfortune of using in the last 4 years. This collection incidentally includes all versions of Microsoft Internet Explorer, including IE7, which only crashed every time I started it for a week due to an incompatibility with the Google Toolbar. It's worse than Norton Antivirus. It's worse than Microsoft Outlook. It is a complete waste of bits.

The terrible, tragic thing about Maven is that there's a kernel of a really good idea behind it. Building stuff, handling dependencies, running tests, producing reports. Great! Fantastic! If only it weren't to software development what Mr Garrison's "It" was to transit.

First, those who get excited about XML configuration need to die in a fire. A sewage fire. You know what? XML blows. The XML fad is over. Stop using XML for all kinds of garbage that it was never intended for. What the hell is wrong with you? People do not like writing this crap, and they like reading it even less. I don't give a damn that it makes your crapware XML/Object mapping tool spit out nice little objects that are easy for YOU to deal with when handling configuration. It's not about YOU if you want people to use your diarrhea soup.

Next, why does everything in this obtuse XML configuration HELL have to be

nested

and nested

and nested

and nested

and nested?

Seriously, if I need to get a file included in my output, why does it have to be in a structure 4 levels deep? And why do some of the bottom-layer elements allow file globbing? Don't you realize that if you can handle file globbing, you could just use ONE DAMN TAG ONE LAYER DEEP and be done with it? Die!

Want to see the results of your unit tests? Go look in a bunch of individual files! Because the build can only scream FAILURE!!! at you (just like that) and doesn't bother to tell you which assertion failed at which line in which class and method.

What a horrible pile of dung. Maven has been around for well over 4 years and in that time the only thing that appears to have improved is its startup time.

I don't know why anyone puts up with this crap.

Labels: crapware, dumbassware, maven, maven 2.0.6, shitware, software, turdware

# posted by Nick : 9:22 PM 6 Comments

Tuesday, April 03, 2007

Making MD5 Fuzzy, Redux

In my previous post, Making MD5 Fuzzy, I noted one problem: an off-by-one shift could completely change the outcome, such that certain small changes in the right places would cause the fuzzy md5 to no longer match up.

I struggled with the solution to this for quite a while, and then it dawned on me: I was looking at the problem the wrong way. It's fine if an off-by-one changes the outcome, if we're prepared to handle it.

The answer is to produce two checksums, not one! In the first, we begin at the beginning, and skip the last n/2 words for an averaging length of n. In the second, we begin n/2 words from the beginning and work all the way to the end.

Then instead of comparing one sum to another sum, we perform four comparisons:


object1.sum1 == object2.sum1
object1.sum2 == object2.sum1
object1.sum1 == object2.sum2
object1.sum2 == object2.sum2

If any of these statements returns true, we consider the objects to be "similar".
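The four comparisons can be sketched as a small helper. This assumes each object's sums arrive as the two-element String array that getSums returns in the listing further down; the class name FuzzySumComparison and the method isSimilar are made up for illustration. Note that in Java the strings need equals() rather than ==.

```java
public class FuzzySumComparison {

    /**
     * Returns true if any sum of the first object equals any sum of the
     * second, i.e. the four pairwise comparisons listed above.
     */
    public static boolean isSimilar(String[] sums1, String[] sums2) {
        for (String a : sums1) {
            for (String b : sums2) {
                if (a.equals(b)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```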

Here's the code. I've also simplified the way the distance between words is calculated and left room for non-English words to be handled at some point in the future (ie, there's no longer any special significance given to vowels).


package net.spatula.tally_ho.utils;


public class FuzzySum {
    
    private static final int SLOP = 3;

    private static FuzzySum instance;
    
    private static final int SAMPLE_SIZE = 10;
    
    private FuzzySum() {
        
    }
    
    public static synchronized FuzzySum getInstance() {
        if (instance == null) {
            instance = new FuzzySum();
        }
        return instance;
    }
    
    public String[] getSums(String text) {
        text = TextUtils.stripTags(text).toLowerCase().replaceAll("[^\\w\\s]", "").trim();
        
        if (text.length() < SAMPLE_SIZE * 1.5) {
            String md5 = TextUtils.md5(text);
            return new String[] { md5, md5 };
        }

        String[] words = text.split("(?s)\\s+");
        
        String md5_1 = calculateFuzzyMd5(words, 0, words.length - 1 - (SAMPLE_SIZE / 2));
        String md5_2 = calculateFuzzyMd5(words, SAMPLE_SIZE / 2, words.length - 1);
        
        return new String[] {md5_1, md5_2};
    }
    
    private String calculateFuzzyMd5(String[] input, int startIndex, int endIndex) {
        StringBuilder builder = new StringBuilder();
         
        int distanceSum = 0;
        for (int i = startIndex + 1; i<= endIndex; i++) {
            String thisWord = input[i];
            String lastWord = input[i - 1];
            
            distanceSum += calculateDistance(thisWord, lastWord);
            if (i % SAMPLE_SIZE == 0) {
                if (builder.length() > 0) {
                    builder.append("\n");
                }
                builder.append(distanceSum / SAMPLE_SIZE);
                distanceSum = 0;
            }
        }
        
        if (distanceSum != 0) {
            builder.append("\n");
            builder.append(distanceSum / (endIndex + 1 - startIndex % SAMPLE_SIZE));
        }
        
        return TextUtils.md5(builder.toString());
    }

    private int calculateDistance(String word1, String word2){
        int word1Sum = calculateWordSum(word1);
        int word2Sum = calculateWordSum(word2);
        return Math.abs(word1Sum - word2Sum) / SLOP;
    }
    
    private int calculateWordSum(String word) {
        
        if (word.length() == 1) {
            return (int)(word.charAt(0)) & 0xffff;
        }
        
        int wordSum = 0;
        for (int i = 1; i < word.length(); i++) {
            int prevChar = (int)(word.charAt(i-1)) & 0xffff;
            int thisChar = (int)(word.charAt(i)) & 0xffff;
            wordSum += Math.abs(thisChar - prevChar);
        }
        
        return SLOP * wordSum / word.length();
    }
    
}

As you can see, this code has been committed as part of the Tally-Ho project, https://tally-ho.dev.java.net/

Labels: checksum, fuzzy, java, md5, software, tally-ho

# posted by Nick : 8:08 PM 0 Comments

FreeBSD Network Performance Tuning

I've been tweaking the network stack on my FreeBSD host for many moons now, trying to get everything "just right" for optimal network performance. Many of the defaults are a bit pessimistic, assuming a network that experiences a good deal of packet loss and transmits data over a twisted pair of doorbell wire from a PDP-11 in the damp basement of some godforsaken computer lab to a VAX machine surrounded by nerds in a Physics building 2500 miles away. Sure, that may have been a common scenario back in 1982 or whatever, but these days most networks are much more reliable, delivering far more porn at faster rates than ever before.

My tuning is focused mainly on high-performance web serving on a host that also makes connections via localhost for database access and to front-end Resin OpenSource (a Java Servlet container) with Apache. The host has plenty of RAM and CPU available. These tunings may not be appropriate for all situations, so use your head.

First, enable polling on your interface. While you're at it, compile in zero copy sockets and the http accept filter. In fact, just add this crap to your kernel config if it isn't already there:


options         HZ=1000
options         DEVICE_POLLING
options         ACCEPT_FILTER_HTTP
options         ZERO_COPY_SOCKETS

To make sure your device actually polls, edit /etc/rc.conf and add "polling" at the end of ifconfig_{yourInterface}; eg:


ifconfig_bge0="inet 192.168.1.234 netmask 255.255.255.0 polling"

You probably also will want to tune polling a bit with sysctl:


kern.polling.burst_max=1000
kern.polling.idle_poll=0
kern.polling.each_burst=50

Idle poll tends to keep your CPU busy 100% of the time. For best results, keep kern.polling.each_burst <= the value of net.inet.ip.intr_queue_maxlen, normally 50.

Now sit down and think about what bandwidth and latency you want to plan for. This kinda depends a bit on who typically accesses your host. Are they coming from broadband connections mainly? About how far away are they usually? You can get some assistance with this determination by doing a sysctl net.inet.tcp.hostcache.list. Starting in FreeBSD 5.3, hostcache began keeping track of the usual RTT and Bandwidth available for all of the IP addresses it heard from in the last hour (to a limit of course, which is tuneable... more on that later).

We would be interested in the RTT and BANDWIDTH columns, if the number in the BANDWIDTH column had any bearing on reality whatsoever. Since my hostcache routinely suggests that there's more bandwidth available to a remote host than is actually possible given my machine's uplink, it isn't really reasonable to use this number. You can, however, average the RTT to get a rough idea of the average RTT to the current set of users in your hostcache. You can also get a rough idea of the average TCP congestion window size (CWND). Note that this will be bounded by what you have set for net.inet.tcp.sendspace and net.inet.tcp.recvspace. To make sure you're not the bottleneck, you could try setting these two to an unreasonably high number, like 373760, for an hour to collect the data. You can do a sysctl -w net.inet.tcp.hostcache.purge=1 to clear the old hostcache data if you decide to do this.

Here's a dumb little perl script for calculating your average and median RTT, CWND and Max CWND:


open(IN, "/sbin/sysctl net.inet.tcp.hostcache.list |");

while (<IN>) {
    @columns = split(/\s+/, $_);
    next if ($columns[0] eq '127.0.0.1');
    next if ($columns[0] eq 'IP');

    next if ($columns[9] < 2 || $columns[10] < 2);  # skip if few hits and few updates

    push(@rtts, int($columns[3]));
    push(@cwnds, $columns[6]);

    $rttSum += int($columns[3]);
    $cwndSum += $columns[6];
    $cwndMax = $columns[6] if $columns[6] > $cwndMax;

    $entries++;
}

print "Average RTT = " . int($rttSum / $entries) . "\n";
print "Average CWND = " . int($cwndSum / $entries) . "\n";
print "Max CWND = $cwndMax \n";

@rtts = sort { $a <=> $b } @rtts;
@cwnds = sort { $a <=> $b } @cwnds;

print "Median RTT = " . getMedian(@rtts) . "\n";
print "Median CWND = " . getMedian(@cwnds) . "\n";

sub getMedian {
    my @list = @_;
    if (@list % 2 == 1) {
       return $list[@list / 2];
    }  else {
       return ($list[@list / 2 - 1] + $list [@list / 2]) / 2;
    }
}

It's up to you how to use the information the script provides. For me, the most interesting thing to note is that my median RTT is around 100ms and that my max CWND looks to be 122640, at least for the hosts currently in my host cache.

I want to optimize my site for the best possible experience for high speed broadband users. My home broadband connection is 8Mbps, but it can burst up to 12Mbps for a short time. If we split the difference, that's 10Mbps. This is probably a bit optimistic for most home broadband users. Also note that there's no point in optimizing for more bandwidth than your host actually HAS. In my case, my uplink is 10Mbps, so there's no point in trying to optimize for a 45Mbps connection.

In all probability I won't be able to actually push 10Mbps because I share that connection with some other folks. So let's be just a little bit pessimistic and optimize for 6Mbps. Many home cable services provide between 4 and 8 Mbps downstream, so 6Mbps is a nice "middle of the road" approximation.

To calculate the bandwidth delay product, we take the speed in kbps and multiply it by the latency in ms; the two factors of 1000 cancel, so the result is in bits. In this case, that is 6144 * 100 or 614400 bits. To get the number of bytes for a congestion window that many bits wide, divide by 8. This gives us 76800, the number of bytes we can expect to send before receiving an acknowledgment for the first packet. That's higher than both the median and average congestion window sizes for the folks currently in my hostcache, and about 2/3 of the max. Remember this number.
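If you want to plug in your own numbers, the arithmetic is easy to script. This is only a rough sketch in plain sh; bdp_bytes is a name I made up for illustration, not part of any FreeBSD tool, and it assumes the speed is given in kbps and the RTT in ms as above:

```shell
#!/bin/sh
# Bandwidth-delay product in bytes.
# kbps * ms gives bits (the two factors of 1000 cancel); divide by 8 for bytes.
# bdp_bytes is a hypothetical helper name used only for this example.
bdp_bytes() {
    kbps=$1
    rtt_ms=$2
    echo $(( kbps * rtt_ms / 8 ))
}

# 6Mbps (6144 kbps) at a 100ms RTT
bdp_bytes 6144 100
```

With the figures used here (6144 and 100) it prints 76800.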

The next thing to look at is net.inet.tcp.mssdflt. This is the maximum segment size used when no better information is available. Normally this is set pessimistically low. These days, most networks are capable of moving packets of 1500 bytes, so let's set this to 1460 (1500 minus 40 bytes for the IP and TCP headers) with sysctl -w net.inet.tcp.mssdflt=1460. This could make the first few packets fail to transmit if MSS negotiation at the start of a TCP connection doesn't happen for some reason, or if a network cannot support a packet of that size. I suspect this is quite rare, and we're trying to optimize for the most common case, not the most pessimistic case.

Now we want to make sure that our congestion window size is an even multiple of the default MSS. In fact it isn't. 76800 / 1460 is 52.6027. We round up to the nearest even number - 54 - and multiply by the MSS to get 78840. (I'm not sure why, but many sites recommend that one use an even multiple of MSS.) I round up rather than down because I'm optimistic that I will not have lost that first packet in transit. Rounding down might mean stopping and waiting for the first acknowledgment rather than continuing with one (or two) more packets while awaiting that first reply.
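The rounding step can be sketched the same way. Again, this is just an illustration in plain sh; even_mss_window is a made-up helper name, and it simply rounds the segment count up to the next whole number and then up again if that number is odd:

```shell
#!/bin/sh
# Round a window size up to the next even multiple of the MSS.
# even_mss_window is a hypothetical helper name used only for this example.
# Prints the segment count followed by the resulting window size in bytes.
even_mss_window() {
    window=$1
    mss=$2
    segments=$(( (window + mss - 1) / mss ))    # ceiling division
    if [ $(( segments % 2 )) -eq 1 ]; then
        segments=$(( segments + 1 ))            # bump odd counts to the next even number
    fi
    echo "$segments $(( segments * mss ))"
}

even_mss_window 76800 1460
```

For 76800 bytes and an MSS of 1460 this prints "54 78840", matching the numbers above.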

Now that we have our desired window size, let's set it:


sysctl -w net.inet.tcp.recvspace=78840
sysctl -w net.inet.tcp.sendspace=78840

Since we're being optimistic, let's assume that the very first time we talk to our peer, we can completely fill up the window with data. Recall that we can fit 54 packets into 78840 bytes, so we can do this:


net.inet.tcp.slowstart_flightsize=54

Granted, immediately jamming the pipe with packets might be considered antisocial by cranky network administrators who don't like to see retransmissions in the event of an error, but more often than not, these packets will go through without error. I never minded being antisocial. If it really bothers you, cut this number in half. Note that having RFC3390 enabled (as it is by default) and functioning on a connection means that this value isn't used on new connections.

Next, turn on TCP delayed ACK and double the delayed ACK time. This makes it more likely that the first response packet will be able to have the first ACK piggybacked onto it, without overdoing the delay:


net.inet.tcp.delayed_ack=1
net.inet.tcp.delacktime=100

Now enable TCP inflight. The manual page recommends using an inflight.min of 6144:


net.inet.tcp.inflight.enable=1
net.inet.tcp.inflight.min=6144

Next, some tuning for the loopback. Hosts (like mine) that do a lot of connections to localhost may benefit from these. First I modify the ifconfig entry for lo0 to include "mtu 8232" (programs commonly use 8192-byte buffers for communicating across localhost; add 40 bytes for headers). Using a similar strategy to what we did above, I tune the following in sysctl.conf:


net.local.stream.sendspace=82320
net.local.stream.recvspace=82320
net.inet.tcp.local_slowstart_flightsize=10
net.inet.tcp.nolocaltimewait=1

The 10 is arbitrary, but it's also the smallest even multiple of the 8232-byte loopback MTU that makes the loopback window equal to or greater in size than the LAN interface window (82320 versus 78840). There might be some small advantage in doing this if there are programs which may copy the incoming request to some other program via the loopback.
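If your LAN window or loopback MTU differs from mine, a small loop finds the equivalent multiplier. This is a sketch under the same assumptions as before; smallest_even_multiple is a made-up helper name:

```shell
#!/bin/sh
# Find the smallest even multiplier n such that n * unit >= target.
# smallest_even_multiple is a hypothetical helper name used only for this example.
# Prints the multiplier followed by the resulting size in bytes.
smallest_even_multiple() {
    unit=$1
    target=$2
    n=2
    while [ $(( n * unit )) -lt "$target" ]; do
        n=$(( n + 2 ))
    done
    echo "$n $(( n * unit ))"
}

# loopback MTU of 8232 against the 78840-byte LAN window
smallest_even_multiple 8232 78840
```

With an MTU of 8232 and a LAN window of 78840 it prints "10 82320".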

Adding net.inet.tcp.nolocaltimewait frees up resources more quickly for connections on the loopback.

Finally, make the host cache last a bit longer:


net.inet.tcp.hostcache.expire=3900

The reason I do this is that some hosts may connect once an hour automatically. Increasing the time slightly increases the chances that such hosts would be able to take advantage of the hostcache. If you like, you can also increase the size of this hash to allow for more entries. I do this for the TCP TCB hash as well. These have to be changed in /boot/loader.conf as they can't be changed once the kernel is running:


net.inet.tcp.tcbhashsize="4096"
net.inet.tcp.hostcache.hashsize="1024"

So that's it. If these settings are applicable to you, you can just add this to /etc/sysctl.conf:


net.local.stream.sendspace=82320
net.local.stream.recvspace=82320
net.inet.tcp.local_slowstart_flightsize=10
net.inet.tcp.nolocaltimewait=1

net.inet.tcp.delayed_ack=1
net.inet.tcp.delacktime=100

net.inet.tcp.mssdflt=1460
net.inet.tcp.sendspace=78840
net.inet.tcp.recvspace=78840
net.inet.tcp.slowstart_flightsize=54

net.inet.tcp.inflight.enable=1
net.inet.tcp.inflight.min=6144

kern.polling.burst_max=1000
kern.polling.idle_poll=0
kern.polling.each_burst=50

net.inet.tcp.hostcache.expire=3900

And don't forget to edit /etc/rc.conf and add "mtu 8232" for your ifconfig_lo0 line and "polling" for your LAN adaptor.
