Saturday, April 28, 2007

my gross obesity


Thursday, April 26, 2007

Missing Logging in Wicket

Last night, the first early release of Tally-Ho hit morons.org. As one might expect, a few small problems turned up at the last minute, and most of these have been worked through. One of them was a strange Internal Error message, but there was no exception in my log file. It was getting late, and I was getting tired, so I fired off a message to the Wicket-Users list to see if anybody had advice.

The problem turned out to be that although my development container is Tomcat, which uses log4j for its logging and consequently configures a log4j root logger and appender, my deployment container is Resin Opensource, which does not.

The answer was to create a log4j.properties file in src/main/resources (so it is automatically included in the .war by Maven 2) with this in it:


log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

log4j.category.wicket=INFO
log4j.category.resource=INFO
log4j.category.wicket.protocol.http.RequestLogger=INFO
log4j.category.wicket.protocol.http.WicketServlet=INFO



Now my logging goes to stdout and is happily recorded by Resin.
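
For the record, nothing in the application code has to change for this. Anything that logs through log4j in the usual way now lands on stdout, where Resin can capture it. A contrived example (the page class here is made up):

import org.apache.log4j.Logger;

public class ArticleEditPage {

    private static final Logger LOG = Logger.getLogger(ArticleEditPage.class);

    public void onSave() {
        try {
            // ... something that might blow up ...
        } catch (RuntimeException e) {
            LOG.error("Saving the article failed", e); // now visible in Resin's stdout log
        }
    }
}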

Now if I could just get somewhere with WICKET-506.



Wednesday, April 25, 2007

LOL @ AFA

AngrySkul puts it all in perspective.



Saturday, April 21, 2007

Toplink Essentials: Buggier than a Roach Motel in Pensacola

Working with Toplink Essentials via JPAQL is quite a bit different from working with the commercial version of Toplink using its Expression class. With the commercial Toplink software, you generally get associated 1:1 objects fetched for you (i.e., eagerly rather than lazily) when you issue a query. In JPAQL, you get exactly what you ask for, which means that if you want the associated objects in one query, you must use the JPAQL JOIN FETCH operator.

In my case, I needed LEFT JOIN FETCH, which works like an outer (left) join. My query ends up looking like this:

Select x from Article x LEFT JOIN FETCH x.messageBoardRoot where x.createDate > ?1 and not(x.status = ?2) order by x.createDate desc

Sometimes Articles won't have a message board associated with them, though usually they will. For example, there's no point in putting a message board on an article that is in a Pending state, since nobody can see it anyway.
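
For what it's worth, here's roughly how that query gets issued through the JPA EntityManager. This is just a sketch, not the actual Tally-Ho DAO code; the method name, the date parameter and the status value are placeholders:

import java.util.Date;
import java.util.List;
import javax.persistence.EntityManager;

public class ArticleQueries {

    // Illustrative only: fetch recent Articles along with their message board
    // roots in a single query, using LEFT JOIN FETCH.
    @SuppressWarnings("unchecked")
    public List<Article> findRecent(EntityManager em, Date since, String excludedStatus) {
        return em.createQuery(
                "select x from Article x left join fetch x.messageBoardRoot "
                + "where x.createDate > ?1 and not(x.status = ?2) "
                + "order by x.createDate desc")
                .setParameter(1, since)
                .setParameter(2, excludedStatus)
                .getResultList();
    }
}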

Without the LEFT JOIN FETCH, Toplink issues one query to get the Articles, and then one query for every associated object. So if you're requesting 10 articles, you're going to get 11 queries. With the LEFT JOIN FETCH, it is supposed to consolidate everything into just enough queries to get what you ask for, and in fact the query it issues is reasonable:

SELECT t0.object_id, t0.thumbs_down, t0.spam_abuse, t0.MAILED, t0.change_summary, t0.VISIBLE, t0.ADJECTIVE, t0.BODY, t0.md5, t0.VIEWS, t0.fuzzy_md5_1, t0.VERSION, t0.fuzzy_md5_2, t0.thumbs_up, t0.create_date, t0.TITLE, t0.SUMMARY, t0.STATUS, t0.section, t0.changer, t0.creator, t1.object_id, t1.post_count, t1.last_post, t1.posting_permitted, t1.source_id, t1.post_count_24hr FROM ARTICLE t0 LEFT OUTER JOIN article_message_root t1 ON (t1.source_id = t0.object_id) WHERE ((t0.create_date > ?) AND NOT ((t0.STATUS = ?))) ORDER BY t0.create_date DESC
bind => [2007-04-14 14:46:15.593, P]


Unfortunately, Toplink's behaviour upon handling the results of running this query is NOT reasonable:


java.lang.NullPointerException
at oracle.toplink.essentials.mappings.ForeignReferenceMapping.buildClone(ForeignReferenceMapping.java:122)
at oracle.toplink.essentials.internal.descriptors.ObjectBuilder.populateAttributesForClone(ObjectBuilder.java:2136)
at oracle.toplink.essentials.internal.sessions.UnitOfWorkImpl.populateAndRegisterObject(UnitOfWorkImpl.java:2836)


I've filed this one as https://glassfish.dev.java.net/issues/show_bug.cgi?id=2881. If past behaviour is any indication, the Glassfish people will change the priority on the bug to a P4 and decide not to fix it until we're all very old, despite it being a significant breakage of the API. They even pull that crap when the one-liner fix is already given in the bug report, and it would take longer to reset the priority and update the bug than it would to actually fix the damn problem.



How to make Eclipse, Tomcat, Maven 2 and Wicket play nice

On the off chance that other people find this helpful, here's how I set up Tally-Ho to work in Eclipse with the Sysdeo Tomcat plugin and Maven 2.

First, obviously, you need to install your prerequisites. Download and install Tomcat. Install the Sysdeo Tomcat plugin. You also want the Maven 2 Eclipse plugin. Installing these is outside the scope of this post, as is explaining Maven, Tomcat, Servlets and so on. Use Google.

Next, bootstrap your project. I found it easiest to change into my Eclipse workspace directory, use mvn to create my project from an archetype, and then run mvn eclipse:eclipse inside the project directory it created. Then go to File | Import in Eclipse and import the project. Finally, enable the Maven 2 plugin for the imported project from its context menu: Maven 2 | Enable.

I found that the only way to make working with Maven bearable was to follow its default layout. This means that web.xml goes in src/main/webapp/WEB-INF, that all of the library dependencies are defined in pom.xml, and that the libraries will download into the Maven 2 Dependencies collection the first time you run mvn on the project.

Edit pom.xml and make sure you have your dependencies defined how you want them and that your project name and version are what you'd like.

Now is a good time to run a build of the project just to set up all of the remaining directories, like target. I did this by configuring an m2 build from the External Tools menu using my project's location as the Base directory with the goal "install".

Now set up the Tomcat plugin. From Window | Preferences | Tomcat, configure the appropriate Tomcat version and Tomcat home. From the context menu of your project, select Properties and then Tomcat. Check "is a Tomcat project." Set the context name to "/" and check "Can update context definition" and "Mark this context as reloadable." Set the subdirectory to "/target/your_project_name-your.project.version". The project name and version here must match what you've defined in pom.xml.

Now set up Eclipse to build directly to the Maven output directory. This allows you to avoid running a mvn build every time you make a change to a class or resource file. You will still need to run a mvn build if you add or change dependencies or if you change web.xml, however. From the project's properties context menu, choose Java Build Path and set the default output folder to target/your_project_name-your.project.version/WEB-INF/classes.

Lastly, from your project's context menu, choose Tomcat Project | Update Context Definition.

So to recap, here are the steps:

1. Download and install Maven 2, Eclipse, Tomcat, the Maven 2 plugin for Eclipse and the Tomcat plugin for Eclipse.
2. Create a project using Maven. Consult Maven's documentation for more detail or use an existing project that already has a pom.xml.
3. Run mvn eclipse:eclipse to generate Eclipse's metadata files from the project's pom.xml.
4. Import the project into your Eclipse workspace.
5. Edit pom.xml to define the version number and your dependencies.
6. Put web.xml in src/main/webapp/WEB-INF (a minimal sketch of this file follows the list).
7. Create a m2 external build and run it to create your target directory structure.
8. Configure the Tomcat plugin to look in Maven's target directory for your webapp's directory structure.
9. Configure Eclipse to build directly to Maven's target directory structure.
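
In case it helps with step 6, here's a rough sketch of the kind of web.xml that goes in src/main/webapp/WEB-INF. This is the Wicket flavour, since that's what Tally-Ho uses, and the application class name is just a placeholder for your own (if memory serves, Wicket 1.2's WicketServlet takes it via the applicationClassName init-param):

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee" version="2.4">

    <servlet>
        <servlet-name>wicket</servlet-name>
        <servlet-class>wicket.protocol.http.WicketServlet</servlet-class>
        <init-param>
            <param-name>applicationClassName</param-name>
            <param-value>your.package.YourWicketApplication</param-value>
        </init-param>
    </servlet>

    <servlet-mapping>
        <servlet-name>wicket</servlet-name>
        <url-pattern>/app/*</url-pattern>
    </servlet-mapping>

</web-app>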

To make the setup play nicely with Wicket, you only need to define Wicket as a dependency in pom.xml (step 5). This is a matter of adding:

<dependency>
    <groupId>wicket</groupId>
    <artifactId>wicket</artifactId>
    <version>1.2.5</version>
</dependency>



Thursday, April 19, 2007

edna with her boyfriend rich


Tuesday, April 10, 2007

ace and maruko holding my laundry down


lady ace shall have her bath drawn now


madam ace


Monday, April 09, 2007

Why is Maven Still Such a Horrific Pile of Garbage?

Maven is, hands-down, the absolute worst piece of crapware I have had the misfortune of using in the last 4 years. This collection incidentally includes all versions of Microsoft Internet Explorer, including IE7, which only crashed every time I started it for a week due to an incompatibility with the Google Toolbar. It's worse than Norton Antivirus. It's worse than Microsoft Outlook. It is a complete waste of bits.

The terrible, tragic thing about Maven is that there's a kernel of a really good idea behind it. Building stuff, handling dependencies, running tests, producing reports. Great! Fantastic! If only it weren't to software development what Mr Garrison's "It" was to transit.

First, those who get excited about XML configuration need to die in a fire. A sewage fire. You know what? XML blows. The XML fad is over. Stop using XML for all kinds of garbage that it was never intended for. What the hell is wrong with you? People do not like writing this crap, and they like reading it even less. I don't give a damn that it makes your crapware XML/Object mapping tool spit out nice little objects that are easy for YOU to deal with when handling configuration. It's not about YOU if you want people to use your diarrhea soup.

Next, why does everything in this obtuse XML configuration HELL have to be

nested

and nested

and nested

and nested

and nested?


Seriously, if I need to get a file included in my output, why does it have to be in a structure 4 levels deep? And why do some of the bottom-layer elements allow file globbing? Don't you realize that if you can handle file globbing, you could just use ONE DAMN TAG ONE LAYER DEEP and be done with it? Die!

Want to see the results of your unit tests? Go look in a bunch of individual files! Because the build can only scream FAILURE!!! at you (just like that) and doesn't bother to tell you which assertion failed at which line in which class and method.

What a horrible pile of dung. Maven has been around for well over 4 years and in that time the only thing that appears to have improved is its startup time.

I don't know why anyone puts up with this crap.



Tuesday, April 03, 2007

Making MD5 Fuzzy, Redux

In my previous post, Making MD5 Fuzzy, one problem I noted at the time was that being off by one could completely change the outcome, such that certain small changes in the right places would cause the fuzzy MD5 to no longer match up.

I struggled with the solution to this for quite a while, and then it dawned on me: I was looking at the problem the wrong way. It's fine if an off-by-one changes the outcome, if we're prepared to handle it.

The answer is to produce two checksums, not one! In the first, we begin at the beginning and skip the last n/2 words, where n is the averaging sample size. In the second, we begin n/2 words from the beginning and work all the way to the end.

Then instead of comparing one sum to another sum, we perform four comparisons:

object1.sum1 == object2.sum1
object1.sum2 == object2.sum1
object1.sum1 == object2.sum2
object1.sum2 == object2.sum2


If any of these statements returns true, we consider the objects to be "similar".
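
In Java terms, the similarity check amounts to something like this little helper (illustrative only, not part of the class below; it assumes each side is the String[] pair that getSums() returns):

public static boolean isSimilar(String[] sums1, String[] sums2) {
    return sums1[0].equals(sums2[0])
            || sums1[1].equals(sums2[0])
            || sums1[0].equals(sums2[1])
            || sums1[1].equals(sums2[1]);
}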

Here's the code. I've also simplified the way the distance between words is calculated and left room for non-English words to be handled at some point in the future (i.e., there's no longer any special significance given to vowels).

package net.spatula.tally_ho.utils;

public class FuzzySum {

    private static final int SLOP = 3;

    private static FuzzySum instance;

    private static final int SAMPLE_SIZE = 10;

    private FuzzySum() {
    }

    public static synchronized FuzzySum getInstance() {
        if (instance == null) {
            instance = new FuzzySum();
        }
        return instance;
    }

    /**
     * Returns two fuzzy checksums for the given text: the first skips the last
     * SAMPLE_SIZE/2 words, the second skips the first SAMPLE_SIZE/2 words.
     */
    public String[] getSums(String text) {
        text = TextUtils.stripTags(text).toLowerCase().replaceAll("[^\\w\\s]", "").trim();

        // Very short texts just get a plain MD5, twice.
        if (text.length() < SAMPLE_SIZE * 1.5) {
            String md5 = TextUtils.md5(text);
            return new String[] { md5, md5 };
        }

        String[] words = text.split("(?s)\\s+");

        String md5_1 = calculateFuzzyMd5(words, 0, words.length - 1 - (SAMPLE_SIZE / 2));
        String md5_2 = calculateFuzzyMd5(words, SAMPLE_SIZE / 2, words.length - 1);

        return new String[] { md5_1, md5_2 };
    }

    private String calculateFuzzyMd5(String[] input, int startIndex, int endIndex) {
        StringBuilder builder = new StringBuilder();

        // Sum word-to-word distances, emitting an averaged value every SAMPLE_SIZE words.
        int distanceSum = 0;
        for (int i = startIndex + 1; i <= endIndex; i++) {
            String thisWord = input[i];
            String lastWord = input[i - 1];

            distanceSum += calculateDistance(thisWord, lastWord);
            if (i % SAMPLE_SIZE == 0) {
                if (builder.length() > 0) {
                    builder.append("\n");
                }
                builder.append(distanceSum / SAMPLE_SIZE);
                distanceSum = 0;
            }
        }

        // Fold any leftover distance into one final bucket.
        if (distanceSum != 0) {
            builder.append("\n");
            builder.append(distanceSum / (endIndex + 1 - startIndex % SAMPLE_SIZE));
        }

        return TextUtils.md5(builder.toString());
    }

    private int calculateDistance(String word1, String word2) {
        int word1Sum = calculateWordSum(word1);
        int word2Sum = calculateWordSum(word2);
        return Math.abs(word1Sum - word2Sum) / SLOP;
    }

    private int calculateWordSum(String word) {
        if (word.length() == 1) {
            return (int) (word.charAt(0)) & 0xffff;
        }

        // Sum the absolute differences between adjacent characters, then scale
        // by SLOP and normalize by word length.
        int wordSum = 0;
        for (int i = 1; i < word.length(); i++) {
            int prevChar = (int) (word.charAt(i - 1)) & 0xffff;
            int thisChar = (int) (word.charAt(i)) & 0xffff;
            wordSum += Math.abs(thisChar - prevChar);
        }

        return SLOP * wordSum / word.length();
    }

}


As you can see, this code has been committed as part of the Tally-Ho project, https://tally-ho.dev.java.net/



FreeBSD Network Performance Tuning

I've been tweaking the network stack on my FreeBSD host for many moons now, trying to get everything "just right" for optimal network performance. Many of the defaults are a bit pessimistic, assuming a network that experiences a good deal of packet loss and transmits data over a twisted pair of doorbell wire from a PDP-11 in the damp basement of some godforsaken computer lab to a VAX machine surrounded by nerds in a Physics building 2500 miles away. Sure, that may have been a common scenario back in 1982 or whatever, but these days most networks are much more reliable, delivering far more porn at faster rates than ever before.

My tuning is focused mainly on high-performance web serving on a host that also makes connections via localhost for database access and to front-end Resin OpenSource (a Java Servlet container) with Apache. The host has plenty of RAM and CPU available. These tunings may not be appropriate for all situations, so use your head.

First, enable polling on your interface. While you're at it, compile in zero copy sockets and the http accept filter. In fact, just add this crap to your kernel config if it isn't already there:


options HZ=1000
options DEVICE_POLLING
options ACCEPT_FILTER_HTTP
options ZERO_COPY_SOCKETS


To make sure your device actually polls, edit /etc/rc.conf and add "polling" at the end of ifconfig_{yourInterface}; eg:

ifconfig_bge0="inet 192.168.1.234 netmask 255.255.255.0 polling"


You'll probably also want to tune polling a bit with sysctl:

kern.polling.burst_max=1000
kern.polling.idle_poll=0
kern.polling.each_burst=50


Idle poll tends to keep your CPU busy 100% of the time. For best results, keep kern.polling.each_burst <= the value of net.inet.ip.intr_queue_maxlen, normally 50.

Now sit down and think about what bandwidth and latency you want to plan for. This kinda depends a bit on who typically accesses your host. Are they coming from broadband connections mainly? About how far away are they usually? You can get some assistance with this determination by doing a sysctl net.inet.tcp.hostcache.list. Starting in FreeBSD 5.3, hostcache began keeping track of the usual RTT and Bandwidth available for all of the IP addresses it heard from in the last hour (to a limit of course, which is tuneable... more on that later).

We would be interested in the RTT and BANDWIDTH columns, if the number in the BANDWIDTH column had any bearing on reality whatsoever. Since my hostcache routinely suggests that there's more bandwidth available to a remote host than is actually possible given my machine's uplink, it isn't really reasonable to use this number. You can, however, average the RTT to get a rough idea of the average RTT to the current set of users in your hostcache. You can also get a rough idea of the average TCP congestion window size (CWND). Note that this will be bounded by what you have set for net.inet.tcp.sendspace and net.inet.tcp.recvspace. To make sure you're not the bottleneck, you could try setting these two to an unreasonably high number, like 373760, for an hour to collect the data. You can do a sysctl -w net.inet.tcp.hostcache.purge=1 to clear the old hostcache data if you decide to do this.

Here's a dumb little Perl script for calculating your average and median RTT, CWND and Max CWND:


open(IN, "/sbin/sysctl net.inet.tcp.hostcache.list |");

while (<IN>) {
    @columns = split(/\s+/, $_);
    next if ($columns[0] eq '127.0.0.1');
    next if ($columns[0] eq 'IP');

    next if ($columns[9] < 2 || $columns[10] < 2); # skip if few hits and few updates

    push(@rtts, int($columns[3]));
    push(@cwnds, $columns[6]);

    $rttSum += int($columns[3]);
    $cwndSum += $columns[6];
    $cwndMax = $columns[6] if $columns[6] > $cwndMax;

    $entries++;
}

print "Average RTT = " . int($rttSum / $entries) . "\n";
print "Average CWND = " . int($cwndSum / $entries) . "\n";
print "Max CWND = $cwndMax \n";

@rtts = sort { $a <=> $b } @rtts;
@cwnds = sort { $a <=> $b } @cwnds;

print "Median RTT = " . getMedian(@rtts) . "\n";
print "Median CWND = " . getMedian(@cwnds) . "\n";

sub getMedian {
    my @list = @_;
    if (@list % 2 == 1) {
        return $list[@list / 2];
    } else {
        return ($list[@list / 2 - 1] + $list[@list / 2]) / 2;
    }
}


It's up to you how to use the information the script provides. For me, the most interesting thing to note is that my median RTT is around 100ms and that my max CWND looks to be 122640, at least for the hosts currently in my host cache.

I want to optimize my site for the best possible experience for high-speed broadband users. My home broadband connection is 8Mbps, but it can burst up to 12Mbps for a short time. If we split the difference, that's 10Mbps. This is probably a bit optimistic for most home broadband users. Also note that there's no point in optimizing for more bandwidth than your host actually HAS. In my case, my uplink is 10Mbps, so there's no point in trying to optimize for a 45Mbps connection.

In all probability I won't be able to actually push 10Mbps because I share that connection with some other folks. So let's be just a little bit pessimistic and optimize for 6Mbps. Many home cable services provide between 4 and 8 Mbps downstream, so 6Mbps is a nice "middle of the road" approximation.

To calculate the bandwidth delay product, we take the speed in kbps and multiply it by the latency in ms. In this case, that is 6144 * 100 or 614400. To get the number of bytes for a congestion window that many bits wide, divide by 8. This gives us 76800, the number of bytes we can expect to send before receiving an acknowledgment for the first packet. That's higher than both the median and average congestion window sizes for the folks currently in my hostcache, and about 2/3 of the max. Remember this number.

The next thing to look at is the net.inet.tcp.mssdflt. This is the maximum segment size used when no better information is available. Normally this is set pessimistically low. These days, most networks are capable of moving packets of 1500 bytes, so let's set this to 1460 (1500 minus 40 bytes for headers). sysctl -w net.inet.tcp.mssdflt=1460. This could make the first few packets fail to transmit should MSS negotiation at the start of a TCP connection not happen for some reason or if a network cannot support a packet of that size. I suspect this is quite rare. And we're trying to optimize for the most common case, not the most pessimistic case.

Now we want to make sure that our congestion window size is an even multiple of the default MSS. In fact it isn't. 76800 / 1460 is 52.6027. We round up to the nearest even number - 54 - and multiply by the MSS to get 78840. (I'm not sure why, but many sites recommend that one use an even multiple of MSS.) I round up rather than down because I'm optimistic that I will not have lost that first packet in transit. Rounding down might mean stopping and waiting for the first acknowledgment rather than continuing with one (or two) more packets while awaiting that first reply.

Now that we have our desired window size, let's set it:

sysctl -w net.inet.tcp.recvspace=78840
sysctl -w net.inet.tcp.sendspace=78840


Since we're being optimistic, let's assume that the very first time we talk to our peer, we can completely fill up the window with data. Recall that we can fit 54 packets into 78840 bytes, so we can do this:

net.inet.tcp.slowstart_flightsize=54


Granted, immediately jamming the pipe with packets might be considered antisocial by cranky network administrators who don't like to see retransmissions in the event of an error, but more often than not, these packets will go through without error. I never minded being antisocial. If it really bothers you, cut this number in half. Note that having RFC3390 enabled (as it is by default) and functioning on a connection means that this value isn't used on new connections.

Next, turn on TCP delayed ACK and double the delayed ACK time. This makes it more likely that the first response packet will be able to have the first ACK piggybacked onto it, without overdoing the delay:

net.inet.tcp.delayed_ack=1
net.inet.tcp.delacktime=100


Now enable TCP inflight. The manual page recommends using an inflight.min of 6144:

net.inet.tcp.inflight.enable=1
net.inet.tcp.inflight.min=6144


Finally, some tuning for the loopback. Hosts (like mine) that make a lot of connections to localhost may benefit from these. First, I modify the ifconfig entry for lo0 to include "mtu 8232" (programs commonly use 8192-byte buffers for communicating across localhost; add 40 bytes for the header). Using a similar strategy to what we did above, I tune the following in sysctl.conf:

net.local.stream.sendspace=82320
net.local.stream.recvspace=82320
net.inet.tcp.local_slowstart_flightsize=10
net.inet.tcp.nolocaltimewait=1


The 10 is arbitrary, but it's also the smallest even multiple that makes the loopback window (10 x 8232 = 82320) equal to or greater than the LAN interface window (78840). There might be some small advantage in doing this if there are programs which may copy the incoming request to some other program via the loopback.

Adding net.inet.tcp.nolocaltimewait frees up resources more quickly for connections on the loopback.

Finally, make the host cache last a bit longer:

net.inet.tcp.hostcache.expire=3900


The reason I do this is that some hosts may connect once an hour automatically. Increasing the time slightly increases the chances that such hosts would be able to take advantage of the hostcache. If you like, you can also increase the size of this hash to allow for more entries. I do this for the TCP TCB hash as well. These have to be changed in /boot/loader.conf as they can't be changed once the kernel is running:

net.inet.tcp.tcbhashsize="4096"
net.inet.tcp.hostcache.hashsize="1024"


So that's it. If these settings are applicable to you, you can just add this to /etc/sysctl.conf:

net.local.stream.sendspace=82320
net.local.stream.recvspace=82320
net.inet.tcp.local_slowstart_flightsize=10
net.inet.tcp.nolocaltimewait=1

net.inet.tcp.delayed_ack=1
net.inet.tcp.delacktime=100

net.inet.tcp.mssdflt=1460
net.inet.tcp.sendspace=78840
net.inet.tcp.recvspace=78840
net.inet.tcp.slowstart_flightsize=54

net.inet.tcp.inflight.enable=1
net.inet.tcp.inflight.min=6144

kern.polling.burst_max=1000
kern.polling.idle_poll=0
kern.polling.each_burst=50

net.inet.tcp.hostcache.expire=3900


And don't forget to edit /etc/rc.conf and add "mtu 8232" for your ifconfig_lo0 line and "polling" for your LAN adaptor.


