Nick Johnson: FreeBSD Network Performance Tuning

Tuesday, April 03, 2007

FreeBSD Network Performance Tuning

I've been tweaking the network stack on my FreeBSD host for many moons now, trying to get everything "just right" for optimal network performance. Many of the defaults are a bit pessimistic, assuming a network that experiences a good deal of packet loss and transmits data over a twisted pair of doorbell wire from a PDP-11 in the damp basement of some godforsaken computer lab to a VAX machine surrounded by nerds in a Physics building 2500 miles away. Sure, that may have been a common scenario back in 1982 or whatever, but these days most networks are much more reliable, delivering far more porn at faster rates than ever before.

My tuning is focused mainly on high-performance web serving on a host that also makes connections via localhost for database access and to front-end Resin OpenSource (a Java Servlet container) with Apache. The host has plenty of RAM and CPU available. These tunings may not be appropriate for all situations, so use your head.

First, enable polling on your interface. While you're at it, compile in zero copy sockets and the http accept filter. In fact, just add this crap to your kernel config if it isn't already there:


options         HZ=1000
options         DEVICE_POLLING
options         ACCEPT_FILTER_HTTP
options         ZERO_COPY_SOCKETS

To make sure your device actually polls, edit /etc/rc.conf and add "polling" at the end of ifconfig_{yourInterface}; eg:


ifconfig_bge0="inet 192.168.1.234 netmask 255.255.255.0 polling"

You probably also will want to tune polling a bit with sysctl:


kern.polling.burst_max=1000
kern.polling.idle_poll=0
kern.polling.each_burst=50

Idle poll tends to keep your CPU busy 100% of the time. For best results, keep kern.polling.each_burst <= the value of net.inet.ip.intr_queue_maxlen, normally 50.

Now sit down and think about what bandwidth and latency you want to plan for. This kinda depends a bit on who typically accesses your host. Are they coming from broadband connections mainly? About how far away are they usually? You can get some assistance with this determination by doing a sysctl net.inet.tcp.hostcache.list. Starting in FreeBSD 5.3, hostcache began keeping track of the usual RTT and Bandwidth available for all of the IP addresses it heard from in the last hour (to a limit of course, which is tuneable... more on that later).

We would be interested in the RTT and BANDWIDTH columns, if the number in the BANDWIDTH column had any bearing on reality whatsoever. Since my hostcache routinely suggests that there's more bandwidth available to a remote host than is actually possible given my machine's uplink, it isn't really reasonable to use this number. You can, however, average the RTT to get a rough idea of the average RTT to the current set of users in your hostcache. You can also get a rough idea of the average TCP congestion window size (CWND). Note that this will be bounded by what you have set for net.inet.tcp.sendspace and net.inet.tcp.recvspace. To make sure you're not the bottleneck, you could try setting these two to an unreasonably high number, like 373760, for an hour to collect the data. You can do a sysctl -w net.inet.tcp.hostcache.purge=1 to clear the old hostcache data if you decide to do this.

Here's a dumb little perl script for calculating your average and median RTT, CWND and Max CWND:


open(IN, "/sbin/sysctl net.inet.tcp.hostcache.list |");

while (<IN>) {
    @columns = split(/\s+/, $_);
    next if ($columns[0] eq '127.0.0.1');
    next if ($columns[0] eq 'IP');

    next if ($columns[9] < 2 || $columns[10] < 2);  # skip if few hits and few updates

    push(@rtts, int($columns[3]));
    push(@cwnds, $columns[6]);

    $rttSum += int($columns[3]);
    $cwndSum += $columns[6];
    $cwndMax = $columns[6] if $columns[6] > $cwndMax;

    $entries++;
}

print "Average RTT = " . int($rttSum / $entries) . "\n";
print "Average CWND = " . int($cwndSum / $entries) . "\n";
print "Max CWND = $cwndMax \n";

@rtts = sort { $a <=> $b } @rtts;
@cwnds = sort { $a <=> $b } @cwnds;

print "Median RTT = " . getMedian(@rtts) . "\n";
print "Median CWND = " . getMedian(@cwnds) . "\n";

sub getMedian {
    my @list = @_;
    if (@list % 2 == 1) {
       return $list[@list / 2];
    }  else {
       return ($list[@list / 2 - 1] + $list [@list / 2]) / 2;
    }
}

It's up to you how to use the information the script provides. For me, the most interesting thing to note is that my median RTT is around 100ms and that my max CWND looks to be 122640, at least for the hosts currently in my host cache.

I want to optimize my site for the best possible experience for high speed broadband users.. My home broadband connection is 8Mbps, but it can burst up to 12Mbps for a short time. If we split the difference, that's 10Mbps. This is probably a bit optimistic for most home broadband users. Also note that there's no point in optimizing for more bandwidth than your host actually HAS. In my case, my uplink is 10Mbps, so there's no point in trying to optimize for a 45Mbps connection.

In all probability I won't be able to actually push 10Mbps because I share that connection with some other folks. So let's be just a little bit pessimistic and optimize for 6Mbps. Many home cable services provide between 4 and 8 Mbps downstream, so 6Mbps is a nice "middle of the road" approximation.

To calculate the bandwidth delay product, we take the speed in kbps and multiply it by the latency in ms. In this case, that is 6144 * 100 or 614400. To get the number of bytes for a congestion window that many bits wide, divide by 8. This gives us 76800, the number of bytes we can expect to send before receiving an acknowledgment for the first packet. That's higher than both the median and average congestion window sizes for the folks currently in my hostcache, and about 2/3 of the max. Remember this number.

The next thing to look at is the net.inet.tcp.mssdflt. This is the maximum segment size used when no better information is available. Normally this is set pessimistically low. These days, most networks are capable of moving packets of 1500 bytes, so let's set this to 1460 (1500 minus 40 bytes for headers). sysctl -w net.inet.tcp.mssdflt=1460. This could make the first few packets fail to transmit should MSS negotiation at the start of a TCP connection not happen for some reason or if a network cannot support a packet of that size. I suspect this is quite rare. And we're trying to optimize for the most common case, not the most pessimistic case.

Now we want to make sure that our congestion window size is an even multiple of the default MSS. In fact it isn't. 76800 / 1460 is 52.6027. We round up to the nearest even number - 54 - and multiply by the MSS to get 78840. (I'm not sure why, but many sites recommend that one use an even multiple of MSS.) I round up rather than down because I'm optimistic that I will not have lost that first packet in transit. Rounding down might mean stopping and waiting for the first acknowledgment rather than continuing with one (or two) more packets while awaiting that first reply.

Now that we have our desired window size, let's set it:


sysctl -w net.inet.tcp.recvspace=78840
sysctl -w net.inet.tcp.sendspace=78840

Since we're being optimistic, let's assume that the very first time we talk to our peer, we can completely fill up the window with data. Recall that we can fit 54 packets into 78840 bytes, so we can do this:


net.inet.tcp.slowstart_flightsize=54

Granted, immediately jamming the pipe with packets might be considered antisocial by cranky network administrators who don't like to see retransmissions in the event of an error, but more often than not, these packets will go through without error. I never minded being antisocial. If it really bothers you, cut this number in half. Note that having RFC3390 enabled (as it is by default) and functioning on a connection means that this value isn't used on new connections.

Next, turn on TCP delayed ACK and double the delayed ACK time. This makes it more likely that the first response packet will be able to have the first ACK piggybacked onto it, without overdoing the delay:


net.inet.tcp.delayed_ack=1
net.inet.tcp.delacktime=100

Now enable TCP inflight. The manual page recommends using an inflight.min of 6144:


net.inet.tcp.inflight.enable=1
net.inet.tcp.inflight.min=6144

Finally some tuning for the loopback. Hosts (like mine) that do a lot of connections to localhost may benefit from these. First I modify the ifconfig entry for lo0 to include "mtu 8232" (programs commonly use 8192-byte buffers for communicating across localhost, add 40 bytes for header). Using a similar strategy to what we did above, I tune the following in sysctl.conf:


net.local.stream.sendspace=82320
net.local.stream.recvspace=82320
net.inet.tcp.local_slowstart_flightsize=10
net.inet.tcp.nolocaltimewait=1

The 10 is arbitrary, but it's also the smallest even multiple that makes the loopback window equal or greater in size than the LAN interface window. There might be some small advantage in doing this if there are programs which may copy the incoming request to some other program via the loopback.

Adding net.inet.tcp.nolocaltimewait frees up resources more quickly for connections on the loopback.

Finally, make the host cache last a bit longer:


net.inet.tcp.hostcache.expire=3900

The reason I do this is that some hosts may connect once an hour automatically. Increasing the time slightly increases the chances that such hosts would be able to take advantage of the hostcache. If you like, you can also increase the size of this hash to allow for more entries. I do this for the TCP TCB hash as well. These have to be changed in /boot/loader.conf as they can't be changed once the kernel is running:


net.inet.tcp.tcbhashsize="4096"
net.inet.tcp.hostcache.hashsize="1024"

So that's it. If these settings are applicable to you, you can just add this to /etc/sysctl.conf:


net.local.stream.sendspace=82320
net.local.stream.recvspace=82320
net.inet.tcp.local_slowstart_flightsize=10
net.inet.tcp.nolocaltimewait=1

net.inet.tcp.delayed_ack=1
net.inet.tcp.delacktime=100

net.inet.tcp.mssdflt=1460
net.inet.tcp.sendspace=78840
net.inet.tcp.recvspace=78840
net.inet.tcp.slowstart_flightsize=54

net.inet.tcp.inflight.enable=1
net.inet.tcp.inflight.min=6144

kern.polling.burst_max=1000
kern.polling.idle_poll=0
kern.polling.each_burst=50

net.inet.tcp.hostcache.expire=3900

And don't forget to edit /etc/rc.conf and add "mtu 8232" for your ifconfig_lo0 line and "polling" for your LAN adaptor.

Labels: software

# posted by Nick : 4:22 PM

Comments:

In your calculating script - condition while( "tag"IN"tag" ) not visible

# posted by

Anonymous : 2:26 AM

Fixed.

# posted by

Nick : 8:41 AM

What would you recommend for a freebsd 7 apache 2.2 web server with only 256RAM ?

# posted by

slacker : 12:23 PM

lighttpd or nginx and more RAM

# posted by

Anonymous : 4:29 AM

for "options ACCEPT_FILTER_HTTP"

the Security Advisory "FreeBSD-SA-02:26.accept"

http://security.freebsd.org/advisories/FreeBSD-SA-02:26.accept.asc

besides VERY OLD suggested not to use any 'accept filter' on kernel.

I know that it (probably) is already corrected but preffer to not use anyway. What is you advice on this?

# posted by

irado : 6:20 AM

My advice is to read the entire security advisory. The problem in question was fixed in 2002.

# posted by

Nick : 8:27 AM

Nick Johnson

Tuesday, April 03, 2007

FreeBSD Network Performance Tuning

About Me

Links

Archives