Friday, 3 June 2011

TCP Tuning Details

The following settings are important for TCP performance. Each defaults to 1 (enabled), and the defaults are fine:

net.ipv4.tcp_window_scaling
net.ipv4.tcp_timestamps
net.ipv4.tcp_sack
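
As a quick check, the three settings above can be read back and compared against 1. This is only a minimal sketch: the function name check_tcp_flags and its optional directory argument are made up for illustration (the argument just lets the function be pointed at a copy of the files instead of the live /proc/sys).

```shell
# Print the three settings and flag any that are not 1.
# The root directory defaults to /proc/sys; the optional argument is an
# assumption of this sketch, handy for trying it against a scratch directory.
check_tcp_flags() {
    root="${1:-/proc/sys}"
    status=0
    for name in tcp_window_scaling tcp_timestamps tcp_sack; do
        file="$root/net/ipv4/$name"
        if [ -r "$file" ]; then
            value=$(cat "$file")
        else
            value="unreadable"
        fi
        if [ "$value" = "1" ]; then
            echo "net.ipv4.$name = $value (ok)"
        else
            echo "net.ipv4.$name = $value (expected 1)"
            status=1
        fi
    done
    return "$status"
}
```

Running check_tcp_flags with no argument reads the live values; it returns non-zero if any of the three is not 1.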

Notes:

Some people recommend disabling tcp_timestamps. We do not recommend this for high-speed networks. It may help for home users on slow networks, as timestamps add an additional 10 bytes to each packet. But more accurate timestamps make TCP congestion control algorithms work better, and are recommended for fast networks.
Some people recommend increasing net.ipv4.tcp_mem. This is not usually needed. tcp_mem values are measured in memory pages, not bytes. The size of each memory page differs depending on hardware and configuration options in the kernel, but on standard i386 computers it is 4 kilobytes (4096 bytes). The defaults are therefore much larger than they look if you read them as bytes, and they are fine for most cases.
For more information on TCP variables see: http://www.frozentux.net/ipsysctl-tutorial/ipsysctl-tutorial.html#TCPVARIABLES
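
To see what a tcp_mem value means in bytes, multiply it by the page size. The sketch below assumes a helper named pages_to_bytes (not a standard tool); the page count used in the comment is purely illustrative and is not a default taken from any particular kernel.

```shell
# Convert a page count (the unit used by net.ipv4.tcp_mem) to bytes.
# The second argument is the page size in bytes; if omitted, the value
# reported by getconf for this machine is used.
pages_to_bytes() {
    pages="$1"
    page_size="${2:-$(getconf PAGESIZE)}"
    echo $((pages * page_size))
}

# Illustrative only: 196608 pages of 4096 bytes each.
#   pages_to_bytes 196608 4096
# To convert the three live tcp_mem values:
#   for p in $(sysctl -n net.ipv4.tcp_mem); do pages_to_bytes "$p"; done
```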

Starting in Linux 2.6.7 (and back-ported to 2.4.27), Linux includes alternative congestion control algorithms besides the traditional 'reno' algorithm. These are designed to recover quickly from packet loss on high-speed WANs. Starting with version 2.6.13, Linux supports pluggable congestion control algorithms. The algorithm in use is set with the sysctl variable net.ipv4.tcp_congestion_control, which defaults to bic, cubic, or reno, depending on which version of the 2.6 kernel you are running.

To get a list of congestion control algorithms that are available in your kernel (if you are running 2.6.20 or higher), run:

sysctl net.ipv4.tcp_available_congestion_control

The set of available congestion control algorithms is chosen when you build the kernel. The following are some of the options available in the 2.6.23 kernel:

reno: Traditional TCP used by almost all other OSes. (default)
cubic: CUBIC-TCP
bic: BIC-TCP
htcp: Hamilton TCP
vegas: TCP Vegas
westwood: optimized for lossy networks

If cubic and/or htcp are not listed when you run 'sysctl net.ipv4.tcp_available_congestion_control', try the following, as most distributions include them as loadable kernel modules:

/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic

NOTE: There seem to be bugs in both bic and cubic for a number of versions of the 2.6.18 kernel used by Red Hat Enterprise Linux 5.3 - 5.5 and its variants (CentOS, Scientific Linux, etc.). We recommend using htcp with a 2.6.18.x kernel to be safe.
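
The check-then-modprobe step can be scripted. The following is a sketch only: cc_listed and ensure_cc are hypothetical helper names, and ensure_cc assumes the module is called tcp_<algorithm>, as with tcp_htcp and tcp_cubic above. Note that the match has to be on whole words, because "bic" also occurs inside "cubic".

```shell
# Succeed if algorithm $2 appears as a whole word in the space-separated
# list $1. A plain substring match would wrongly find "bic" in "cubic".
cc_listed() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Load the module for an algorithm only if the kernel does not list it yet.
# Assumes the module name is tcp_<algorithm>; needs root for modprobe.
ensure_cc() {
    available=$(sysctl -n net.ipv4.tcp_available_congestion_control)
    if cc_listed "$available" "$1"; then
        echo "$1 is already available"
    else
        /sbin/modprobe "tcp_$1"
    fi
}
```

For example, 'ensure_cc htcp' would load tcp_htcp only when htcp is missing from the list.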

For long fast paths, we highly recommend using cubic or htcp. Cubic is the default for a number of Linux distributions, but if it is not the default on your system, you can do the following:

sysctl -w net.ipv4.tcp_congestion_control=cubic

On systems that support RPMs, you can also try the ktune RPM, which sets many of these parameters as well.

More information on tuning parameters and defaults for Linux 2.6 is available in the file ip-sysctl.txt, which is part of the 2.6 source distribution.

Warning on Large MTUs: If you have configured your Linux host to use 9K MTUs, but the connection is using 1500 byte packets, then you actually need 9/1.5 = 6 times more buffer space in order to fill the pipe. In fact, some device drivers only allocate memory in power-of-two sizes, so a 9K buffer becomes 16K and you may even need 16/1.5 = about 11 times more buffer space!
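
The arithmetic in that warning can be reproduced with two small helpers. This is a rough sketch: next_pow2 and buffer_factor are illustrative names, and 9000 and 1500 bytes are assumed as the exact sizes behind "9K" and "1500 byte packets".

```shell
# Smallest power of two that is greater than or equal to $1.
next_pow2() {
    n=1
    while [ "$n" -lt "$1" ]; do
        n=$((n * 2))
    done
    echo "$n"
}

# Whole number of packets of $2 bytes needed to cover an allocation of
# $1 bytes (rounded up), i.e. how many times more buffer space is needed
# when each packet occupies a full allocation.
buffer_factor() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 9000-byte buffers carrying 1500-byte packets:
#   buffer_factor 9000 1500
# Same, with the driver rounding each buffer up to a power of two:
#   buffer_factor "$(next_pow2 9000)" 1500
```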

And finally a warning: for very large BDP paths where the TCP window is > 20 MB, you may hit a problem in the Linux SACK implementation. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, so you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK. The problem appears to have been fixed in version 2.6.26.
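
To estimate whether a path is in that range, you can compute its bandwidth-delay product (BDP). The sketch below assumes the window needed to fill the path is roughly the BDP, and it treats 20 MB as 20 * 1024 * 1024 bytes; bdp_bytes and sack_risk are made-up helper names.

```shell
# Bandwidth-delay product in bytes, given a rate in Mbit/s and an RTT in ms:
# (rate * 10^6 / 8) bytes/s * (rtt / 1000) s = rate * rtt * 125 bytes.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# Report whether the BDP of a path exceeds 20 MB.
# 20 MB is taken as 20 * 1024 * 1024 bytes (an assumption of this sketch).
sack_risk() {
    bdp=$(bdp_bytes "$1" "$2")
    if [ "$bdp" -gt $((20 * 1024 * 1024)) ]; then
        echo "BDP $bdp bytes: above 20 MB"
    else
        echo "BDP $bdp bytes: 20 MB or less"
    fi
}
```

For example, 'sack_risk 10000 20' describes a 10 Gbit/s path with a 20 ms round-trip time, and 'sack_risk 1000 100' a 1 Gbit/s path with 100 ms.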

Also, I've been told that for some network paths, using the Linux 'tc' (traffic control) system to pace traffic out of the host can help improve total throughput.