wiki'd

by JoKeru

EADDRINUSE (Address already in use)

The problem: while running a proxy service (Squid), customers started getting this error message:

Socket Failure The system returned: (98) Address already in use Squid is unable to create a TCP socket, presumably due to excessive load. Please retry your request.

Checking the logs and running strace on the process gives this:
[cc lang='bash']
$ tail /var/log/squid/cache.log
2014/05/22 20:12:40| commBind: Cannot bind socket FD 482 to 10.20.30.40:0: (98) Address already in use

$ strace -p "$(cat /var/run/squid.pid)" -e trace=bind
bind(482, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.20.30.40")}, 16) = -1 EADDRINUSE (Address already in use)
[/cc]
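To see how many sockets are sitting in each TCP state, something like the sketch below can help. It assumes the `ss` tool from iproute2 is installed; `count_tcp_states` is just a helper name made up for this example, and it only parses the first column of `ss -tan` output.

```shell
# Summarise TCP sockets by state. Reads `ss -tan` style output on stdin:
# the first line is a header and the first column of each row is the state.
count_tcp_states() {
    awk 'NR > 1 { states[$1]++ } END { for (s in states) print states[s], s }' | sort -rn
}

# On a live system (assumption: iproute2's ss is available):
if command -v ss >/dev/null 2>&1; then
    ss -tan | count_tcp_states
fi
```

A large TIME-WAIT count next to a modest ESTAB count would match the explanation everybody gives for this error.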

Googling the error, everybody agrees that "you have run out of free ports, all available ports are occupied by TIME_WAIT sockets".

How can this be possible when you have:
- tcp_fin_timeout = 5
- default local port range (ephemeral ports)
- 50 IPs configured on the outbound interface (tcp_outgoing_address squid directive)

The range of ephemeral ports a client program can use (unless the program specifies otherwise) on modern Linux distributions defaults to 32768 through 61000 on systems with more than 128 MB of RAM, and 1024 through 4999 (or even fewer) on systems with less than 128 MB of RAM. The range is defined by the kernel parameter /proc/sys/net/ipv4/ip_local_port_range and applies to both TCP and UDP client connections.
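A quick way to check the range on a given box is to read that proc file directly. This is a minimal sketch: `port_range_size` is a helper name invented here, and it uses the same "high minus low" arithmetic as the rest of this post.

```shell
# Print the size of an ephemeral port range, using the same arithmetic as
# this post (high minus low).
port_range_size() {
    echo $(( $2 - $1 ))
}

# Read the live range when the proc file is present (Linux only).
range_file=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range_file" ]; then
    read -r low high < "$range_file"
    echo "ephemeral range: $low-$high ($(port_range_size "$low" "$high") ports)"
fi

# The default range quoted above.
port_range_size 32768 61000   # prints 28232
```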

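Doing the simple math, there should be 50 IPs x (61000 - 32768) ports = 1,411,600 sockets available, but the system had only ~30k active!

The answer lies in the following sentence: Linux shares the assigned list of ephemeral ports across all local IPs for unconnected sockets. Squid binds each outgoing socket to one of its addresses with port 0 before connecting (the bind() call in the strace output above), so the kernel has to pick the port while the socket is still unconnected.
So no matter how many IPs you have on the server, you'll only be able to use 28,232 sockets.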
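The gap between the two numbers is easy to reproduce with shell arithmetic. The figures are the ones from this post (50 IPs, default 32768-61000 range); the variable names are only for illustration.

```shell
ips=50
ports=$(( 61000 - 32768 ))

naive=$(( ips * ports ))   # what one might expect: one port pool per IP
shared=$ports              # what was observed: one pool shared by all IPs

echo "naive estimate: $naive sockets"
echo "shared pool:    $shared sockets"
```

The second figure is the one that lines up with the ~30k sockets seen on the server.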

The fix:
[cc lang='bash']
# net.ipv4.tcp_tw_reuse didn't help
$ echo 'net.ipv4.tcp_tw_recycle = 1' >> /etc/sysctl.conf
$ sysctl -p
[/cc]
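Before relying on this, it may be worth checking that the running kernel exposes the setting at all: tcp_tw_recycle was removed in Linux 4.12, so newer systems will not have it. The sketch below is one way to check. `show_sysctl` is a made-up helper, and its optional second argument (a different root instead of /proc/sys) exists only to make it easy to try out.

```shell
# Report the value of a sysctl exposed under /proc/sys, or say that the
# running kernel does not provide it. The optional second argument
# (hypothetical, for trying it out) overrides the /proc/sys root.
show_sysctl() {
    key_path="${2:-/proc/sys}/$(echo "$1" | tr . /)"
    if [ -r "$key_path" ]; then
        echo "$1 = $(cat "$key_path")"
    else
        echo "$1 is not available on this kernel"
    fi
}

show_sysctl net.ipv4.tcp_tw_reuse
show_sysctl net.ipv4.tcp_tw_recycle
```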

The result: traffic more than doubled while no other metric (CPU usage, load, response time) degraded.
[caption id="attachment_1128" align="aligncenter" width="626"]'tcp_tw_recycle = 1' effect on a busy server[/caption]

UPDATE: when you fix something, be careful not to break something else :)
