Making the Internet Safe for ECN
I’m off to New York in a couple of weeks to present a paper at PAM (which I mentioned here, though sadly the flashy automated demo I was hoping to build was a bit optimistic). The question: “is it safe to turn on ECN on client machines by default, completing the end-to-end deployment of a simple fifteen-year-old protocol that gives us a better way to signal network congestion than simply dropping packets on the floor?” The answer is: “define safe.” Our key findings:
- More than half of the Alexa top million web servers (600k accounting for duplicate IPs) will happily negotiate and mark ECT0 if you ask nicely (at least as of September 2014). This mainly reflects people upgrading Linux servers to kernels where tcp_ecn=2 is the default, and strongly validates changing default configurations as a method for increasing ECN deployment.
- 0.42% of these web servers will fail to connect if you try to negotiate ECN, but simple ECN fallback as in RFC 3168 (a SYN with ECE and CWR set is retransmitted as a plain SYN) reduces this to a risk of slightly increased handshake latency.
- A vanishingly small number (15 / ~600k) of these show ECN-dependent connectivity that varies with where you connect from, indicating that the box breaking ECN is not directly adjacent to the server. Six of these (40%) are GoDaddy parking sites.
- There is more mangling of the ECN IP header bits than connectivity dependency, and successful negotiation does not always mean successful marking. About 2% of IPv4 servers and 15% (!!!) of IPv6 servers signal in other than expected ways, indicating that negotiated ECN might not be useful.
- We appear to have seen two (count ‘em, two!) CE markings in the wild, both from the same webserver (www.grandlyon.com) when probing 600k IP addresses 3 times from 3 different locations (i.e., 2 out of 5.6 million flows). This is neither encouraging nor surprising.
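For readers who don't keep the bit layouts in their heads, here is a minimal Python sketch of what "negotiating" and "marking" mean on the wire. It is not our measurement code; the function names are made up for illustration. The constants are the standard ones: the ECN field is the two low-order bits of the IPv4 TOS / IPv6 Traffic Class byte, and ECE and CWR are TCP flag bits.

```python
# TCP flag bits relevant to ECN negotiation.
SYN, ACK, ECE, CWR = 0x02, 0x10, 0x40, 0x80

# ECN field: two low-order bits of the IPv4 TOS / IPv6 Traffic Class byte.
ECN_CODEPOINTS = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def ecn_codepoint(tos: int) -> str:
    """Name the ECN codepoint carried in a TOS / Traffic Class byte."""
    return ECN_CODEPOINTS[tos & 0b11]


def is_ecn_setup_syn(flags: int) -> bool:
    """Client asks for ECN: SYN with both ECE and CWR set, ACK clear."""
    return (flags & (SYN | ACK | ECE | CWR)) == (SYN | ECE | CWR)


def is_ecn_setup_synack(flags: int) -> bool:
    """Server agrees to ECN: SYN+ACK with ECE set and CWR clear."""
    return (flags & (SYN | ACK | ECE | CWR)) == (SYN | ACK | ECE)


if __name__ == "__main__":
    print(ecn_codepoint(0x02))        # low bits 10
    print(is_ecn_setup_syn(0xC2))     # SYN | ECE | CWR
    print(is_ecn_setup_synack(0x52))  # SYN | ACK | ECE
```

A server that "negotiates and marks" answers an ECN-setup SYN with an ECN-setup SYN-ACK and then sends data packets whose codepoint is ECT(0); a CE codepoint means some router on the path actually signalled congestion.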
Bottom line, the risk to connectivity of turning ECN on by default in clients is vanishingly low, though not yet in the one in ten million range, when simple fallback as in RFC 3168 is implemented. Modern Windows and Mac OS X do this; Linux doesn’t yet, though we have a three-line patch (which, anecdotally, I’ve been running without incident on my desktop at the office for the past half year).
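To make the fallback idea concrete, here is a rough Python simulation of the logic. This is emphatically not the Linux patch, and not how any particular OS implements it: `send_syn` is a hypothetical callback standing in for "send a SYN with these flags and wait for a SYN-ACK", and `ecn_attempts` and `max_attempts` are knobs I made up for the sketch.

```python
SYN, ACK, ECE, CWR = 0x02, 0x10, 0x40, 0x80


def syn_flags(attempt: int, ecn_attempts: int = 1) -> int:
    """Flags for the SYN on a given attempt (attempt 0 is the initial SYN).

    The first `ecn_attempts` SYNs request ECN; later retransmissions
    fall back to a plain SYN.
    """
    if attempt < ecn_attempts:
        return SYN | ECE | CWR
    return SYN


def connect_with_fallback(send_syn, max_attempts: int = 3, ecn_attempts: int = 1):
    """Try to connect, dropping the ECN request on retransmission.

    `send_syn(flags)` is a hypothetical callback returning the SYN-ACK
    flags, or None if the SYN was lost or reset.
    Returns (attempt_index, ecn_negotiated), or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        flags = syn_flags(attempt, ecn_attempts)
        reply = send_syn(flags)
        if reply is not None:
            asked = bool(flags & ECE)
            agreed = bool(reply & ECE) and not (reply & CWR)
            return attempt, asked and agreed
    return None


if __name__ == "__main__":
    # A path that silently drops any SYN asking for ECN.
    def broken_path(flags):
        return None if flags & ECE else (SYN | ACK)

    # A well-behaved server that agrees to ECN when asked.
    def ecn_server(flags):
        return (SYN | ACK | ECE) if flags & ECE else (SYN | ACK)

    print(connect_with_fallback(broken_path))
    print(connect_with_fallback(ecn_server))
```

The point of the sketch is the cost model: against a path that breaks ECN, the connection still succeeds, just one retransmission timeout later and without ECN.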
Given the signaling anomalies, especially on IPv6, defining simple methods to detect and dynamically ignore anomalous signaling at the endpoints is probably the next area of work toward making ECN deployable.
So now I know what I’m doing with the rest of my copious free time…