m11y and o11y
Looking back over the arc of my career in pseudoacademia, especially over the last three years of digging into transport stack evolution with the MAMI project, there are a few bits of work I’m especially happy to have been a part of. One of these is the inclusion of the spin bit in the QUIC transport protocol. The spin bit was conceived as the minimum useful explicit signal one could add to a transport protocol to improve measurability; the benefit is IMO quite worth the overhead. Though it exposes “just” RTT, latency (together with data rate, which is available simply by counting packets and bytes on the wire in any transport protocol not hardened against traffic analysis to the point of uselessness) is the most important metric for understanding transport layer performance and diagnosing all manner of transport-relevant network problems, and the spin signal itself can also be observed to infer loss and other issues with network treatment of a packet stream. The definition and deployment of the spin bit will therefore make network protocols more measurable while preserving the privacy gains from encryption, and is a clear win for network operations and management.
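To make the mechanism concrete, here’s a minimal sketch of the observer side, assuming packet capture and QUIC header parsing happen elsewhere: feed it the arrival time and spin-bit value of each packet seen in one direction of a flow, and it emits an RTT sample at each flip, since the bit inverts once per round trip as seen from any single point on the path.

```python
class SpinObserver:
    """Estimate RTT from spin-bit flips seen in one direction of a flow."""

    def __init__(self):
        self.last_spin = None  # spin value on the previous packet
        self.last_edge = None  # timestamp of the previous flip

    def observe(self, timestamp: float, spin: int):
        """Feed one packet's arrival time and spin bit; return an RTT
        sample (in the units of timestamp) when the bit flips, else None."""
        rtt = None
        if self.last_spin is not None and spin != self.last_spin:
            if self.last_edge is not None:
                # The bit flips once per round trip as seen in one
                # direction, so edge-to-edge time is one RTT sample.
                rtt = timestamp - self.last_edge
            self.last_edge = timestamp
        self.last_spin = spin
        return rtt


obs = SpinObserver()
for ts, spin in [(0.00, 0), (0.01, 0), (0.05, 1), (0.09, 1), (0.10, 0)]:
    sample = obs.observe(ts, spin)
    if sample is not None:
        print(f"RTT sample: {sample * 1000:.0f} ms")  # -> 50 ms
```

A real observer also has to discard samples distorted by reordering and by application-limited senders that stop driving the bit; the sketch ignores those complications.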
The spin bit is the intersection between what it is currently possible to standardize and a more comprehensive vision of measurability (or, since it’s m followed by eleven letters followed by a y, “m11y”) in protocol design. In a world where our estimation of the Internet threat model didn’t include nation-state actors bent on pervasive passive surveillance(1), some of the more ambitious ideas in the Principles for Measurability paper might be tenable. But here we are.
More formally, a transport protocol is fully measurable if all of the useful metrics about its operation can be derived from observation of its wire image. Maybe I’m too focused on transport, but since transport layer protocols map to the abstractions application developers use to interact with the network – sockets or connections(2) – these are the abstractions that matter for diagnosing problems with services running over the network.
Perfect measurability is hard to achieve in the Internet because of the tension between designing metric exposure into a protocol’s wire image and the desire to make the wire image less useful for traffic analysis approaches that attempt to infer higher-layer behavior and semantics – semantics that end-to-end encryption has been applied to protect from passive surveillance activities. RTT being basically useless for pervasive passive surveillance while being useful for network operations is the reason we have a spin bit.
I’m an Internet measurement researcher (at least, I am until next Thursday afternoon), so deriving metrics from traffic is interesting to me as an end in itself. However, its primary utility is in network and application monitoring and diagnostics: determining whether some service is functioning properly and with performance acceptable to its users, and if not, why not. Here, measurability is closely related to observability (or, following convention, “o11y”(3)). My own definition of a fully observable system at the application layer would be one that logs exactly enough information to diagnose any arbitrary future failure, but nothing more. This perfect optimum is of course impossible to achieve, but practically useful observability is a matter of careful engineering and, perhaps more importantly, a change in application engineering culture that de-emphasizes experimental reproduction of faults in favor of keeping enough state around to allow fault tracing and identification after the fact.
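As a hypothetical illustration of what “keeping enough state around” might look like, here’s a sketch of wide-event logging in Python: each request emits one structured record with enough attached context to trace a failure after the fact. The field names and the do_work() handler are invented for the example, not taken from any particular logging system.

```python
import json
import sys
import time
import uuid


def do_work(path: str) -> str:
    """Stand-in for the actual request handler."""
    return "ok"


def handle_request(conn_id: str, path: str) -> None:
    """Handle one request, emitting a single structured event that
    carries enough context to diagnose a failure after the fact."""
    start = time.time()
    event = {
        "ts": start,
        "conn_id": conn_id,               # links to transport-layer state
        "request_id": str(uuid.uuid4()),  # unique handle for this request
        "path": path,
    }
    try:
        event["status"] = do_work(path)
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)        # keep the failure itself, not just a flag
        raise
    finally:
        event["duration_ms"] = (time.time() - start) * 1000.0
        print(json.dumps(event), file=sys.stdout)  # ship to the log pipeline
```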
The tradeoff with observability is different from that with measurability. Observability generally applies within a single administrative domain: it’s not necessary to treat your logging infrastructure as if it might be controlled by an attacker(4), as one must treat observers in the case of measurability. But the tradeoff is equally difficult: one must know enough about how a system is likely to break to log the right things, or the logging overhead will have a disproportionate impact on system performance.
These two concepts are points on a continuum, and can be used to reinforce each other. A measurable protocol that remains contained within a given domain could carry decryptable identifiers linking to log entries of an observable system, to allow passive measurement devices to correlate exposed transport metrics with application-layer events. A protocol running over the Internet could augment the information in its wire image with diagnostic information sent to a third-party logging provider, which could make it available to a network operator for connection diagnosis after the fact. The ideas here are inchoate (and I’d love pointers to work I haven’t seen yet developing them further), but I look forward to finding some time to work them out a bit more in the future.
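To sketch the first of these ideas in code: if a passive measurement device produced RTT samples keyed by some decryptable connection identifier (an assumption for illustration – no such identifier is part of QUIC’s wire image today), and an observable system logged events carrying the same identifier, correlation reduces to a join:

```python
def correlate(rtt_samples, log_events):
    """Attach to each log event the RTT sample nearest in time from the
    same connection.

    rtt_samples: iterable of (conn_id, timestamp, rtt) tuples
    log_events:  iterable of dicts with "conn_id" and "ts" keys
    """
    by_conn = {}
    for conn_id, ts, rtt in rtt_samples:
        by_conn.setdefault(conn_id, []).append((ts, rtt))
    for event in log_events:
        samples = by_conn.get(event["conn_id"])
        if samples:
            # Nearest-in-time sample; a real system would window this.
            ts, rtt = min(samples, key=lambda s: abs(s[0] - event["ts"]))
            event["nearest_rtt"] = rtt
        yield event
```

A real system would bound the join with a time window and handle identifier churn; this just shows the shape of the correlation.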
(1): That the Internet engineering community has bent over backward to address a seven-year-old release of information about a decade-old subset of the capabilities of a couple of state security agencies, while possibly losing focus on other threats to privacy and security in the Internet, is the subject of another rant or three.
(2): One could make a good argument that applications-over-transport is stuck in a pre-Web time, and that applications have moved on, as most of them are built around resources as opposed to connections, using HTTP as a session and presentation layer. From a network standpoint, I think measurability still binds to transport, especially as HTTP is and should be behind the encryption veil due to its semantic content.
(3): I’m claiming “m11y” for myself; “o11y” appears to have been coined by honeycomb.io’s Charity Majors.
(4): You do have to treat your logging infrastructure as if it can fail, but that’s a different problem.