On Repeatable Internet Measurement: Part One

I spent quite a lot of time in 2014 thinking about the following problem: if I hand you a paper that claims something about the Internet, based on data I cannot show you because I am bound by a nondisclosure agreement due to corporate confidentiality or user privacy issues, generated by code which is ostensibly available under an open-source license but which is neither intended to run outside my environment, nor tested to ensure it will produce correct results in all cases, nor maintained to ensure it is compatible with newer versions of the compiler, interpreter, or libraries it requires, what reason have I given you to believe what I say?

This question may be somewhat exaggerated; not every measurement study suffers from each of these problems. However, data availability and tool quality are two of the most important challenges in turning the art of Internet measurement into a science, and make verification of results difficult enough that it is attempted less often than it should be. The current situation thereby creates an incentive to take published results at face value, which may be damaging to our understanding of the Internet.

Repeatability is a current topic in science in general. Mininet, a VM-based environment for network emulation, was built in part to make “runnable papers” about networking possible, and has been applied to allowing students to build simple reproductions of well-known results in networking. In the larger context, a growing realization that lack of reproduction contributes to erosion of the trustworthiness of research findings has led to a broader movement to increase repeatability in research.

What follows was originally written as a research proposal for a project which, for various reasons, will not happen. Today’s post is more or less the problem statement. The next in the series will present the solution we proposed and why it wouldn’t work. And the one that follows that (if I make it that far) will consist of musings about what I think this means for the future of the science of Internet measurement.

Repeatability and Privacy

In 2004, Vern Paxson published Strategies for Sound Internet Measurement, a collection of proposed solutions to the myriad and sundry problems he and his collaborators had run into over a career of trying to measure the Internet. Almost a decade later, most of this advice still rings true: measurement experiments and tools require calibration, studies should be designed for repeatability, and given that the devil is always in the details, metadata is at least as important as data. While it is instructive and disappointing how little has changed in the intervening decade, there have been attempts in the measurement space to improve the situation.

The most important barrier to repeatable measurement using passively-observed traffic data is the availability of data to multiple researchers, made more difficult by the confidentiality of network traffic data. This is not an accidental or arbitrary constraint: unrestricted analysis of end-user traffic poses a grave threat to end-user privacy, a fact which has recently become better appreciated in civil society. Such analysis therefore carries additional legal and regulatory requirements, and often occurs only under restrictive agreements between data providers and researchers.

For much of the past decade, the anonymization of user-identifiable information was seen as the best way to protect user privacy, though at a cost to the utility of the anonymized data for analysis (see Burkhart et al., The Risk-Utility Tradeoff for IP Address Truncation). A summary of these techniques as applicable to passively measured flow data is given in RFC 6235. Anonymization can be thought of in terms of two utility functions: the utility of the data to the analysis at hand, and the utility of the data to the attacker attempting to break anonymization and attribute traffic to specific addresses and users. Burkhart et al. (with a slightly different “et al.”, including myself) showed in The Role of Network Trace Anonymization Under Attack that for traffic collection on the public Internet, it is always easier for the attacker to increase her utility than it is for the researcher, making “anonymize and publish” an unacceptable model for making data freely available. If a data set is to have utility for any but the narrowest of analysis tasks, technical means of data protection must be supplemented with social, regulatory, or legal means.
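To make the tradeoff concrete, here is a minimal sketch of IP address truncation, the simplest of the anonymization techniques mentioned above. It is an illustration only, not the method of any of the cited papers: the function names truncate_ipv4 and distinct_after_truncation are hypothetical, the sample addresses are taken from ranges reserved for documentation, and counting distinct truncated addresses is just a crude stand-in for both "utility" and "risk".

```python
import ipaddress


def truncate_ipv4(address: str, keep_bits: int) -> str:
    """Keep the keep_bits high-order bits of an IPv4 address, zero the rest."""
    if not 0 <= keep_bits <= 32:
        raise ValueError("keep_bits must be between 0 and 32")
    # strict=False tells the constructor to mask off the host bits
    # instead of rejecting an address that has them set.
    network = ipaddress.IPv4Network(f"{address}/{keep_bits}", strict=False)
    return str(network.network_address)


def distinct_after_truncation(addresses, keep_bits: int) -> int:
    """Count how many addresses remain distinguishable after truncation.

    Assumption for illustration: fewer distinct values means less for an
    attacker to work with, but also less for the analyst to work with.
    """
    return len({truncate_ipv4(a, keep_bits) for a in addresses})


# Hypothetical sample: two hosts on one /24 and one host elsewhere.
sample = ["192.0.2.77", "192.0.2.130", "198.51.100.9"]

for bits in (32, 24, 4):
    print(bits, distinct_after_truncation(sample, bits))
```

Keeping all 32 bits leaves every host distinguishable; keeping 24 merges the two hosts that share a network; keeping 4 merges everything. The same knob that hides users also erases the structure an analysis might need, which is the tradeoff in miniature.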

So, if data cannot be mobile, analysis must be. The simplest method for analysis mobility today is analyst mobility: researchers visit other institutions which have access to different data, and work on that data there. While this arrangement does allow network measurement researchers to see the world and collect valuable frequent-flyer miles, it does not scale particularly well. Nor does it necessarily allow the publication of studies spanning multiple data collections, subject to the terms of the agreement(s) under which the traffic data is collected and made available for research.

For example, the Trol project was designed to create a “privacy-safe” language for network data analysis, allowing analysis to proceed on unprotected data, with guarantees about the privacy impact of the intermediate results. This is but one example of a whole class of restricted domain-specific languages, each of which suffers from the fundamental risk-utility tradeoff that plagues anonymization: either a restricted language is too restricted to do interesting work in, or not restricted enough to automatically protect the intermediate or final results from deanonymization attacks.

A new approach to this problem appears to be necessary.

Repeatability and Code Quality

Assuming a solution to the privacy problem, there is still little incentive for researchers to think about the meta-problems of research, and to do the engineering necessary to support responsible data curation and access to make Internet measurement studies repeatable and comparable. First, the primary incentive for researchers is publication and citation, and while the additional work required to make research repeatable and maintainable often does lead to better results, it tends to have diminishing returns in publication terms. Second, in an environment where analysis mobility means analyst mobility, the people who wrote the code are always available to fix problems that may arise during analysis, and the fact that the devil is in the details virtually guarantees that problems will arise.

Indeed, many such problems result from the difference between the analyst’s initial assumptions about the network under measurement and the actual conditions of the network or the effects of measurement errors. However, this need for flexibility during the initial development of an analysis should not be mistaken for a sign that this workflow is the only way to perform traffic analysis research.

If we accept, as in the previous section, that fully automated approaches to data protection in network data analysis will not work, and that analyst-mobility approaches do not scale, then we can solve both problems at the same time by developing a manually-assisted approach which is designed to encourage code quality for repeatability alongside privacy. One such approach will be the subject of the post after next. (The next post is a quick interlude to talk about active measurement specifically, and what we’re doing to address the problems raised here in an active measurement study recently accepted to PAM.)