Thanks to everyone who responded to my earlier post about the need to preserve and curate the history of Internet measurement for future historians.
Based on the interesting conversations so far, I thought I would go a step further and collect some thoughts on how this might actually be made actionable, in the form of something that could evolve into an Internet History Initiative (IHI). You can now:
- Register your interest as a potential supporter or contributor: Google Form
- Visit the IHI status page.
- Follow the initiative on Mastodon.
Our goal is to collectively figure out how to index and curate the history of bulk Internet measurement datasets, preserve them against loss, and interpret their collective legacy for future generations of historians.
Three common responses to my first piece were:
Isn’t this already the job of (existing repository)? Let’s just support them
Yes, the first job for an IHI (and for everyone in the Internet measurement community) should be supporting and celebrating the projects that have preserved this data since the day they collected it.
RouteViews and RIPE, in particular, need a lot of love from their respective communities to keep doing what they do, and the job of an IHI would be to make sure everyone understands the value of the irreplaceable datasets they maintain for all of us to study.
That would be really expensive, who’s going to pay for it?
(‘That’ being the costs of making really large datasets more persistent and usable.)
Ask any librarian — persistence is expensive, and it requires continuous investment.
A core goal of the IHI would be advocacy for measurement:
- Making sure the originating institutions receive sufficient collection management resources to cement and extend their historical legacy.
- Finding enough resources on top of that to improve survivability, availability, and integrated access across multiple collections, beyond any single project.
I’ve heard from many of you already, offering to point the way toward resources to support these goals, so thank you. Let me circle back after we’ve refined the ideas a bit in a collective conversation.
Don’t forget to include (this other fantastic, large dataset)!
Ah, here are my people. We haven’t even started yet and there’s more to do! It’s a good reminder that whatever we build together needs to scale to include several measurement products, not just the historical Border Gateway Protocol (BGP) and traceroute datasets. We need to capture the emergence of things like the Resource Public Key Infrastructure (RPKI), which have changed the way the Internet operates. And we need the ‘meat on the bones’ that will help historians interpret what hosts were doing with all those IP addresses, and where those hosts were located throughout history.
Let’s take a quick look at the potential order of operations for getting an IHI off the ground, starting with the challenges I mentioned in my previous post.
Challenge #1: Preservation
Hold onto this one for a moment, and we’ll come back to it. You’ll see why.
Challenge #2: Curation
A reasonably small number of older organizations currently have large amounts of original Internet measurement data on hand. They make it available at old, well-known URLs that tend not to change for many years, so those well-known URLs are the most persistent identifiers available. The content is too large to trivially mirror, and since the data is used by a relatively small research community, there often aren’t many (if any) available mirror sites.
There’s also very little metadata available separate from the data artifacts themselves. Generally, the URLs themselves encode the most important aspects, such as collector location and time range (yyyymmdd.hh), through some combination of yyyy/mm/dd paths and filename conventions.
To pick an example at random, here’s a set (.bz2) of BGP updates from the route-views2.oregon-ix.net BGP collector, generated 22 years ago, with 15 minutes of data starting at 23:15 PST on 11 January 2002.
At RouteViews, the URL encodes collection time in the local timezone of the collector (here, Pacific Standard Time, UTC -8, since RouteViews didn’t change to UTC for most of its collectors until 2003). That means that the data encoded in the file covers (approximately) the 15-minute UTC interval starting at 07:15 on 12 January 2002.
Here’s a roughly contemporaneous set of observations (.gz) from the original rrc00 route collector within RIPE RIS:
In this case, if someone wanted to study (for example) an eight-hour window of Internet history from all available sources, starting at midnight UTC on 12 January 2002, they’d have to do some curatorial work upfront, before any data is retrieved:
- Identify the available products (in this example, BGP RIBs and update dumps).
- Track them down to their originating institutions (in this example, RIS and RouteViews).
- Identify all of the subproducts within each institution (for example, the BGP collectors that were live at the time).
- Synthesize sets of institution-specific URLs that correspond to the study window.
- Identify tools to uncompress and unpack the data formats represented by all those files.
This is work that can’t be avoided, but if we’re smart, we can do it once, automate it, and let future researchers skip these steps and get right to work on the data.
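To make that concrete, here’s a minimal Python sketch of the ‘do it once and automate it’ step for the update dumps above. Everything in it (the URL templates, the collector list, and the 2003 cutover date for RouteViews local-time filenames) is an assumption lifted from the examples in this post, not an authoritative inventory; the real layouts would themselves be curated metadata.

```python
from datetime import datetime, timedelta, timezone

# Illustrative URL templates only -- the real path layouts differ by
# institution, collector, and era.
TEMPLATES = {
    "route-views2": ("http://archive.routeviews.org/bgpdata/"
                     "{t:%Y.%m}/UPDATES/updates.{t:%Y%m%d.%H%M}.bz2"),
    "rrc00": ("https://data.ris.ripe.net/rrc00/"
              "{t:%Y.%m}/updates.{t:%Y%m%d.%H%M}.gz"),
}

# Assumed quirk, taken from the example above: before 2003, route-views2
# filenames encode local Pacific Standard Time (UTC-8), so a UTC study
# window has to be shifted before the filename is generated.
UTC = timezone.utc
LOCAL_OFFSET = {"route-views2": timedelta(hours=-8), "rrc00": timedelta(0)}
UTC_CUTOVER = {"route-views2": datetime(2003, 1, 1, tzinfo=UTC),
               "rrc00": datetime(1999, 1, 1, tzinfo=UTC)}

def update_urls(start_utc, end_utc, step=timedelta(minutes=15)):
    """Yield (collector, url) pairs covering [start_utc, end_utc)."""
    for name, template in TEMPLATES.items():
        t = start_utc
        while t < end_utc:
            stamp = t if t >= UTC_CUTOVER[name] else t + LOCAL_OFFSET[name]
            yield name, template.format(t=stamp)
            t += step

# The eight-hour study window from the example: midnight UTC, 12 January 2002.
if __name__ == "__main__":
    start = datetime(2002, 1, 12, 0, 0, tzinfo=UTC)
    for name, url in update_urls(start, start + timedelta(hours=8)):
        print(name, url)
```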
Persistent digital identifiers for measurement artifacts
One stepping stone to making this work automated and repeatable (and, over time, making the underlying artifacts more available) would be the introduction of persistent digital identifiers for each of our historical artifacts. There are many competing identifier standards, but for a variety of reasons, I think Archival Resource Keys (ARKs) are a reasonable fit.
“All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection.”
David Wheeler, by way of Butler Lampson
The basic idea behind an ARK is to put a layer of indirection between the retrieval resource (the old familiar URL that you’d use to grab the gigabytes) and its ‘permanent identity’ within our global collection of Internet measurement datasets. That opens the door to associating metadata with each accessioned artifact to support efficient search and filtering, and to pointing the user toward different homes for the same historical artifact, depending on their location (for example, the original source, or a cached replica preserved somewhere else).
There’s nothing particularly magic about an ARK, or indeed about any form of digital identifier. At best, it represents an institutional promise to invest the energy to maintain the persistent connection between an identifier and the underlying artifact and to keep the associated metadata current and correct.
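As a purely hypothetical illustration, an accession record binding an ARK to one of the update files above might carry fields like these. The naming scheme and every field name are placeholders rather than a proposal; 99999 is the ARK namespace reserved for test identifiers.

```python
from dataclasses import dataclass, field

# A hypothetical accession record; nothing here is an adopted IHI schema.
@dataclass
class AccessionRecord:
    ark: str                 # e.g. "ark:/99999/ihi-rv2-upd-200201120715"
    institution: str         # originating collection ("RouteViews", "RIPE RIS", ...)
    collector: str           # sub-product within the institution ("route-views2", "rrc00", ...)
    product: str             # "bgp-updates", "bgp-rib", "traceroute", ...
    window_start_utc: str    # ISO 8601, already normalized to UTC
    window_end_utc: str
    media_type: str          # data format of the reference instance
    md5: str                 # checksum of the reference instance
    locations: list[str] = field(default_factory=list)  # original URL plus any replicas
```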
Resolving digital identifiers to underlying digital resources
The magic of ARKs is in the resolver step, and for the IHI we might construct a time-aware resolver library, to make the common case easy — generating and retrieving a stream of artifacts that represent all of the available BGP updates collected during a given time window, across all collector locations, across all contributing institutions.
A good resolver would probably also support suffix-based content negotiation; for example, a researcher might prefer to retrieve a BGPDump text file generated from the original binary MRT format if both were available as alternative resolutions of a single ARK. The single ARK preserves confidence that the original and alternative formats represent ‘the same’ measurement set, a connection that remains intact and unchanged.
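Here’s a minimal sketch of that resolution step, assuming an in-memory index built from accession records like the hypothetical one above; the `prefer` and `near` arguments stand in for suffix-based content negotiation and location-aware replica selection.

```python
# index maps an ARK to a list of (media_type, url, site) alternatives,
# all of which are promised to carry 'the same' measurement set.
def resolve(index, ark, prefer=None, near=None):
    """Return the best retrieval URL for an ARK."""
    alternatives = index[ark]
    if prefer:  # e.g. prefer a text rendering over the original binary MRT
        alternatives = [a for a in alternatives if a[0] == prefer] or index[ark]
    if near:    # favour a replica hosted at (or near) the named site
        alternatives = sorted(alternatives, key=lambda a: a[2] != near)
    media_type, url, site = alternatives[0]
    return url
```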
Decentralizing the resolution step
(Challenge #1: Preservation)
So the first step for an IHI would be to formulate the namespace for the standard sorts of archival products that are out there, and then work with each hosting institution to create ARKs and associated metadata for the objects in their collection, while pulling a reference instance of each artifact to create an offsite copy, along with its verifying metadata (data format, MD5 checksum).
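In practice, the verifying metadata captured while pulling each reference instance could be as simple as something like this (field names are, again, only illustrative):

```python
import hashlib
import os

# Compute the verifying metadata for a pulled reference instance.
def verifying_metadata(path, media_type):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return {
        "media_type": media_type,
        "size_bytes": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }
```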
Without pulling down the reference instance, archivists might see this as violating an important rule — only create persistent keys for things whose persistence you are responsible for (a guiding principle sometimes stated as ‘curate your own stuff, not other people’s stuff’).
But one of the overriding goals of getting all of these already-public collections into one curatorial namespace will be to let the community do the curatorial work and avoid imposing extra work onto the individual collections. Another will be to set up a path to improving the persistence of the collections through replication.
As already mentioned, for example, we can add content-distribution functionality to the ARK resolution layer based on preference or user location, allowing a single ARK to resolve to any of several URLs, each guaranteed to return an identical copy of the same historical digital artifact, thus opening the door to higher availability.
Finally, we’d probably want to decentralize and replicate the resolver layer itself, to make sure that the death of a single Internet history project wouldn’t somehow spell the end of Internet history. The map of ARKs to artifact URLs should be something that anyone can pull from a public repository and rebuild for themselves. We could publish updates to the accession map over ActivityPub. We could push the resolver map into the Interplanetary File System (IPFS). Perhaps we could federate the artifacts’ storage to improve their multisite persistence. The topic needs some energetic exploration.
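One deliberately plain option, as a starting point for that exploration: publish the accession records as newline-delimited JSON in a public repository, so anyone can pull the file and rebuild a resolver locally. Pushing the same file into IPFS, or announcing updates over ActivityPub, would layer on top of this and isn’t sketched here.

```python
import json

# Dump accession records (for example, AccessionRecord instances from the
# sketch above) to a flat, replicable resolver map.
def dump_resolver_map(records, path):
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec.__dict__) + "\n")

# Rebuild a local index in the shape expected by resolve() above.
def load_resolver_map(path):
    index = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            index[rec["ark"]] = [(rec["media_type"], url, None)
                                 for url in rec["locations"]]
    return index
```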
Challenge #3: Interpretation
So, returning to the original motivation — say that we want to study an eight-hour period of Internet history beginning at midnight UTC on 12 January 2002. We’d want our system to generate ARKs representing all of the known historical data products that cover that window, of all different flavours. The associated metadata would let us select tools (or spin up servers with the toolchains in place) to let us process those datasets.
If the data in question has been mirrored, there may be many independent-but-identical archival copies of the data we need, letting us spread our demand over all of the available copies, or pick a mirror that’s ‘close’ to where we will be processing the data.
Researchers will then run their code over the artifacts to generate their own derived data products. Eventually, if they are useful and stable, those data products in turn will need to find a home. Perhaps the original institutions provide some space for that as well; perhaps these products live on in the researchers’ own repositories (along with their local copies of the data they pulled to do the reduction), all accessioned with their own ARKs, and all made discoverable and replicable by other researchers within the IHI. The metadata of derived copies can include the toolchain information you’d need to replicate the reduction process yourself, as well as the ARKs the reduction was sourced from. Warm fuzzy feelings all around.
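A hypothetical provenance record for such a derived product might look like the following; every field name, ARK, tool, and version number here is a placeholder for illustration only.

```python
# Illustrative provenance for a derived data product, accessioned with its own ARK.
derived_product = {
    "ark": "ark:/99999/ihi-derived-asgraph-20020112",   # placeholder ARK (test namespace)
    "description": "AS-level adjacency graph for the eight-hour study window",
    "derived_from": [                                    # source artifact ARKs
        "ark:/99999/ihi-rv2-upd-200201120715",
        "ark:/99999/ihi-rrc00-upd-200201120000",
    ],
    "toolchain": {                                       # enough to replicate the reduction
        "reducer": "build_as_graph.py",                  # hypothetical script name
        "mrt_parser": "bgpdump",
        "python_version": "3.11",
    },
    "md5": "<checksum of the derived file>",
}
```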
In a perfect world, the repositories for the data would also offer compute services very close to the data, to make working with it easier. In some scenarios, it may actually be cheaper for institutions to allow researchers to run their extraction/reduction functions close to the data than to pay the bandwidth costs to push copies of the data to researchers over the Internet. There’s room for innovation here. But then we’re back to ‘who pays’.
Conclusion: Next steps
At this point, people from various Content Delivery Networks (CDNs) and cloud platforms will be lining up to fill my inbox with helpful outrage. There’s nothing new under the sun when it comes to content distribution strategies, high-availability storage solutions, and cloud computing. Everything on the wish list here could be accomplished by buying into one centralized platform or another, and they’d probably be glad to sponsor the effort.
It’s true that commercial CDN and cloud platform offerings are almost certain to be among the available solutions that we draw on at some point to preserve these datasets. But at the end of the day, I think it would be unwise (and expensive, and not in the Internet spirit) to put any single one of them at the heart of the Internet history challenge from the start (and yes, we keep circling back to who pays for it, don’t we?).
So there you have it. I’d like to invite the community to step up and tell me about your ideas for decentralized/federated alternative approaches to curation and the high availability of large scientific datasets. I’d really like to find a distributed, self-hosted solution that doesn’t make any one person or institution a single point of failure for preserving the critical datasets that make up the Internet’s history.
Jim Cowie is co-founder and Chief Data Scientist at DeepMacro. He has over 25 years of experience as a data storyteller in Internet measurement and recently launched the Internet History Initiative.
Adapted from the original on Jim’s blog.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.