When we started developing the RIPE Atlas platform in 2010, security was (and still is) an important design principle, specifically the need to protect probe hosts.
In this article, we go into detail about the underlying architecture and the way we ensure secure communication between the various components.
RIPE Atlas architecture
In the image below you can see a high-level overview of the RIPE Atlas infrastructure demonstrating how components are connected to each other:
When probes connect to the network, they use the pre-installed trust anchor material (predefined keys and addresses) to connect to our so-called ‘registration servers’. After analyzing the probe’s geolocation, the current load on the various controllers and other parameters, the registration servers decide which ‘controller’ to direct each probe to. They provide the probe’s key to the controller and the controller’s key to the probe. The probe then disconnects from the registration server and connects to its assigned controller. This connection is maintained for as long as possible.
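The controller choice made by the registration servers can be pictured as a scoring function. The post names geolocation and current load as inputs; the `Controller` fields, the weighting between distance and load, and the load cut-off below are all hypothetical, chosen only to illustrate the idea rather than to reproduce the RIPE NCC's actual logic.

```python
import math
from dataclasses import dataclass


@dataclass
class Controller:
    name: str    # hypothetical controller identifier
    lat: float   # controller location in degrees
    lon: float
    load: float  # fraction of capacity in use, 0.0 to 1.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def choose_controller(probe_lat, probe_lon, controllers,
                      max_load=0.9, km_per_load_point=50.0):
    """Pick the controller with the lowest combined distance/load score.

    Controllers at or above max_load are skipped. Each percentage point of
    load counts as km_per_load_point kilometres of distance (assumed weights).
    """
    candidates = [c for c in controllers if c.load < max_load]
    if not candidates:
        raise RuntimeError("no controller has spare capacity")

    def score(c):
        distance = haversine_km(probe_lat, probe_lon, c.lat, c.lon)
        return distance + c.load * 100 * km_per_load_point

    return min(candidates, key=score)


# Example: a probe near Utrecht with three hypothetical controllers.
controllers = [
    Controller("ctr-ams", 52.37, 4.90, 0.5),
    Controller("ctr-fra", 50.11, 8.68, 0.2),
    Controller("ctr-ams2", 52.37, 4.90, 0.95),  # overloaded, never chosen
]
best = choose_controller(52.09, 5.12, controllers)
```

With these made-up weights, a lightly loaded controller a few hundred kilometres away beats a half-full one next door; tuning that trade-off is exactly the kind of policy decision the registration servers encapsulate.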
The controllers, like many other components, communicate with other components asynchronously via a Message Queue cluster. This message queuing gives us a great deal of flexibility in managing our distributed infrastructure: components can be added or removed without the need to synchronize the whole infrastructure. Each component (machine) can also disconnect from the infrastructure temporarily; messages are buffered at various levels until the connection is restored.
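The buffering behaviour can be illustrated with a small outbox that holds messages while a component is disconnected and delivers them in order once the link is back. The post does not name the message-queue software, so the `BufferedPublisher` class and its `send` callback are hypothetical; the sketch shows only one buffering level, not the several the infrastructure uses.

```python
from collections import deque
from typing import Callable


class BufferedPublisher:
    """Hypothetical outbox: queue messages locally, deliver when connected."""

    def __init__(self, send: Callable[[str], None], max_buffer: int = 10_000):
        self._send = send                        # delivers one message to the queue cluster
        self._buffer = deque(maxlen=max_buffer)  # oldest messages dropped on overflow
        self.connected = False

    def publish(self, message: str) -> None:
        self._buffer.append(message)
        if self.connected:
            self.flush()

    def on_connect(self) -> None:
        self.connected = True
        self.flush()

    def on_disconnect(self) -> None:
        self.connected = False

    def flush(self) -> None:
        # Deliver in FIFO order; keep the remainder if the link drops mid-flush.
        while self._buffer and self.connected:
            message = self._buffer[0]
            try:
                self._send(message)
            except ConnectionError:
                self.connected = False
                break
            self._buffer.popleft()  # only remove once delivery succeeded

    def pending(self) -> int:
        return len(self._buffer)
```

Because a message is removed from the buffer only after `send` succeeds, a component that drops off mid-flush resumes from the first undelivered message when it reconnects.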
RIPE Atlas probe generations
We have gone through three versions of RIPE Atlas probes and a fourth one is under development. RIPE Atlas anchors are part of the overall infrastructure.
RIPE Atlas v1 and v2 probes
The initial RIPE Atlas probes were custom made and based on the Lantronix XPortPro embedded device. The only difference between them is that v2 has twice the memory v1 has: 16MB vs 8MB (yes, megabytes!).
These devices had pros (very low power usage) and cons (long reboot times and high production costs). There are still around 2,000 v1 and v2 functioning probes out there. They have certainly lived well beyond their expected lifetime.
RIPE Atlas v3 probe
After using the v1 and v2 probes for about three years, we started looking for a new suitable device and found the TP-Link MR3020 that we equipped with a small USB stick. It is a lot cheaper and faster than the v1 and v2 probes, since it’s originally an off-the-shelf travel router produced in large quantities.
Unfortunately, we recently realised that the USB stick caused more issues than we had anticipated, which led to a temporary drop in the number of connected probes. Fortunately, new firmware has mostly resolved this. Nevertheless, we are now looking into brand new hardware that won’t have these issues.
RIPE Atlas v4 probe
We are currently evaluating the NanoPi NEO Plus2. It looks promising, but we are still investigating the practicalities, including its stability, power management, heat dissipation, procurement, casing and software lifecycle.
Communication between probes and the infrastructure
As mentioned above, the probes connect to controllers and try to keep that connection active for as long as possible. The connection is used both to report measurement results back to the servers, so they can be made available to others, and to send commands (measurements to execute) to the probe. Probes also receive firmware updates from the RIPE Atlas infrastructure over this channel.
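To make the two directions of the channel concrete, here is a sketch of a probe-side handler for controller messages plus an encoder for results going the other way. The JSON format and the `type` values (`measure`, `firmware`, `result`) are invented for illustration; the actual RIPE Atlas wire protocol is not described in this post.

```python
import json


def handle_controller_message(raw: str, state: dict) -> dict:
    """Apply one (hypothetical) controller-to-probe message to local state."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "measure":
        # Schedule a measurement; its results later flow back on the same channel.
        state.setdefault("scheduled", []).append(msg["spec"])
    elif kind == "firmware":
        # Only note that newer firmware exists; installing it is a separate step.
        state["available_firmware"] = msg["version"]
    else:
        raise ValueError(f"unknown message type: {kind!r}")
    return state


def result_message(measurement_id: int, payload: dict) -> str:
    """Encode a (hypothetical) probe-to-controller result message."""
    return json.dumps({"type": "result", "id": measurement_id, "data": payload})
```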
All communication happens over SSH, with port forwarding in both directions. In addition to individual SSH keys, session keys and allocated ports are used. It’s interesting to note that we run SSH over port 443, because we anticipated that it would have a higher chance of getting through than the ‘standard’ SSH port 22.
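As a rough illustration of such a tunnel, the function below builds an OpenSSH client command line that connects on port 443 and sets up one forwarding in each direction. The `-p`, `-i`, `-N`, `-L`, `-R` and `-o` options are standard OpenSSH client flags; the host name, user name, key path and port numbers are hypothetical, and the real probe software may establish its session quite differently.

```python
import shlex


def probe_ssh_command(controller: str, key_path: str,
                      local_port: int, remote_port: int,
                      ssh_port: int = 443) -> list:
    """Return an ssh argv with one local and one remote port forward.

    All concrete values (user 'probe', paths, ports) are hypothetical.
    """
    return [
        "ssh",
        "-p", str(ssh_port),               # SSH on 443 rather than 22
        "-i", key_path,                    # per-probe private key
        "-N",                              # no remote command, forwarding only
        "-o", "ExitOnForwardFailure=yes",  # give up if a forward cannot be set up
        "-o", "ServerAliveInterval=60",    # detect dead connections
        # Probe -> controller: connections to local_port on the probe are
        # forwarded to the same port on the controller side.
        "-L", f"{local_port}:localhost:{local_port}",
        # Controller -> probe: the controller listens on remote_port and
        # forwards connections back to the probe.
        "-R", f"{remote_port}:localhost:{remote_port}",
        f"probe@{controller}",
    ]


cmd = probe_ssh_command("ctr.example.net", "/etc/probe/id_ed25519", 2023, 40123)
print(shlex.join(cmd))
```

Returning an argument list rather than a shell string means the command can be passed directly to `subprocess` without any shell quoting concerns; `shlex.join` is used only to display it.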
The figure below shows how the communication channel is set up in our scenario.
We knew in advance that OpenSSH offered good control over local port forwarding. However, restrictions on remote port forwarding were not implemented out of the box, so we had to add them ourselves.
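The post does not describe the OpenSSH change itself, but the policy such a restriction enforces can be sketched: a remote forwarding request (the SSH `tcpip-forward` global request) is accepted only if it asks for the port allocated to the authenticated probe in its current session. The `Allocation` record, the allocation table and the function name below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    session_id: str  # hypothetical per-session identifier
    port: int        # the single remote port this probe may listen on


def allow_remote_forward(probe_id: str, session_id: str, bind_addr: str,
                         port: int, allocations: dict) -> bool:
    """Decide whether to honour a 'tcpip-forward' request from a probe."""
    alloc = allocations.get(probe_id)
    if alloc is None or alloc.session_id != session_id:
        return False  # unknown probe or stale session
    if bind_addr not in ("localhost", "127.0.0.1"):
        return False  # never expose the forwarded port beyond the controller
    return port == alloc.port
```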
A lot of work has been done on bi-directional communication technology since we started RIPE Atlas. If we were building the RIPE Atlas architecture today, we would seriously consider using WebSockets over HTTPS. At this point, however, they do not seem to offer enough benefits over our existing solution to justify the change.
Each probe uses an individual SSH key to authenticate itself, and we can disable each probe separately if necessary, for example if there is evidence of tampering. Probes only perform active measurements and cannot listen to traffic. Probes do not provide any local services: there is no web server and no local configuration is possible, which significantly limits the attack surface against them.
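Per-probe keys turn revocation into a simple lookup. The sketch below keeps a hypothetical registry mapping each probe's OpenSSH-style SHA256 key fingerprint to its ID and an enabled flag, so disabling one probe rejects its key without affecting any other. It assumes the SSH server has already checked that the client holds the matching private key; only the "is this key still allowed?" step is shown.

```python
import base64
import hashlib


def fingerprint(public_key_blob: bytes) -> str:
    """OpenSSH-style SHA256 fingerprint of a raw public-key blob."""
    digest = hashlib.sha256(public_key_blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")


class ProbeRegistry:
    """Hypothetical registry of probe keys with per-probe disable."""

    def __init__(self):
        self._by_fp = {}    # fingerprint -> probe id
        self._enabled = {}  # probe id -> bool

    def register(self, probe_id: str, public_key_blob: bytes) -> None:
        self._by_fp[fingerprint(public_key_blob)] = probe_id
        self._enabled[probe_id] = True

    def disable(self, probe_id: str) -> None:
        # e.g. when there is evidence of tampering
        self._enabled[probe_id] = False

    def authenticate(self, public_key_blob: bytes):
        """Return the probe id for a known, enabled key, else None.

        Assumes proof of private-key possession was already verified by SSH.
        """
        probe_id = self._by_fp.get(fingerprint(public_key_blob))
        if probe_id is None or not self._enabled.get(probe_id, False):
            return None
        return probe_id
```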
The local USB stick on the v3 devices is encrypted with individual probe keys, which protects against local firmware attacks. When new firmware is available, probes update in a ‘lazy’ fashion: they learn about the new firmware and upgrade the next time they reconnect to the infrastructure, although we also have means to make them upgrade sooner if needed.
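The lazy upgrade policy amounts to a decision taken at each reconnect. In this sketch, firmware versions are assumed to be plain integers, and the `force` flag stands in for the unspecified mechanism used to make probes upgrade sooner; the `FirmwareTracker` class is hypothetical.

```python
from typing import Optional


class FirmwareTracker:
    """Track announced firmware and decide when to install it (lazy policy)."""

    def __init__(self, running: int):
        self.running = running                # installed version (assumed integer)
        self.announced: Optional[int] = None  # newest version learned from controller
        self.forced = False                   # hypothetical "upgrade soon" signal

    def announce(self, version: int, force: bool = False) -> None:
        # Just remember the newest version; do not disrupt the current session.
        if self.announced is None or version > self.announced:
            self.announced = version
        self.forced = self.forced or force

    def should_upgrade(self, reconnecting: bool) -> bool:
        if self.announced is None or self.announced <= self.running:
            return False     # nothing newer to install
        if self.forced:
            return True      # operators asked for a faster upgrade
        return reconnecting  # default: wait for the next reconnect

    def installed(self) -> None:
        if self.announced is not None:
            self.running = self.announced
        self.forced = False
```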
Each firmware upgrade is cryptographically verified: firmware images are signed offline by our development team and, just before upgrading, the probes verify the signature using pre-installed public keys. Version 1 to 3 probes can also upgrade their operating system this way.
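The verify-before-install step is sketched below. The post does not say which signature algorithm is used, so this example swaps in textbook RSA over a SHA-256 digest, with a toy key built from the Mersenne primes 2**89-1 and 2**107-1, purely so it runs on Python's standard library. It is not secure (no padding, tiny modulus); a real deployment would rely on a vetted library and scheme. The function names and key handling are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

PublicKey = Tuple[int, int]  # (n, e)


def _digest_int(data: bytes, n: int) -> int:
    # Hash the image and reduce it into the RSA group (toy-sized modulus).
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def make_toy_keypair():
    """Offline step: demo key from two known primes. INSECURE, illustration only."""
    p, q = 2**89 - 1, 2**107 - 1
    n, e = p * q, 65537
    d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (Python 3.8+)
    return (n, e), (n, d)


def sign(image: bytes, private_key) -> int:
    n, d = private_key
    return pow(_digest_int(image, n), d, n)


def verify_firmware(image: bytes, signature: int,
                    trusted_keys: Iterable[PublicKey]) -> bool:
    """Accept the image only if some pre-installed public key validates it."""
    for n, e in trusted_keys:
        if 0 <= signature < n and pow(signature, e, n) == _digest_int(image, n):
            return True
    return False


public, private = make_toy_keypair()  # signing happens offline, by the developers
image = b"firmware image bytes"
signature = sign(image, private)
ok = verify_firmware(image, signature, [public])  # probe side, before upgrading
```

Only the public half of the key ever needs to be on the probe, which is what allows the signing key to stay offline.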
The RIPE NCC operations team manages the operating system and the probe firmware package of the RIPE Atlas anchors.
Responsible disclosure procedure
Despite all the precautions and the conscious security decisions made in the design, we all know that bad things can happen. It is important that such incidents are reported to the RIPE NCC before others can exploit them. We provide a responsible disclosure procedure, and we have received a number of such reports since our launch.
We also commission security audits of the entire RIPE Atlas infrastructure on a regular basis, looking at different aspects of our system every time.
Seven years on
It has been fun figuring out how to build such a system over the last seven years.
Although there are other measurement networks besides RIPE Atlas, each has its own unique features, so there is no single best practice for building one, which made building our system a very exciting experience.
I’d like to take this opportunity to give credit to the whole RIPE Atlas development team for recognizing and solving the difficult issues along the way!
Original post appeared on RIPE Labs. Contributors: Mirjam Kühne
Robert Kisteleki is the Manager of the Research and Development department at RIPE NCC.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.