
In a beautifully typeset post, Paul Tagliamonte critiqued the meme that almost any failure in a software service can be attributed to the Domain Name System (DNS). He coined a new, unnamed rule:
If you can replace ‘DNS’ with ‘key value store mapping a name to an ip’ and it still makes sense, it was not, in fact, DNS.
Paul Tagliamonte
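To see what the rule is getting at, here is a minimal Python sketch (all names and addresses are invented for illustration) in which the ‘DNS’ is nothing more than an in-memory key-value store mapping a name to an IP. If a failure story still reads the same way against this stand-in, the DNS protocol machinery was probably not the cause.

```python
# A stand-in for 'a key value store mapping a name to an ip'.
# Hypothetical names and addresses; this illustrates the rule,
# not any real resolver.
NAME_TO_IP = {
    "api.example.internal": "203.0.113.10",
    "db.example.internal": "203.0.113.20",
}

def resolve(name: str) -> str:
    """Return the address for a name, or fail if it is missing."""
    try:
        return NAME_TO_IP[name]
    except KeyError:
        raise LookupError(f"no record for {name!r}") from None

# 'Someone deleted the record' or 'the store served stale data' both still
# make sense here, with no resolvers, DNSSEC, or UDP in sight.
print(resolve("api.example.internal"))
```

Failure stories that only make sense once resolvers, caches, or DNSSEC enter the picture are the ones that are ‘actually the DNS’.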
Meanwhile, Jonathon Belotti interrogated the incident report that Amazon produced following the recent outage in their us-east-1 region, including the role of the DNS in that incident. His post illustrates the power of Paul’s rule by drawing out the lessons to be learned and thinking critically, even when a DNS query is involved.
DNS beyond the meme
Rather than stop at the memetic explanation, Jonathon shared a seasoned take on the different components of a complex system that failed, and how their failures had a cumulative impact.
Both Paul and Jonathon distinguish between events that are actually ‘inside’ the DNS and events that surface as a failure in DNS-related information but are not, in fact, caused by the DNS as such.
Paul gives an example of each:
- Actually, the DNS: A divergence in DNSSEC interpretation that led the public DNS services 8.8.8.8 (operated by Google) and 1.1.1.1 (operated by Cloudflare) to return different responses during a DNSSEC rollback for slack.com, causing a partial Slack outage that depended on which public resolver answered your DNS queries.
- Not actually the DNS: An outage that disrupted the DNS data flow for longer than the time-to-live (TTL) of your cached records, leading your automation to remove DNS data (the us-east-1 outage); a simplified sketch follows this list.
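As a rough sketch of that second case (in Python, with invented names, TTLs, and timestamps; this is not Amazon’s actual automation), a record can go ‘stale’ simply because the component that refreshes it has stalled, and a cleanup job that cannot tell the difference then removes the data outright:

```python
# Simplified, hypothetical sketch: a cached record outlives its TTL because
# the writer that should refresh it has stalled, and cleanup automation then
# deletes the 'stale' data. Names and timings are invented for illustration.
TTL = 60  # seconds

store = {"service.internal": {"ip": "198.51.100.7", "written_at": 0}}

def refresher_healthy() -> bool:
    # Pretend the component that periodically rewrites records has stalled.
    return False

def cleanup(now: int) -> None:
    """Automation that prunes records it believes are obsolete."""
    for name, record in list(store.items()):
        stale = now - record["written_at"] > TTL
        if stale and not refresher_healthy():
            # The record is stale only because the refresher is down, but the
            # cleanup job cannot tell the difference, so the name vanishes.
            del store[name]
            print(f"removed {name}: lookups now fail, yet the DNS itself did nothing wrong")

cleanup(now=120)  # well past the TTL, while the refresher is still stalled
```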
Jonathon’s post probes more deeply into the outage, following the same line of reasoning. Software has many dependencies, and failures can stem from design flaws in how an online system is constructed, or from sets of smaller problems that line up, as in the classic ‘Swiss cheese’ model of accident causation that James Reason advanced in the early 1990s.
When dogfooding collides with name-address mapping
Jonathon’s post also raised the question of ‘dogfooding’ — deliberately designing your systems to depend on internally managed and maintained components that you also offer to the wider community. The term comes from the saying ‘eat your own dog food’, presumably because if you can’t stomach it, your dog can’t either. Dogfooding is something all significant service delivery organizations come to, sooner or later.
The problem in Jonathon’s example reminded me of a challenge faced by an older order of service delivery network — the electricity supply network. Electricity networks must be able to manage the so-called ‘black start condition’, when a large portion of the electricity network goes offline after being separated from the rest of the network.
There has to be some device, such as a small petrol generator, which can be used to make enough independent voltage to power up the control systems to excite the electromagnets of a bigger generator. The bigger generator can then begin to make enough electric power to re-energize the network, return control systems to a fully functional state, and reconnect to the rest of the network at a synchronized frequency.
You need this in electrical networks, and you need an equivalent in software-as-a-service networks: manual intervention from engineers, and the ability to roll back to a known-good state.
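Translated into a name-mapping service, that might look something like the following Python sketch (all names and addresses are invented): a ‘break-glass’ path that serves a small, known-good snapshot kept outside the dynamic store it exists to restart, giving engineers somewhere to stand while they bring the rest back.

```python
# A minimal 'black start' sketch for a name-mapping service. When the dynamic
# store is unreachable, an operator-triggered fallback serves a known-good
# snapshot that is deliberately kept out of band. All values are invented.
KNOWN_GOOD_SNAPSHOT = {
    "control-plane.internal": "192.0.2.10",
    "deploy.internal": "192.0.2.11",
}

def load_dynamic_store() -> dict:
    raise ConnectionError("dynamic name store is down")  # simulate the outage

def resolve_with_fallback(name: str) -> str:
    try:
        mapping = load_dynamic_store()
    except ConnectionError:
        # Break-glass path: roll back to the last known-good state, stored
        # outside the system it is meant to help restart.
        mapping = KNOWN_GOOD_SNAPSHOT
    return mapping[name]

print(resolve_with_fallback("control-plane.internal"))
```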
Then, when you run your root cause analysis, treat the fact that the DNS is implicated as a signal that the interdependencies of your complex systems are inadequately understood. Pay special attention to how components like name-to-address lookup interact with the rest of the system.
In effect, it’s not always the DNS, but it is often exposed through the DNS.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.