It was only a few weeks back, in July of this year, that I remarked that an Akamai report of an outage was unusual for this industry. It was unusual because it informatively detailed their understanding of the root cause of the problem. It also described the actions taken to rectify the immediate problem, the measures being undertaken to prevent a recurrence of this issue, and the longer-term measures to improve the monitoring and alerting processes used within their platform.
At the time I noted that it would be a positive step forward for this industry if Akamai’s outage report were not unusual in any way. It would be good if all service providers, once they had rectified an operational problem, spent the time and effort to produce such outage reports as a matter of standard operating procedure.
It’s not about apportioning blame or admitting liability. It’s all about positioning these services as the essential foundation of our digital environment and stressing the benefit of adopting a common culture of open disclosure and constant improvement as a way of improving the robustness of all these services. It’s about appreciating that these days these services are very much within the sphere of public safety and their operation should be managed in the same way. We should all be in a position to improve the robustness of these services by appreciating how vulnerabilities can lead to cascading failures.
On 4 October, Facebook managed to achieve one of the more impactful outages in the history of the Internet, assuming that the metric of ‘impact’ is how many users are annoyed by a single outage. In Facebook’s case, the six-hour outage affected some three billion users, if we can believe Facebook’s marketing hype.
So, what did we learn about this outage? What was the root cause? What were the short-term mitigations they put in place? Why did it take more than six hours to restore service? (For a configuration change that presumably had a back-out plan, that’s an impressively long time!) What are they doing now to ensure this situation won’t recur? What can we, as an industry, learn from this outage to avoid a recurrence of such a widespread outage in other important and popular service platforms?
These are all good questions, but if we’re looking for answers then Facebook’s outage report is not the place to find them. It’s short enough for me to reproduce in its entirety here:
Yes, they are ‘sorry’. They could hardly say anything else, could they?
Yes, they did this to themselves. Again, nothing unusual here, in that configuration changes are the most common cause of service faults. That’s why most communications service providers impose a configuration freeze over important periods, such as Black Friday in the US or the New Year holiday period. That’s why such freeze periods are typically the most stable of the entire year! But in Facebook’s case, whatever pre-deployment tests they performed, if indeed they did any at all, failed to identify the risk in the change process. I guess the engineering team was still applying Mark Zuckerberg’s operational mantra of moving fast and breaking things, but doing so with a little too much zeal.
And they “…are working to understand more about what happened today so we can continue to make our infrastructure more resilient.” No details.
I must admit this report is a state-of-the-art example of a vacuous statement that takes four paragraphs to be comprehensively uninformative.
Then there was this from an NBC News report:
It seems sad that this NBC report was far more informative than the corporate blather that Facebook posted as their statement from engineering.
What really happened?
For this I had to turn to Cloudflare!
They published an informative post using only a view from the ‘outside’. Cloudflare explained that Facebook managed to withdraw the BGP routes to the authoritative name servers for the facebook.com domain. In the DNS, this would normally not be a problem as long as the interruption to the authoritative servers is relatively short. All DNS information is cached in recursive resolvers, including name server information, and if the DNS cache time to live (TTL) is long (and by ‘long’ I mean a day or longer) then it’s likely that only a small proportion of recursive resolvers would see their cached values expire during a brief (seconds-long) outage, and any user who used multiple diverse recursive resolvers would not notice the interruption at all. After all, the Facebook domain names are widely used (remember those three billion Facebook users?), so they are probably very widely cached names.
At this point, the second factor in this outage kicks in. Facebook uses short TTLs, so the effect of withdrawing reachability to their authoritative name servers was felt almost immediately. As the locally cached entries timed out, recursive resolvers tried to refresh them, but the authoritative servers were uncontactable, so the name simply disappeared from the Internet’s recursive resolvers.
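As an aside, the TTL choice isn’t hidden: anyone can ask a resolver how long it’s allowed to cache a zone’s name server records. Here’s a minimal sketch, assuming the third-party dnspython library is installed (the library choice and the zone name are purely illustrative, not anything Facebook prescribes):

```python
# Minimal sketch (assumes dnspython: pip install dnspython).
# Ask a recursive resolver for a zone's NS records and report the TTL,
# that is, how long those records may sit in a cache before being re-queried.
import dns.resolver

def ns_cache_lifetime(zone: str) -> None:
    answer = dns.resolver.resolve(zone, "NS")
    print(f"{zone}: NS records cacheable for {answer.rrset.ttl} seconds")
    for record in answer:
        print("  name server:", record.target)

ns_cache_lifetime("facebook.com")
```

Note that a recursive resolver reports the time remaining in its own cache, so if you want the zone’s configured TTL you’d direct the query at one of the authoritative servers instead.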
But this form of disappearance from the DNS is one that raises the ire of the DNS gods. In this situation, where the name servers have all gone offline, the result of a query is not an NXDOMAIN response code (‘I’m sorry but that name does not exist in the DNS, go away!’) but something far more indeterminate: a query that simply times out with no response whatsoever.
A recursive resolver will retry the query using all the name server IP addresses stored in the parent zone (.com in this case), and then return the SERVFAIL response code (which means something like: ‘I couldn’t resolve this name, but maybe it’s me, so you might want to try other resolvers before giving up!’). So, the client’s stub resolver then asks the same question to all the other recursive resolvers that it has been configured with. As the Cloudflare post points out: “So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms.”
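The difference between these two failure modes is visible to any client. The following sketch (dnspython again, with the query names and the five-second timeout chosen purely for illustration) shows why: a definitive NXDOMAIN can be cached and acted on immediately, while SERVFAIL or silence leaves a stub resolver with nothing better to do than ask someone else:

```python
# Sketch of how a client sees the two failure modes (assumes dnspython).
import dns.exception
import dns.resolver

def classify(name: str) -> str:
    try:
        dns.resolver.resolve(name, "A", lifetime=5)  # give up after 5 seconds
        return "resolved"
    except dns.resolver.NXDOMAIN:
        # A definitive, cacheable answer: the name does not exist. No retry needed.
        return "NXDOMAIN"
    except (dns.resolver.NoNameservers, dns.exception.Timeout):
        # SERVFAIL or silence: no definitive answer, so a stub will typically
        # re-ask every other recursive resolver it has been configured with,
        # multiplying the query load at exactly the wrong moment.
        return "no usable answer, try elsewhere"

for name in ("www.example.com", "name-that-does-not-exist.example", "facebook.com"):
    print(name, "->", classify(name))
```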
Then the third factor kicked in. Once the domain name facebook.com and all the names in this space effectively disappeared from the Internet, their own internal command and control tools appeared to have disappeared as well. This then impacted the ability of their various data centres to exchange traffic, which further worsened the problem. As Facebook’s note admitted, the outage “… impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.”
Other reports on Twitter were more fanciful, including a report that the Facebook office buildings defaulted to a locked mode, preventing staff from entering the facilities, presumably to work on the outage!
What can we learn from this outage?
There are numerous lessons to learn from this outage, so let’s look at a few:
- Rehearse every config change, and always have a back-out plan. Need I say more?
- TTL values in the DNS. If you want to use short TTLs on your DNS records then tread very carefully, because DNS caching will not be there to help you get out of a jam. Overall, DNS caching is what makes the Internet work so efficiently, and reducing cache lifetimes pushes you ever closer to the edge of disaster!
- Don’t put all your DNS eggs in one basket. The DNS will cooperate quite happily with diverse redundancy, but you need to set this up to make it work for you. So don’t place all your authoritative DNS name servers in a single Autonomous System (AS) in BGP. That’s just asking for trouble. (A rough first step for checking this is sketched just after this list.)
- It’s always good to split out your command and control plane from the production service. That way you will always be able to access your service elements even in the event of a service failure. This means separate infrastructure, separate routes, and separate DNS domains. Separate everything. Facebook is big. I’m sure they can afford it!
- Moving fast and breaking things only ends up breaking things. At some point users lose patience, and once the users have deserted you, then you can move as fast as you like and break as much as you want. Without any customers it won’t matter anymore!
- If you must make a design decision between speed and resiliency, then think very carefully about what risk you are willing to take and about the downside of failure. In this case, Facebook’s failure briefly wiped some 6% off the market capitalization of the company (oh yes, the testimony in front of the US Senate Commerce Subcommittee on Consumer Protection didn’t help either). But at some point there is an engineering tradeoff between the cost of additional resiliency measures and the cost of a failure at a critical single point of vulnerability cascading through the entire service. Now, Facebook may have deliberately chosen a high-risk profile, but is that a risk profile necessarily shared by its consumers, advertisers, and stockholders?
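On the ‘all eggs in one basket’ point, the first half of a diversity check is easy to script. This is only a sketch (dnspython assumed, and the zone name is just an example): it collects the addresses of a zone’s authoritative name servers; mapping each address to its origin AS then needs an external data source, such as a routing registry or looking-glass service, and the check passes only if more than one AS appears.

```python
# Rough sketch of the first step of an AS-diversity check (assumes dnspython).
# Collects the IPv4 addresses of a zone's authoritative name servers.
import dns.resolver

def nameserver_addresses(zone: str) -> dict[str, list[str]]:
    addresses = {}
    for ns in dns.resolver.resolve(zone, "NS"):
        host = str(ns.target)
        addresses[host] = [a.to_text() for a in dns.resolver.resolve(host, "A")]
    return addresses

for server, addrs in nameserver_addresses("facebook.com").items():
    # Next step (not shown here): map each address to its origin AS and
    # confirm that the resulting set of ASes has more than one member.
    print(server, addrs)
```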
If you want your customers, your investors, your regulators, and the broader community to have confidence in you, and to have some assurance that you are doing an effective job, then you need to be open and honest about what you are doing and why. The entire structure of public corporate entities was intended to reinforce that assurance by insisting on full and frank disclosure of the corporation’s actions.
I could be surprised if, in the coming days, Facebook released a more comprehensive analysis of the outage, including a root cause analysis and the factors that led to the cascading failure. It could explain why efforts to rectify the immediate failure took an incredibly long six hours. It could describe the measures they took to restore their service and their longer-term actions to avoid similar failure scenarios in the future. It could detail the risk profile that guides their engineering design decisions and how this affects service resilience. And more.
Yes, I could be surprised if this were to happen.
But, between you and me, I have no such expectations. And, I suspect, neither do you!
Watch Geoff’s presentation from APNIC 52, Learning from our mistakes:
Facebook has posted a follow-up; you can read it at https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/
As Santosh points out in his follow-up: “Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one. After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.” I suppose I am saying that the “we” who are trying to make our digital environment more resilient is more than just Facebook: it’s all of us. We need to look at other industries, such as the aviation and nuclear power industries, which have gone through a sometimes painful process of understanding that such post-event analysis can allow the broader industry to learn from these incidents. I believe that the Internet is now so pervasive, and our reliance on it so critical, that we are talking about topics that can legitimately be seen as matters of public safety and security.