The Domain Name System (DNS) implements the essential service of translating user-friendly domain names to IP addresses, thereby enabling users to connect to online services easily.
The scale and complexity of the DNS make its management difficult, and consequently, zone file errors that lead to performance or connectivity issues are widespread in practice. Given the DNS’s critical role, any errors in zone files can have highly disruptive effects on related services. For example, Microsoft recently experienced a severe global outage impacting all Azure customers for two hours due to a DNS misconfiguration. Other major DNS-related outages include those at GitHub, LinkedIn, iFastNet and HBO.
DNS engineers today use a mixture of techniques such as monitoring, testing, linting, and manual review to maintain their zone files. However, it is easy for errors to slip through with these approaches, as was the case in the Azure outage, where a simple one-line error during a server migration caused inconsistency among zone file replicas and took down critical Azure services.
With testing, it is not feasible to cover all possible input queries, as there are tens of billions of them, so some critical queries can be missed. Monitoring and testing can only catch bugs after the zone files are deployed into the live system. Manual inspection of zone files may also not be feasible if the zone files use CNAMEs, DNAMEs, and wildcards, as it becomes harder for humans to reason about their subtle interactions.
To help DNS engineers prevent DNS-related outages, we developed GRᴏᴏᴛ, which can validate properties of interest for all possible DNS queries or provide a counterexample. DNS engineers can use GRᴏᴏᴛ before deploying zone files to catch any bugs in them.
Some exciting property checks
Using GRᴏᴏᴛ, DNS engineers can proactively identify many DNS zone file errors, for example:
- Rewrite Loop: “Is there a query under our domain that is rewritten in a loop q1 → q2 → q3 → … → q1?”
- Non-Existent Domain for Service: “Is there some execution sequence that will eventually return an NXDOMAIN answer for a known service (for example, server.campus.edu)?”
- Rewrite Blackholing: “Does a query exist under our domain that is eventually rewritten to some other domain name which does not exist, and DNS returns NXDOMAIN?”
- Outside Name Server: “Is there a query under our domain (for example, campus.edu) that is sent for resolution to a name server that is not under our domain?”
- Outside Rewrite: “Is there a query under our domain (for example, campus.edu) that is eventually rewritten to a query that does not end with our domain?”
- Number of Rewrites: “Is there a query under our domain that is rewritten more than twice (or n times)?”
GRᴏᴏᴛ can also check for many other common bugs like delegation inconsistencies, lame delegations, and missing glue records.
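To make a few of these checks concrete, here is a minimal sketch of the rewrite loop, number of rewrites, and rewrite blackholing questions. It is not how GRᴏᴏᴛ implements them: it assumes a heavily simplified model in which each name has at most one CNAME-style rewrite target, and it ignores wildcards, DNAMEs, and delegation. The names `RewriteMap`, `FollowRewrites`, `ExceedsRewriteLimit`, and `IsBlackholed` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical simplified model (an assumption, not GRoot's data structure):
// each name has at most one rewrite target, as with an exact-match CNAME.
using RewriteMap = std::map<std::string, std::string>;

struct ChainResult {
    int rewrites;      // number of rewrites followed, including one that closes a loop
    bool loop;         // true if the chain returned to a name it had already visited
    std::string last;  // last name reached by the chain
};

// Follows the rewrite chain starting at `query` until it ends or revisits a name.
ChainResult FollowRewrites(const RewriteMap& rewrites, const std::string& query) {
    std::set<std::string> seen{query};
    std::string current = query;
    int count = 0;
    while (true) {
        auto it = rewrites.find(current);
        if (it == rewrites.end()) return {count, false, current};
        current = it->second;
        ++count;
        // insert() reports false when the name was already visited: a loop.
        if (!seen.insert(current).second) return {count, true, current};
    }
}

// "Number of Rewrites": is the query rewritten more than `limit` times?
bool ExceedsRewriteLimit(const RewriteMap& rewrites, const std::string& query, int limit) {
    assert(limit >= 0);
    return FollowRewrites(rewrites, query).rewrites > limit;
}

// "Rewrite Blackholing": the query is rewritten at least once and the chain
// ends at a name that is absent from `existing` (the names that have records).
bool IsBlackholed(const RewriteMap& rewrites, const std::set<std::string>& existing,
                  const std::string& query) {
    const ChainResult r = FollowRewrites(rewrites, query);
    return r.rewrites > 0 && !r.loop && existing.count(r.last) == 0;
}
```

A real checker also has to account for wildcard matches, DNAME subtree rewrites, and referrals between name servers, which is where reasoning by hand becomes error-prone.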
109 new bugs revealed in 10 seconds
When we applied GRᴏᴏᴛ to the zone files from a large university campus network with over one hundred thousand records, it revealed 109 new bugs in under 10 seconds.
GRᴏᴏᴛ identified bugs in the zone files ranging from delegation inconsistencies to lame delegations to rewrite loops and others.
When applied to internal zone files consisting of over 3.5 million records from a large infrastructure service provider, GRᴏᴏᴛ revealed around 160k issues of blackholing, which initiated a cleanup of the zone files.
Extending GRᴏᴏᴛ with custom properties
GRᴏᴏᴛ is implemented in C++ and takes as input a directory containing a collection of zone files and an optional file specifying which properties (for example, no rewrite blackholing) to check. In the absence of this properties file, GRᴏᴏᴛ checks for a set of bugs that are always considered harmful (for example, loops). Users implement new properties in GRᴏᴏᴛ as simple C++ functions, and we provide different APIs to make it easier.
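As a rough illustration of what "a property as a simple C++ function" can look like, the sketch below expresses two of the properties above as predicates and searches for a counterexample. It does not use GRᴏᴏᴛ's actual API, which is described in the GitHub repository. Every type and function name here (`ResolutionPath`, `Property`, `EndsWithDomain`, `NoOutsideRewrite`, `NoRewriteBlackholing`, `FindCounterexample`) is a hypothetical stand-in invented for this example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical summary of how one representative query resolves.
// This is an assumption for illustration, not a GRoot type.
struct ResolutionPath {
    std::string query;                  // the original query name
    std::vector<std::string> rewrites;  // names the query is rewritten to, in order
    bool nxdomain;                      // true if resolution ends in NXDOMAIN
};

// A property is a predicate that must hold for every resolution path.
using Property = std::function<bool(const ResolutionPath&)>;

// True if `name` equals `domain` or is a subdomain of it (label-aligned suffix).
bool EndsWithDomain(const std::string& name, const std::string& domain) {
    assert(!domain.empty());
    if (name == domain) return true;
    const std::string suffix = "." + domain;
    return name.size() > suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// "Outside Rewrite": no rewrite target may leave `domain`.
Property NoOutsideRewrite(const std::string& domain) {
    return [domain](const ResolutionPath& path) {
        for (const std::string& name : path.rewrites) {
            if (!EndsWithDomain(name, domain)) return false;
        }
        return true;
    };
}

// "Rewrite Blackholing": a rewritten query must not end in NXDOMAIN.
bool NoRewriteBlackholing(const ResolutionPath& path) {
    return !(path.nxdomain && !path.rewrites.empty());
}

// Returns a pointer to the first path that violates `property`,
// or nullptr if the property holds for every path.
const ResolutionPath* FindCounterexample(const std::vector<ResolutionPath>& paths,
                                         const Property& property) {
    for (const ResolutionPath& path : paths) {
        if (!property(path)) return &path;
    }
    return nullptr;
}
```

Returning the violating path rather than a plain boolean mirrors the "validate or provide a counterexample" behaviour described earlier: the engineer sees which query breaks the property, not only that one exists.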
As it is infeasible to check each possible query’s behaviour explicitly, we first note that the number of distinct behaviours is much smaller and is a function of the DNS zone files.
Based on this insight, GRᴏᴏᴛ first performs an analysis of the zone files to partition all possible queries into equivalence classes (ECs), each of which captures a distinct behaviour. This partition’s key property is that two queries in the same EC resolve to the same set of possible answers. GRᴏᴏᴛ uses these equivalence classes to efficiently validate whether properties hold. For more details about GRᴏᴏᴛ internals, refer to our paper or watch a presentation we gave at SIGCOMM 2020.
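The intuition behind the partition can be shown with a toy approximation. It is not GRᴏᴏᴛ's algorithm, which is described in the paper. The sketch assumes that a query's behaviour is determined only by the longest name in the zone that equals the query or is an ancestor of it, together with whether the match is exact. The functions `ClosestEncloser`, `EquivalenceClassKey`, and `Partition` are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Returns the longest name in `zone_names` that equals `query` or is an
// ancestor of it, or "" (standing in for the root) if there is none.
std::string ClosestEncloser(const std::set<std::string>& zone_names, std::string query) {
    while (!query.empty()) {
        if (zone_names.count(query) > 0) return query;
        const std::string::size_type dot = query.find('.');
        if (dot == std::string::npos) break;
        query.erase(0, dot + 1);  // drop the leftmost label and retry
    }
    return "";
}

// Toy class key: "=" + name for an exact match, "<" + name when the query
// lies strictly below its closest enclosing name.
std::string EquivalenceClassKey(const std::set<std::string>& zone_names,
                                const std::string& query) {
    assert(!query.empty());
    const std::string encloser = ClosestEncloser(zone_names, query);
    return std::string(encloser == query ? "=" : "<") + encloser;
}

// Groups queries by class key, so one representative per group can be checked.
std::map<std::string, std::vector<std::string>> Partition(
        const std::set<std::string>& zone_names,
        const std::vector<std::string>& queries) {
    std::map<std::string, std::vector<std::string>> classes;
    for (const std::string& query : queries) {
        classes[EquivalenceClassKey(zone_names, query)].push_back(query);
    }
    return classes;
}
```

In this toy model the number of classes is bounded by roughly twice the number of names in the zone, regardless of how many queries are possible, which is the sense in which the number of distinct behaviours is a function of the zone files rather than of the query space.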
GRᴏᴏᴛ is available on GitHub and can be easily installed using our Docker container. To get started with GRᴏᴏᴛ, refer to the documentation and an example provided in the GitHub repository.
Siva Kesava is a PhD candidate at the Department of Computer Science, UCLA.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.