Network engineers have multiple roles: they can be network operators, designers, architects, peering coordinators, and network tools/automation engineers, just to name a few.
As technology becomes more advanced and complicated, so too do network designs, which is part of the reason more networks are adopting automation and programmability features. Some fear that these features are taking over the traditional role of the network engineer and will ultimately put them out of a job; others, including myself, see them as a natural evolution of the job.
What automation is Cloudflare using?
Cloudflare has a large distributed anycast network. We have more than:
- 150 points of presence (PoPs) globally
- 500 transits/peering exchange ports
- 500 network equipment components
- 20,000 eBGP (external BGP) sessions
It is impossible to manage a network this large without automation. We have been using automation tools, including Salt and NAPALM, for several years now.
Network Automation with Salt and NAPALM
Salt is an open source, general-purpose automation tool that is used for managing systems and devices. It includes a variety of features out of the box, such as a REST API, real-time jobs, high availability, native encryption, the ability to use external data even at runtime, job scheduling, selective caching, and many others. Beyond these capabilities, Salt is perhaps the most scalable framework: there are well-known deployments at large companies that manage tens of thousands of devices with it.
The vendor-agnostic capabilities of Salt are leveraged through a third-party library called NAPALM (Network Automation and Programmability Abstraction Layer with Multivendor support), a community-maintained network automation platform.
Figure 1 — The roles of Salt and NAPALM (source).
Watch the following video from APRICOT 2017 to learn more about how Cloudflare uses Salt and NAPALM as part of its network automation.
One example of the way we have implemented automation at Cloudflare is to monitor and resolve stability issues of transit providers globally.
The system is designed to manipulate production traffic automatically. It can drop traffic via affected transit providers, drop all traffic in an affected PoP, and resume production traffic according to the severity threshold and time frame that have been set, all without an engineer’s intervention.
Figure 2 — Automation dashboard showing the percentage of PoP to PoP packet loss for a good day (above) and a bad day (below).
The figure above shows the percentage of PoP to PoP packet loss on a good day and on a bad day. On the bad day, manually troubleshooting, identifying, and then fixing such a large number of affected PoPs and transits would take a lot of time, and time is in short supply when it comes to keeping a network functioning optimally.
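The threshold-and-time-frame decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Cloudflare's actual system: the `Policy` fields, their default values, the action names, and the rule that a PoP is treated as affected once half of its transits are degraded are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Policy:
    # All three values are hypothetical; real thresholds are not given in the article.
    loss_threshold_pct: float = 5.0  # severity threshold: packet loss in percent
    min_bad_samples: int = 3         # time frame: consecutive samples above threshold
    pop_fraction: float = 0.5        # share of degraded transits that marks the whole PoP


def is_degraded(loss_samples: Sequence[float], policy: Policy) -> bool:
    """True when the most recent `min_bad_samples` samples all exceed the threshold."""
    recent = list(loss_samples)[-policy.min_bad_samples:]
    return (
        len(recent) == policy.min_bad_samples
        and all(sample > policy.loss_threshold_pct for sample in recent)
    )


def decide(pop_loss: Mapping[str, Sequence[float]], policy: Policy = Policy()) -> dict:
    """Pick an action for one PoP from per-transit packet loss samples."""
    degraded = sorted(
        transit for transit, samples in pop_loss.items()
        if is_degraded(samples, policy)
    )
    if not degraded:
        # Nothing is (or is any longer) above threshold: keep or resume production traffic.
        return {"action": "serve_normally", "transits": []}
    if len(degraded) / len(pop_loss) >= policy.pop_fraction:
        # Enough transits are affected to treat the whole PoP as affected.
        return {"action": "drain_pop", "transits": degraded}
    return {"action": "drain_transits", "transits": degraded}
```

Calling `decide({"transit_a": [9, 12, 20], "transit_b": [0, 0, 0], "transit_c": [0, 1, 0]})` returns an action that affects only `transit_a`, because a single short spike is ignored and only sustained loss on one of three transits crosses the threshold. Requiring several consecutive bad samples is one simple way to avoid flapping traffic on transient loss.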
Another example of how we are using automation at Cloudflare is for serving anycast.
Cloudflare advertises the same set of prefixes globally from all PoPs. Because of this, we need to be careful when initiating a global configuration push, such as when we deploy the public resolver 1.1.1.1 on our edges or introduce new anycast prefixes.
We use Salt templating with Jinja2 to do a global push to all edge routers. It supports a dry run as well: the proposed configuration changes are first checked to confirm they are good to go before the real changes are pushed, reducing the risk of human error.
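To make the dry-run idea concrete, here is a minimal Python sketch of the render, diff, then push workflow. It does not use Salt or NAPALM: `string.Template` from the standard library stands in for Jinja2, the configuration syntax and the `ANYCAST` prefix-list name are illustrative, the prefixes are documentation ranges, and `render` and `push` are hypothetical helpers rather than any real API.

```python
import difflib
from string import Template

# Stand-in for a Jinja2 template; the config line format is illustrative only.
PREFIX_LINE = Template("set policy-options prefix-list ANYCAST $prefix\n")


def render(prefixes):
    """Render the candidate configuration from a list of prefixes."""
    return "".join(PREFIX_LINE.substitute(prefix=p) for p in prefixes)


def push(running, candidate, dry_run=True):
    """Return (resulting_config, diff). With dry_run=True nothing is applied."""
    diff = "".join(
        difflib.unified_diff(
            running.splitlines(keepends=True),
            candidate.splitlines(keepends=True),
            fromfile="running",
            tofile="candidate",
        )
    )
    if dry_run or not diff:
        # Dry run, or no change needed: leave the running config untouched.
        return running, diff
    return candidate, diff


if __name__ == "__main__":
    running = render(["192.0.2.0/24"])
    candidate = render(["192.0.2.0/24", "198.51.100.0/24"])
    _, diff = push(running, candidate, dry_run=True)
    print(diff)  # review the proposed change before pushing for real
```

Defaulting `dry_run` to `True` means an engineer has to opt in to the real push after reviewing the diff, which is the safer choice when the same change goes to every edge router.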
Get with the times
In this advanced world of network operations, engineers need to have some level of understanding of how to code. If they don’t, they need to start now. It will require some re-skilling; however, much of it will leverage existing knowledge and scripting skills.
The bottom line is that automation is already an integral part of network engineering. No matter what the task (big, small, simple, or complex), automation can and should play a role as a way to optimize not only the way we manage our networks but also the way we manage our work time.
Watch Jimmy’s presentation on how Cloudflare are using automation at IDNOG 5.
Jimmy Lim is a network engineer, managing Cloudflare’s global distributed network.