24/7 uptime and the 60-second SLA

What service-level agreement (SLA) do you offer your clients? Your end-users? 99%? 99.9%? 99.99%? Do you imagine that any Internet user has ever thought: “Oh, the service is down, but my SLA hasn’t been breached yet, so that’s okay?” I didn’t think so.

I work as the Network Operations Manager for HEAnet, Ireland’s National Research & Education Network (NREN), and when I joined the company in 2001 we could, if needed, take network links down for over an hour during the working day. We could run potentially invasive research projects on the live network. We absolutely cannot do that anymore.

It’s a cliché to say that the Internet has changed, but these days most users just expect it to work, always and forever. This means that service providers, be they commercial or NRENs, work to an effective 60-second SLA. This is, roughly, the time between the user going offline and their deciding that the fault lies with the service provider.
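To put the gap between a contractual SLA and that effective 60 seconds in numbers, here is a small back-of-the-envelope sketch. The function name is made up for illustration, and it assumes availability is measured over a plain 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year that an availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.1f} minutes of downtime per year")
```

That works out to roughly 3.65 days at 99%, just under 9 hours at 99.9%, and about 53 minutes at 99.99%. Even the strictest of the three allows far more downtime on paper than a user will tolerate before blaming the provider.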

We network operators also need to remember that the ‘always on’ expectation is 24/7. Even if users are asleep in one timezone, they may well be up, awake, and desperate to use a service or consume content somewhere else, still relying on part of your service.

So, the big question becomes, how do we cope with a 60-second SLA? The answer here is easy: we don’t. We can’t. Systems fail, mistakes are made, and users want, in general, to pay as little as possible for the best possible service.

But what we can do is fake it. We can improve our chances of avoiding outages in all sorts of ways and, with great communication and processes, we can stretch time when the inevitable does happen.

Many of us are already implementing a lot of what I’ll be discussing, and that’s great! However, there are always things operators miss.

Communication, resilience, and automation

The core point in all of this is honest communication. This includes internal communication between your technical and non-technical teams, and external communication between you and your clients/users. If you take one thing away from this blog post, let it be that communication is the single most important thing any of us can do and it’s almost always possible to communicate more.

Resilience is the next most important part of all of this. If you have two links, that hugely reduces the chance of a failure. But do you also have two provisioning systems, two monitoring systems, and so forth? Are your communication tools resilient? Have you tested your resilience? Recently? Have changes been made since the last time you tested it all?
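A rough way to see why a second link helps so much is the sketch below. It assumes the links fail independently, and the 99.9% per-link figure is purely illustrative. Real links often share ducts, power, or provisioning systems, which is exactly why the questions above matter.

```python
def combined_availability(link_availability: float, links: int) -> float:
    """Probability that at least one of `links` independent links is up."""
    all_down = (1 - link_availability) ** links
    return 1 - all_down


single = 0.999  # assumed availability of one link (illustrative only)
for count in (1, 2):
    print(f"{count} link(s): {combined_availability(single, count):.6f}")
```

Under the independence assumption, two 99.9% links give about 99.9999% combined. Any shared dependency pulls the real figure back towards that of a single link.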

Now, resilience isn’t cheap. It’s extremely rare that any home user, or even most small or medium corporate clients, will pay for a second connection, and you can’t give it away for free. But there should be plenty of risk analysis mixed in with the purely financial analysis.

Automation is also important to consider. Humans make a lot of mistakes and I’ve never known a computer to typo a line of configuration. In HEAnet, we’ve been using automatic configuration systems for over a decade and there’s huge confidence that comes from knowing the software has done the same thing right a thousand times already. I’m certainly more confident in the system than I am in any human ability to get anything so consistently correct.

Automation also enables automatic system restarts and all sorts of other fun things that can fix a problem before anyone even knows there is one.
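As a minimal sketch of the idea, and not a description of any particular tool, an automatic restart loop might look like this. The `is_healthy` and `restart` callables are hypothetical hooks that you would wire to your own health check and service manager.

```python
import time
from typing import Callable


def restart_until_healthy(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 5.0,
) -> bool:
    """Restart a service until its health check passes.

    Returns True once the service is healthy, or False if it is still
    failing after max_attempts restarts and a human needs to step in.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        restart()
        time.sleep(wait_seconds)  # give the service time to come back up
    return is_healthy()
```

The bounded retry count is deliberate: when the function returns False, the automation has run out of ideas, and that is the moment to alert people rather than keep restarting quietly.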

Version control and configuration databases are closely associated with automation. Remember, the network should not be the authoritative source of configuration/information. Keep that somewhere else and upload/apply as necessary. This doesn’t just mean a copy in RANCID. It means every line of config should be written somewhere else (git or a similar system is my current favourite) and applied, with changes made to the repository, not straight on the router.
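One simple way to check that the repository really is authoritative is to compare what it says against what the device is running. The sketch below uses Python's standard `difflib` module. The configuration lines are invented for illustration, and fetching the running config from a router is left out.

```python
import difflib


def config_drift(intended: str, running: str) -> list[str]:
    """Return a unified diff of the running config against the repository copy.

    An empty list means the device matches the repository.
    """
    return list(
        difflib.unified_diff(
            intended.splitlines(),
            running.splitlines(),
            fromfile="repository",
            tofile="router",
            lineterm="",
        )
    )


# Hypothetical configs: someone fixed a description directly on the router.
intended = "hostname edge1\ninterface eth0\n description uplink to core\n"
running = "hostname edge1\ninterface eth0\n description uplink to core (temp)\n"

for line in config_drift(intended, running):
    print(line)
```

Any output at all means a change was made on the device rather than in the repository, which is the situation the process is meant to prevent.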

A video of Brian’s presentation on this topic at RIPE 74 is also available.

Good, clear processes are vital

Humans are still necessary, of course, and those humans should follow processes. Too many engineers seem to have allergic reactions to change management and change control, but most of them eventually embrace these things.

Good processes create a framework in which good engineers thrive. Configuration and code are reviewed and improved. Changes are planned, rather than randomly rolled out. Good processes also create a safety net, vital for the maintenance of a blame-free environment, which is, in turn, vital for the improvement of those processes.

Good and clear incident management is also vital here. When (not if) something happens, the relevant engineers and communications staff need to know how to find out what’s wrong, what to do and whom to tell. Engineers deciding that something is an ‘easy five-minute fix’ is one of the surest ways of making sure communication doesn’t happen and users get frustrated.

Make sure every incident process emphasizes alerting others to the issue. This gives others the opportunity to help and to communicate with those affected. No engineer should ever be the only one who knows about an issue. No engineer (or manager) should ever be afraid to ask for help.

This all comes back to honest communication again. Raising an issue and calling things out is vital. Telling your users “we know there is a problem, we’re working on it” can often buy you time.

Setting, and meeting, expectations on updates is similarly vital. If you’ve ever wondered why users are still complaining, despite all the hard work you’re doing, it’s because that work is not being properly communicated. If someone doesn’t get an update they will automatically assume you aren’t doing anything. This is hugely unfair, but it’s also utterly human. So even if there’s nothing substantial to report, it’s still important to set and meet expectations.

Users will also be more lenient in their reactions if they feel that real efforts are being made and that the service, overall, is improving. The best way to lose clients is to have the same issue happen over and over again.

Users are never going to be tolerant of service outages — that’s a fact of life. However, with communication, resilience, communication, automation, communication, good processes, and communication, 60 seconds can be stretched out for quite some time!

Brian Nisbet is Network Operations Manager at HEAnet, the Irish National Research & Education Network. He is also Co-Chair of the RIPE Anti-Abuse Working Group and an active member of a number of operator communities.
