I’ve spent a lot of time over the past few years thinking about, building proofs of concept with, and playing with various network monitoring technologies.
This two-part series collects my thoughts on the current state of monitoring as an industry, why we need to monitor, and best practices for monitoring more effectively within a budget.
Why do we need to monitor our networks?
This should be a simple question for most of us to answer but I want to outline some of the more high-level reasons so we can address them later on.
React to failure scenarios
The worst possible scenario for a network operator is to have a critical component fail and not know – an unknown. For us as operators to deliver reliable services, it’s critical we have complete insight into every possible dependent component in our network that delivers our services.
To ensure we scale and build our networks, platforms and applications accordingly, sufficient information is required to predict long and short-term trends.
With the right data and approach to analysis, many emergent behaviours can be derived to give new insight into how our customers are using our applications and networks.
Drive continual change
Network monitoring should play a key role in reacting to unexpected outages and ensuring they don’t reoccur. As business goals and priorities change, monitoring should be an important tool to ensure that applications and networks meet their goals and customer expectations.
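To make the point about predicting trends a little more concrete, here is a minimal sketch of a capacity forecast. It assumes you already collect one peak-utilization figure per day for a link, and that a straight-line fit is a good enough model, which it often won't be for seasonal or bursty traffic. The function names, the sample data and the 800 Mbps threshold are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(daily_peaks_mbps, threshold_mbps):
    """Estimate days from the last sample until the fitted trend line
    reaches threshold_mbps. Returns None if the trend is flat or falling."""
    days = list(range(len(daily_peaks_mbps)))
    slope, intercept = fit_line(days, daily_peaks_mbps)
    if slope <= 0:
        return None
    crossing_day = (threshold_mbps - intercept) / slope
    # Clamp at zero: the trend line may already be above the threshold.
    return max(0.0, crossing_day - days[-1])


# Hypothetical data: 30 days of peaks growing by 10 Mbps per day.
peaks = [500 + 10 * d for d in range(30)]
print(days_until_threshold(peaks, threshold_mbps=800))
```

Even a crude estimate like this turns raw graphs into something you can plan around, such as "this link needs an upgrade within the quarter".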
The current state of network monitoring
Small and medium ISPs and enterprises commonly use commercial out-of-the-box or entry-level monitoring software packages that promise high visibility of a wide range of applications, devices and operating systems. Some examples of these packages include PRTG, Zabbix, and SolarWinds. This choice is generally driven by limited time and resources: a software package that provides auto-discovery, ‘magic’ configuration and lower overheads is the path of least resistance.
Unfortunately, these types of packages tend to suffer from a lack of focus when it comes to matching visibility with organizational goals. By trying to target as many customers and software or hardware types as possible, these packages can generate noise and false positives, creating too many automatic graphs and alerts that aren’t focused on ensuring higher-level goals are met.
Meanwhile, Internet giants like Facebook, Google and Netflix are building and open-sourcing their own tools due to a need to collect, analyse and visualize data of unprecedented scale and complexity. A perfect example is the FBTracert utility that is designed to identify network loss across complex ECMP fabrics. Obviously, this isn’t an option for everyone, as these organizations are fortunate enough to be able to leverage the software development skills and resources already available to them in-house.
Other limitations that small to medium ISPs and enterprises face include:
Time and resources
For many small to medium ISPs and enterprises, time and resources (whether money or people) are limited, particularly when it comes to monitoring. Sometimes the minimum is enough. Other times, more effort and more customized software packages can result in a greater reward for the business and its operators.
Network device compute power
Until recently, limited on-box processing power has restricted the volume and resolution of metrics available to operators. It has constrained the features that vendors could implement (such as NetFlow versus sFlow) and the rate at which operators could poll metrics. Only in the last few years have we started to see devices with excess processing power and the functions that allow operators to use it.
Vendor feature support
We have recently seen vendors realise that visibility and insight are a huge area in which network devices have been lacking. Features such as streaming telemetry and detailed flow-based tracking are slowly becoming available, but only on expensive flagship products. New software always means bugs, and it’s my opinion that certain vendors prefer to use their customer base as bug reporters rather than do as much testing as is required. Also, vendors tend to implement features based on the market majority, of which smaller ISPs are usually not the target.
Previous generation devices
Unfortunately, not all of us have the budget to rip and replace our networks every two to three years to keep up with the product life cycle of the major vendors; hands up those of you who are using previous (and N-2/3/4) generation hardware? This limits the features we can deploy uniformly across our networks, which, unfortunately, means plenty of SNMP and expect scripts.
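To illustrate why polling rate matters when SNMP counters are your main data source, here is a rough sketch that turns two samples of an interface octet counter into a bits-per-second rate, then shows how a long polling interval averages away a short burst. It doesn't talk to any device: the traffic profile is made up, the function names are hypothetical, and the counter handling assumes at most one wrap between polls.

```python
def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average bits per second between two octet-counter samples.
    The modulo handles a single counter wrap (assumption: no more
    than one wrap per polling interval)."""
    delta = (curr_octets - prev_octets) % (2 ** counter_bits)
    return delta * 8 / interval_s


def polled_rates(per_second_bps, poll_interval_s):
    """Simulate polling a cumulative octet counter every poll_interval_s
    seconds and return the rate computed for each polling window."""
    counter = 0
    samples = [0]
    for second, bps in enumerate(per_second_bps, start=1):
        counter += bps // 8  # octets transferred during this second
        if second % poll_interval_s == 0:
            samples.append(counter)
    return [rate_bps(a, b, poll_interval_s)
            for a, b in zip(samples, samples[1:])]


# Made-up profile: 100 Mbps for five minutes with a 10-second burst to 900 Mbps.
traffic = [100_000_000] * 300
for s in range(150, 160):
    traffic[s] = 900_000_000

five_min = polled_rates(traffic, 300)
ten_sec = polled_rates(traffic, 10)
print(f"5-minute polling peak: {max(five_min) / 1e6:.1f} Mbps")
print(f"10-second polling peak: {max(ten_sec) / 1e6:.1f} Mbps")
```

With five-minute polling the burst barely moves the graph, while ten-second polling shows it clearly. That is the trade-off older devices force on us when they can't sustain frequent polls.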
In my next post in this series, I will discuss best practices for efficient network monitoring, including which metrics to monitor, choosing a suitable alert system, and designing for resilience.
This is an adapted version of a post that was published on Tim Raphael’s personal blog.
Tim Raphael is a Peering Engineer at Internet Association of Australia Inc.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC.