How to: Employing network validation, part 2

By Donatas Abraitis on 10 Jan 2022

Category: Tech matters



In my previous post, I discussed how the organization I work for recently started using network validation to control our ever-expanding core network.

To give you some numbers, we have nine data centres (DCs) all over the globe with more coming soon. Every DC is different in terms of size — it can span from a couple of racks to tens of racks. Automating how we maintain these helps us push changes to production a lot quicker.

In this post, I want to go into more detail on how we use Suzieq to validate key aspects of the network, as well as our experience evaluating Batfish.

Suzieq

Continuously running poller vs snapshot

One of the first decisions we had to make was whether to run the poller continuously as a service or in snapshot (run-once) mode.

Though it is arguably the more correct approach, a continuously running poller has a higher engineering cost no matter the tool: it has to be running all the time, which means it must be highly available and able to recover from failures.

Running the poller in snapshot mode is trivial from a maintainability perspective. It can be run independently in any environment, whether on a local machine (workstation) or in a continuous integration/continuous delivery (CI/CD) pipeline, without needing any long-running service. In our case, we poll the data once and then run Python tests. We use Jenkins for our CI/CD pipeline.

To ensure we run the same tests across all our DCs, we launch multiple Jenkins agents. If we had used a continuously running poller, the engineering cost to set it up and maintain it would have been higher.

Below is an example of running sq-poller in a loop for each DC or region.

for DC in "${DATACENTERS[@]}"
do
  python generate_hosts_for_suzieq.py --datacenter "$DC"
  ../bin/sq-poller --devices-file "hosts-$DC.yml" \
    --ignore-known-hosts \
    --run-once gather \
    --exclude-services devconfig
  ../bin/sq-poller --input-dir ./sqpoller-output
  python -m pytest -s -v --no-header "test_$DC.py" || exit 5
done

You might be asking whether this combination of commands is really necessary.

generate_hosts_for_suzieq.py serves as a wrapper that generates the hosts file from the Ansible inventory, but with more sugar inside, such as skipping specific hosts and setting ansible_host dynamically (our out-of-band (OOB) network is highly available, which means there are several doors through which to reach it).

The generated file looks similar to the following:

- namespace: xml
  hosts:
    - url: ssh://root@xml-oob.example.org:2232 keyfile=~/.ssh/id_rsa
    - url: ssh://root@xml-oob.example.org:2223 keyfile=~/.ssh/id_rsa
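A rough sketch of what such a wrapper could look like is below, assuming a simple per-DC YAML inventory; the skip list, fallback OOB endpoint, and file names are illustrative, not the actual script:

#!/usr/bin/env python3
# Hypothetical sketch of generate_hosts_for_suzieq.py: turn one Ansible
# inventory group into a Suzieq devices file like the example above.
import argparse
import yaml

SKIP_HOSTS = {"lab-switch-1"}  # hosts the poller should never touch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--datacenter", required=True)
    parser.add_argument("--inventory", default="ansible/hosts.yml")
    args = parser.parse_args()

    with open(args.inventory) as f:
        inventory = yaml.safe_load(f)

    hosts = []
    # Assume one inventory group per DC: inventory[<dc>]["hosts"] maps
    # hostnames to their host variables.
    for name, host_vars in inventory[args.datacenter]["hosts"].items():
        if name in SKIP_HOSTS:
            continue
        # ansible_host is set dynamically; the OOB network has several
        # "doors", so fall back to the DC-wide OOB endpoint if it is unset.
        host_vars = host_vars or {}
        target = host_vars.get("ansible_host", f"{args.datacenter}-oob.example.org")
        port = host_vars.get("ansible_port", 22)
        hosts.append({"url": f"ssh://root@{target}:{port} keyfile=~/.ssh/id_rsa"})

    with open(f"hosts-{args.datacenter}.yml", "w") as f:
        yaml.safe_dump([{"namespace": args.datacenter, "hosts": hosts}], f,
                       sort_keys=False)

if __name__ == "__main__":
    main()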

Why do we bundle --run-once gather with a second sq-poller run? There is already an open issue that will solve this problem. Eventually, it should only require a single --snapshot option, and that's it.

Workflow for validating changes

Every new Pull Request (PR) creates a fresh, clean Python virtual environment (Pyenv) and starts the tests. The same happens when the PR is merged.

The simplified workflow is as follows:

  1. Make the changes.
  2. Commit the changes and create the PR on GitHub.
  3. Poll and run Pytest tests with Suzieq (/tests/run-tests.sh <region|all>).
  4. Require the tests to be green before the PR is allowed to merge.
  5. Merge the PR.
  6. Iterate over all our DCs one by one: deploy, then run the post-deployment Pytest tests again. For example:
stage('Run pre-flight production tests') {
  when {
    expression {
      env.BRANCH_NAME != 'master' && !(env.DEPLOY_INFO ==~ /skip-suzieq/)
    }
  }
  parallel {
    stage('EU') {
      steps {
        sh './tests/prepare-tests-env.sh && ./tests/run-tests.sh ${EU_DC}'
      }
    }
    stage('Asia') {
      agent {
        label 'deploy-sg'
      }
      // steps analogous to the EU stage, run against the Asian DC
    }
  }
}

Handling false positives

Every test has the chance of a false positive, that is, the test reveals a problem that is not real. In our environment, false positives occur mostly due to timeouts, connection errors during the scraping phase (poller), or when bootstrapping a new device. In such cases, we rerun the tests until they are fixed (green in the Jenkins pipeline). But if we have a permanent failure (most likely a real issue), the PR does not get merged, and the changes are not deployed.

For false positives, we use the Git commit tag Deploy-Info: skip-suzieq to tell the Jenkins pipeline to skip the tests when we see this behaviour.

Adding new tests

We test new or modified tests locally before they land in the Git repository. Unless a test is really trivial, it needs to be exercised several times before we trust that it's useful. For example:

def bgp_sessions_are_up(self):
    # Test if all BGP sessions are UP
    assert (
        get_sqobject("bgp")().get(namespace=self.namespace, state="NotEstd").empty
    )

But if we are talking about something like the following, then this needs to be carefully reviewed.

def uniq_asn_per_fabric(self):
    # Test if we have a unique ASN per fabric
    asns = {}
    for spine in self.spines.keys():
        for asn in (
            get_sqobject("bgp")()
            .get(hostname=[spine], query_str="afi == 'ipv4' and safi == 'unicast'")
            .peerAsn
        ):
            if asn == 65030:
                continue
            if asn not in asns:
                asns[asn] = 1
            else:
                asns[asn] += 1
    assert len(asns) > 0
    for asn in asns:
        assert asns[asn] == len(self.spines.keys())

In this instance, we check that we have a unique Autonomous System (AS) number per DC. ASN 65030 is skipped because it is used by routing on the host instances to announce anycast services such as DNS and load balancers. This is a snippet of the test output (summary):

test_phx.py::test_bgp_sessions_are_up PASSED
test_phx.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_phx.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_phx.py::test_uniq_asn_per_fabric PASSED
test_phx.py::test_upstream_ports_are_in_correct_state PASSED
test_phx.py::test_evpn_fabric_links PASSED
test_phx.py::test_default_route_ipv4_from_upstreams PASSED
test_phx.py::test_ipv4_host_routes_received_from_hosts PASSED
test_phx.py::test_ipv6_host_routes_received_from_hosts PASSED
test_phx.py::test_evpn_fabric_bgp_sessions PASSED
test_phx.py::test_vlan100_assigned_interfaces PASSED
test_phx.py::test_evpn_fabric_arp PASSED
test_phx.py::test_no_failed_interface PASSED
test_phx.py::test_no_failed_bgp PASSED
test_phx.py::test_no_active_critical_alerts_firing PASSED
test_imm.py::test_bgp_sessions_are_up PASSED
test_imm.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_imm.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_imm.py::test_uniq_asn_per_fabric FAILED
test_imm.py::test_upstream_ports_are_in_correct_state PASSED
test_imm.py::test_default_route_ipv4_from_upstreams PASSED
test_imm.py::test_ipv4_host_routes_received_from_hosts PASSED
test_imm.py::test_ipv6_host_routes_received_from_hosts PASSED
test_imm.py::test_no_failed_bgp PASSED
test_imm.py::test_no_active_critical_alerts_firing PASSED

Here, we catch that the test_imm.py::test_uniq_asn_per_fabric test failed for this DC. Since we use auto-derived AS numbers per switch (no static AS numbers in the Ansible inventory), duplicate AS numbers can occasionally be generated, which is bad.
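As a purely hypothetical illustration (this is not our actual derivation scheme), any approach that squeezes per-switch identifiers into the small 2-byte private ASN range can collide, which is exactly what this test exists to catch:

import zlib

def derive_private_asn(hostname: str) -> int:
    # Illustrative only: map a hostname into the 2-byte private ASN range
    # 64512-65534. With just 1023 usable values, collisions between switches
    # are only a matter of time.
    return 64512 + zlib.crc32(hostname.encode()) % 1023

for switch in ("leaf-101", "leaf-102", "spine-1"):
    print(switch, derive_private_asn(switch))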

If we wanted to check that we don't have duplicate IPv6 loopback addresses per device within the same DC, the rule might look like the following:

def loopback_ipv6_is_uniq_per_device(self):
    # Test if we don't have duplicate IPv6 loopback address
    addresses = get_sqobject("address")().unique(
        namespace=[self.namespace],
        columns=["ip6AddressList"],
        count=True,
        type="loopback",
    )
    addresses = addresses[addresses.ip6AddressList != "::1/128"]
    assert (addresses.numRows == 1).all()

This rule is valid and has proven its worth at least a couple of times; duplicates mostly appear when we bootstrap a new switch and the Ansible host file is copy/pasted.

New tests are added mainly when a failure occurs and we need a way to catch it quickly or mitigate it in advance. For instance, if we switch from an L3-only to an EVPN design, we might be surprised when ARP/ND exhaustion hits, or when the number of L3 routes drops from several thousand to just a few.
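A guard against the second surprise could look like the following sketch. It is hypothetical and written in the same style as our other tests; the routes table and its columns follow Suzieq's schema, while the VRF filter and threshold are assumptions:

def ipv4_route_count_is_sane(self):
    # Hypothetical guard: make sure no device in this DC suddenly carries
    # only a handful of IPv4 routes in the default VRF.
    routes = get_sqobject("routes")().get(
        namespace=self.namespace, vrf=["default"]
    )
    ipv4_routes = routes[routes.prefix.str.contains(".", regex=False)]
    per_device = ipv4_routes.groupby("hostname").prefix.nunique()
    assert (per_device > 1000).all()  # threshold is illustrative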

Batfish

We evaluated Batfish twice. The first time was more of an overview or dry run to see what opportunities it offered us. Our first impression was something like 'What's wrong with my configuration?' because, at the time, Batfish didn't support some of the configuration syntax used by FRRouting (FRR).

FRR is used by Cumulus Linux and many other large projects; it has become the de facto open-source routing suite.

The second time we evaluated Batfish, we got a better perspective on how it allows operators to construct a model of the network by parsing configuration files. On top of that, you can create snapshots, make changes, and see how your network would behave. For example, you might disable a link or BGP peer and predict the impact before the change goes live.
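As a minimal pybatfish sketch (the network name, snapshot paths, and interface names are assumptions, not our actual setup), forking a snapshot with a link disabled and re-asking a question looks roughly like this:

from pybatfish.client.session import Session
from pybatfish.datamodel import Interface

bf = Session(host="localhost")  # where the Batfish service runs
bf.set_network("dc-example")
bf.init_snapshot("snapshots/baseline", name="baseline", overwrite=True)

# Fork the snapshot with one fabric link shut down ...
bf.fork_snapshot(
    "baseline",
    "spine1-swp3-down",
    deactivate_interfaces=[Interface(hostname="spine1", interface="swp3")],
    overwrite=True,
)

# ... and see which BGP sessions Batfish predicts will no longer establish.
sessions = bf.q.bgpSessionStatus().answer(snapshot="spine1-swp3-down").frame()
print(sessions[sessions.Established_Status != "ESTABLISHED"])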

We also started looking at Batfish as an open-source project we could push changes back to. This has included reporting missing behaviour modelling related to IPv6, which unfortunately is not yet well covered in Batfish's FRR model.

This is not the first time we’ve missed IPv6 support, and, we guess, not the last. We’re looking forward to Batfish getting IPv6 support soon.

Some best practice observations on testing

From our experience, we'd advise keeping tests segregated to avoid throwing spaghetti at the wall. To this end, write simple, understandable tests. If a single test checks two things that depend on each other, it's better to split them into separate tests.

Some tests can overlap, and if one fails, then the other fails too. But that’s good because two failed tests can say more than one, even if they test similar functionality.

To confirm that tests are useful, you have to run and use them daily. Otherwise, we don’t see much point in having them.

If you can anticipate what might happen in the future, covering it with tests is a good idea, unless the test would be too noisy.

As always, the Pareto Principle is the best answer to whether testing is worth it and how much needs to be covered by tests. If you cover at least the 20% that makes up the critical pieces, your network is most likely in good shape.

It's not worth automating and testing everything you can think of; that just adds stress for no reason. Think about the maintainability of those tests with your team, and then decide.

What makes us happy is that Suzieq is great by default, and there is no need to write very sophisticated tests in Python. The command-line interface is awesome and approachable even for beginners. If you need something exceptional, you are always welcome to write the logic in Python, which is also friendly. The Python API is built on the pandas library, so you can manipulate your network data as much as you want.
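For instance, assuming a populated Suzieq datastore and the import path used by recent Suzieq releases, the same data the CLI shows can be sliced with plain pandas:

from suzieq.sqobjects import get_sqobject

# Pull the BGP table for one DC and list any sessions that are not established.
bgp = get_sqobject("bgp")().get(namespace=["phx"])
broken = bgp[bgp.state == "NotEstd"][["hostname", "peer", "peerAsn", "state"]]
print(broken.to_string(index=False))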

We are still learning as it’s a never-ending process.

Adapted from the original post, which first appeared on the Hostinger Blog.

Donatas Abraitis is a systems engineer at Hostinger.


