Network Validation Evolution at Hostinger [Part 2]
In our previous post, we discussed how Hostinger started validating network changes before they go live. By implementing network validation for our core network, we have maintained complete control over running the network at scale.
Among other things, that post summarized how we use Suzieq to validate key aspects of the network. This time, we'll go into more detail on how Hostinger uses Suzieq for network validation and give a closer look at Batfish.
To give you some numbers: we have 9 data centers (DCs) around the globe, with more coming soon. Each DC differs in size, spanning from a couple of racks to tens of racks. With automation on top, that difference hardly matters, other than in how quickly changes reach production. For the end customer, using services from a company that continuously performs network validation, and contributes back to the tools it uses, builds a foundation of trust in the reliability of Hostinger products.
Suzieq
Continuously Running Poller vs. Snapshot
One of the first decisions we had to make with any network validation tool was whether to run its poller in snapshot (run-once) mode or in continuously running mode.
A continuously running poller has a higher engineering cost, no matter the tool, though it is the correct approach. The poller has to run all the time and must be highly available, i.e. it has to recover from failures.
Running the poller in snapshot mode is trivial from a maintainability perspective. It can be run independently in any environment, whether on a local machine (workstation) or in CI/CD, without depending on any long-running service. In our case, we poll the data once and then run the Python tests. At Hostinger, we have deployments spread across several geographic regions (Asia, Europe, the US), with multiple DCs in each. We use Jenkins for our CI/CD pipeline, and to run the same tests across all regions, we launch multiple Jenkins slaves. A continuously running poller would have cost more to set up and maintain.
Here’s an example of running sq-poller (running in a loop for each DC or region):
for DC in "${DATACENTERS[@]}"
do
  python generate_hosts_for_suzieq.py --datacenter "$DC"
  ../bin/sq-poller --devices-file "hosts-$DC.yml" \
    --ignore-known-hosts \
    --run-once gather \
    --exclude-services devconfig
  ../bin/sq-poller --input-dir ./sqpoller-output
  python -m pytest -s -v --no-header "test_$DC.py" || exit 5
done
You might be asking why this combination of commands is necessary.
generate_hosts_for_suzieq.py is a wrapper that generates the hosts file from the Ansible inventory, with some extra sugar inside, like skipping specific hosts or setting ansible_host dynamically (our OOB network is highly available, which means there are several doors through which to reach it).
The generated file looks similar to:
- namespace: xml
  hosts:
    - url: ssh://root@xml-oob.example.org:2232 keyfile=~/.ssh/id_rsa
    - url: ssh://root@xml-oob.example.org:2223 keyfile=~/.ssh/id_rsa
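For illustration, here's a rough sketch of what such a wrapper could look like. The real generate_hosts_for_suzieq.py isn't published, so the inventory layout, the skip flag, and the OOB port list below are assumptions:

# Hypothetical sketch only, not the actual Hostinger script.
# It reads a per-DC YAML inventory and writes a Suzieq devices file,
# skipping flagged hosts and emitting one URL per OOB "door".
import argparse
import yaml

OOB_PORTS = [2232, 2223]  # assumed: one SSH port per OOB entry point

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--datacenter", required=True)
    parser.add_argument("--inventory", default="inventory.yml")
    args = parser.parse_args()

    with open(args.inventory) as f:
        inventory = yaml.safe_load(f)

    hosts = []
    for name, attrs in inventory[args.datacenter]["hosts"].items():
        if attrs.get("skip_suzieq"):  # assumed flag for hosts we never poll
            continue
        for port in OOB_PORTS:  # several doors into the OOB network
            hosts.append(
                {"url": f"ssh://root@{attrs['oob_host']}:{port} keyfile=~/.ssh/id_rsa"}
            )

    with open(f"hosts-{args.datacenter}.yml", "w") as f:
        yaml.safe_dump([{"namespace": args.datacenter, "hosts": hosts}],
                       f, default_flow_style=False, sort_keys=False)

if __name__ == "__main__":
    main()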
Why bundle run-once and sq-poller? There is already an open issue that should solve this problem. Eventually, it will just require adding a single --snapshot option, and that's it.
Workflow for Validating Changes
Every new pull request (PR) creates a fresh, clean Python virtual environment (Pyenv) and starts the tests. The same happens when a PR is merged.
The simplified workflow looks like this:
- Make changes.
- Commit changes, create a PR on GitHub.
- Poll and run PyTest tests with Suzieq (/tests/run-tests.sh <region|all>).
- We require tests to be green before a PR is allowed to merge.
- Merge the PR.
- Iterate over all our DCs one by one: deploy, then run the post-deployment PyTest suite again.
Something like:
stage('Run pre-flight production tests') {
    when {
        expression { env.BRANCH_NAME != 'master' && !(env.DEPLOY_INFO ==~ /skip-suzieq/) }
    }
    parallel {
        stage('EU') {
            steps {
                sh './tests/prepare-tests-env.sh && ./tests/run-tests.sh ${EU_DC}'
            }
        }
        stage('Asia') {
            agent { label 'deploy-sg' }
        }
Handling False Positives
Every test has a chance of a false positive, i.e. the test reports a problem that isn't real. That's true whether it's a medical test or a test verifying a network change. At Hostinger, we assume that false positives will happen, and that's normal. So, how and when do we handle them?
In our environment, false positives occur mostly due to timeouts or connection errors during the scraping (polling) phase, or when bootstrapping a new device. In such cases, we re-run the tests until they pass (green in the Jenkins pipeline). A permanent failure (most likely a real one) keeps the tests red, which means the PR does not get merged and the changes are not deployed.
However, if we confirm a failure is a false positive, we use the Git commit tag Deploy-Info: skip-suzieq to tell the Jenkins pipeline to skip the tests (as you may have noticed in the pipeline snippet above).
Adding New Tests
We test new or modified tests locally before they land in the Git repository. Unless a test is really trivial, it needs to be run multiple times before we consider it useful. For example:
def bgp_sessions_are_up(self):
    # Test if all BGP sessions are UP
    assert (
        get_sqobject("bgp")().get(namespace=self.namespace, state="NotEstd").empty
    )
But if we are talking about something like this:
def uniq_asn_per_fabric(self):
    # Test if we have a unique ASN per fabric
    asns = {}
    for spine in self.spines.keys():
        for asn in (
            get_sqobject("bgp")()
            .get(hostname=[spine], query_str="afi == 'ipv4' and safi == 'unicast'")
            .peerAsn
        ):
            if asn == 65030:
                continue
            if asn not in asns:
                asns[asn] = 1
            else:
                asns[asn] += 1
    assert len(asns) > 0
    for asn in asns:
        assert asns[asn] == len(self.spines.keys())
This needs to be carefully reviewed. Here we check that AS numbers are unique within the fabric (a single DC): each peer ASN should show up exactly once per spine. ASN 65030 is skipped because it's used for routing on the host, where instances announce anycast services such as DNS and load balancers. This is a snippet of the test output (summary):
test_phx.py::test_bgp_sessions_are_up PASSED
test_phx.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_phx.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_phx.py::test_uniq_asn_per_fabric PASSED
test_phx.py::test_upstream_ports_are_in_correct_state PASSED
test_phx.py::test_evpn_fabric_links PASSED
test_phx.py::test_default_route_ipv4_from_upstreams PASSED
test_phx.py::test_ipv4_host_routes_received_from_hosts PASSED
test_phx.py::test_ipv6_host_routes_received_from_hosts PASSED
test_phx.py::test_evpn_fabric_bgp_sessions PASSED
test_phx.py::test_vlan100_assigned_interfaces PASSED
test_phx.py::test_evpn_fabric_arp PASSED
test_phx.py::test_no_failed_interface PASSED
test_phx.py::test_no_failed_bgp PASSED
test_phx.py::test_no_active_critical_alerts_firing PASSED
test_imm.py::test_bgp_sessions_are_up PASSED
test_imm.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_imm.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_imm.py::test_uniq_asn_per_fabric FAILED
test_imm.py::test_upstream_ports_are_in_correct_state PASSED
test_imm.py::test_default_route_ipv4_from_upstreams PASSED
test_imm.py::test_ipv4_host_routes_received_from_hosts PASSED
test_imm.py::test_ipv6_host_routes_received_from_hosts PASSED
test_imm.py::test_no_failed_bgp PASSED
test_imm.py::test_no_active_critical_alerts_firing PASSED
Here we notice that this DC's test_imm.py::test_uniq_asn_per_fabric test has failed. Since we auto-derive the ASN per switch (there are no static AS numbers in the Ansible inventory), a race can produce a duplicate ASN, which is bad. Or take something like:
def loopback_ipv6_is_uniq_per_device(self):
    # Test if we don't have duplicate IPv6 loopback address
    addresses = get_sqobject("address")().unique(
        namespace=[self.namespace],
        columns=["ip6AddressList"],
        count=True,
        type="loopback",
    )
    addresses = addresses[addresses.ip6AddressList != "::1/128"]
    assert (addresses.numRows == 1).all()
This checks that no device within the same data center uses a duplicate IPv6 loopback address. The rule is valid and has proven itself at least a couple of times: duplicates mostly appear when we bootstrap a new switch and the Ansible host file is copy-pasted.
New tests are mainly added when a failure occurs and we need a way to catch the same problem quickly, or to mitigate it in advance, next time. For instance, if we switch from an L3-only design to EVPN, we might be surprised when ARP/ND tables hit their exhaustion limits, or when the number of L3 routes drops from several thousand to just a few.
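A guardrail for that last scenario could look roughly like the sketch below. This is not one of our published tests; the use of get_sqobject follows the same pattern as the tests above, but the route table filters and the threshold are assumptions:

def ipv4_route_count_is_sane(self):
    # Hypothetical guardrail: fail if any device in this namespace suddenly
    # carries far fewer IPv4 routes than we normally expect.
    MIN_EXPECTED_ROUTES = 1000  # assumed threshold, tune per fabric
    routes = get_sqobject("routes")().get(
        namespace=[self.namespace], vrf=["default"]
    )
    ipv4 = routes[~routes.prefix.str.contains(":")]  # keep IPv4 prefixes only
    per_device = ipv4.groupby("hostname").prefix.count()
    assert not per_device.empty
    assert (per_device >= MIN_EXPECTED_ROUTES).all()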
Batfish
We have already evaluated Batfish twice. The first pass was a kind of overview and dry run to see what it could offer us. The first impression was something like "What's wrong with my configuration?" because, at that time, Batfish didn't support parts of the FRR configuration syntax. FRR is used by Cumulus Linux and many other large projects, and it's becoming the de facto open-source routing suite, which is why Batfish includes FRR as a vendor as well. It's just that the FRR model needs more changes before it can be used in production (at least in our environment).
Later on, a month or two ago, we began investigating the product again to see what could really be done with it. From an operational perspective, it's a really cool product: it constructs a model of the network by parsing configuration files. On top of that, you can create snapshots, make changes, and see how your network would behave. For example, disable a link or a BGP peer and predict the impact before the change goes live.
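As a minimal sketch of that workflow with pybatfish (assuming a running Batfish service and a local directory of configs; the network name, hostnames, and interface names here are made up):

# Build a snapshot from configs, fork it with one link disabled, and compare
# BGP session status between the two snapshots.
from pybatfish.client.session import Session
from pybatfish.datamodel.primitives import Interface

bf = Session(host="localhost")          # assumes a Batfish service on localhost
bf.set_network("hostinger-lab")         # hypothetical network name
bf.init_snapshot("snapshots/configs", name="base", overwrite=True)

# What-if: bring down one fabric uplink before touching production
bf.fork_snapshot(
    "base",
    "leaf1-uplink-down",
    deactivate_interfaces=[Interface(hostname="leaf1", interface="swp51")],
    overwrite=True,
)

before = bf.q.bgpSessionStatus().answer(snapshot="base").frame()
after = bf.q.bgpSessionStatus().answer(snapshot="leaf1-uplink-down").frame()
print(after[after["Established_Status"] != "ESTABLISHED"])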
We also started looking at Batfish as an open-source project, so we could push changes back to the community. Here are a couple of examples of behavior modeling that was missing for our cases:
https://github.com/batfish/batfish/pull/7671/commits/4fa895fd675ae60a257f1e6e10d27348ed21d4a0
https://github.com/batfish/batfish/pull/7694/commits/115a81770e8a78471d28a6a0b209eef7bc34df88
https://github.com/batfish/batfish/pull/7670/commits/10ec5a03c15c48fd46890be4da394170fa6eb03a
https://github.com/batfish/batfish/pull/7666/commits/f440c5202dd8f338661e8b6bd9711067ba8652b6
https://github.com/batfish/batfish/pull/7666/commits/974c92535ecb5eedfe8fd57fc4295e59f2d4639d
https://github.com/batfish/batfish/pull/7710/commits/a2c368ae1b0a3477ba5b5e5e8f8ebe88e4bf2342
But a lot more is still missing. We are big fans of IPv6, but unfortunately, IPv6 is not (yet?) well covered in Batfish's FRR model.
This is not the first time we've run into missing IPv6 support, and, we guess, not the last either. We're looking forward to Batfish getting IPv6 support soon.
Some Best Practice Observations on Testing
We would say that keeping tests small and separate helps avoid throwing spaghetti at the wall. Write simple, understandable tests. If you notice that two tests depend on each other, it's better to restructure them so that each stands on its own.
Some tests can overlap, and if one fails, then the other will too. But that’s good because two failed tests can say more than one, even if they test similar functionality.
To confirm that tests are useful, you have to run and use them daily. Otherwise, there isn’t much point in having them.
If you can guess what may happen in the future, covering the possibility in tests is a good idea unless it’s too noisy.
As always, the Pareto principle is the best answer to what is worth covering with tests and how much. If you cover at least the critical 20% with tests, your network is most likely in good shape.
It's absolutely not worth automating and testing everything you come up with; that's just an additional tax for no reason. Think about the maintainability of those tests with your team and decide accordingly.
What makes us happy is that Suzieq is great by default, and there is no need to write very sophisticated tests in Python. The CLI is really awesome and approachable even for beginners. If you need something exceptional, you are always welcome to write the logic in Python, which is also friendly. Since the data comes back as pandas DataFrames, you can manipulate your network data as much as you want; it's very flexible.
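To show what that looks like in practice, here's a tiny sketch (not from our test suite; the namespace "phx" and the column selection are just for illustration) that pulls BGP state into pandas and filters it:

# The same data the Suzieq CLI shows is available as a pandas DataFrame.
from suzieq.sqobjects import get_sqobject

bgp = get_sqobject("bgp")().get(namespace=["phx"])   # returns a DataFrame
down = bgp[bgp.state == "NotEstd"]                   # not-established sessions only
print(down[["hostname", "peer", "peerAsn", "state"]])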