Deep Dive

by Marek

Our Peering and Transit Network Upgrade

At FAELIX we have been busy at work planning and preparing a major investment into our network, and in September 2019 we finally saw that project realised. This is the story behind our recent router upgrade.

Timeline

While the changeover from our old to our new infrastructure all happened over the course of several maintenance windows in September, the work really began six months earlier.

March 2019: Inception

Earlier in 2019 we identified a number of things which were holding us back from a technical perspective:

  • we had spent almost a year trying to get a router vendor to fix a security issue and were starting to lose faith in their ability to deliver a fix, and so had begun planning our migration away from MikroTik’s hardware and software as a precaution

  • at the time we still hoped that vulnerability would be fixed, in which case we could consider using MikroTik’s CHR as a virtualised router to improve BGP convergence time (their CCR series would take over five minutes to converge) while keeping the convenience of a commodity router and retaining our operational knowledge of the platform

  • but there were some shortcomings in MikroTik’s implementation which were a concern for us, such as the lack of IPv6 multihop route recursion (that is, the inability of their routing protocols to resolve IPv6 routes recursively)

  • and as a signatory to MANRS we wanted to improve our routing security by improving our BGP filtering and perhaps even deploying RPKI validation, and while some of these features were promised in MikroTik’s mythical RouterOS v7, we had no indication of a timeline

There were other factors about scalability, automation, tooling, and flexibility that were also on our pick-list. At this stage our requirements were pointing us towards a complete change of BGP peering/transit router.

April 2019: Hardware

As the deadline for our disclosure of CVE-2018-19299 drew nearer we made our decision. We had narrowed down the options to three or four potential software stacks, all of which were going to run on commodity x86_64 hardware. That meant we could at least order the kit and so we approached several server vendors for quotes. Our requirements were as follows:

  • 4x 10G SFP+ ports (must not be RJ45, cannot be QSFP+ with SFP+ breakout)
  • ideally not Intel NICs because their X710 chipset is buggy
  • would be happy with Mellanox (though I think they only do 2x 10G cards)
  • Chelsio…
  • not 100% sure about Broadcom, though…
  • we can source NICs if required, just need to know how many PCIe slots are available and how wide they are
  • at least 2x 1G RJ45 ports, can be onboard, can be Intel/BCM/etc
  • low CPU power draw, e.g. quad core Xeon-D would be fine; or maybe 16-core Atoms (if they support enough PCIe lanes?)
  • prefer more GHz over more cores (single-thread performance is important)
  • approx 16GB RAM
  • ideally 1U
  • ideally dual PSU and IPMI
  • 2x 128GB SSDs is plenty

In the end we chose Sentral Systems, who had provided us with the most detailed response, meeting all of the criteria we had set, and theirs was also the most cost-effective. The only part of the specification we really had to compromise upon was the Intel X710 NICs, but we felt that we could mitigate this through the combination of “known-good” kernel/driver/firmware versions for the i40e driver and thorough testing.

The hardware was delivered to us on 23rd April, and we wasted no time in starting our testing. We racked the servers in our core Manchester sites, Equinix’s MA1 (Williams House) and MA2 (Reynolds House), so that we could start evaluating our options.

May 2019: Software

The immediate pressure of CVE-2018-19299 was off: MikroTik had fixed the vulnerability and networks running their software were not under threat of the denial of service attack we had discovered. Now we could take a slightly more relaxed approach to building our future network.

When we had specified the hardware we were motivated by a few existing projects and products. Now that we had it, we needed to decide exactly which route we would go down:

  • continue with MikroTik’s RouterOS, using the CHR as a virtualised router
  • buy and deploy 6WIND’s TurboRouter, which uses DPDK userland networking to achieve up to 20 million packets per second per core of routing performance
  • build our own userland router using DPDK/VPP/FD.io, possibly using Gandi’s packet-journey as a starting point
  • or start with a traditional Linux kernel routing solution, with a view to improving its performance with DPDK or XDP at a later date

On May 15th we requested that one of our upstream providers, Cogent, enable some additional BGP sessions on the /29 linknet we have for transit. Our approach for testing was to announce a more specific /24 from the new infrastructure and test it like a greenfield network. Our network architecture meant that we could spin up virtual machines connected to the existing or future BGP edge, or even both at the same time, and evaluate performance, interoperability, and stability of each of the software options we had.
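Announcing a more-specific /24 works as a test harness because routers prefer the longest matching prefix, so traffic for that /24 follows the new edge while everything else stays on the existing one. The sketch below illustrates that selection rule only. The prefixes are private placeholder ranges and the labels are invented for the example; none of them are taken from the real network.

```python
import ipaddress

# Hypothetical routing table: prefix -> which edge announces it.
# These are private placeholder prefixes, not real address space.
ROUTES = {
    ipaddress.ip_network("10.20.0.0/22"): "existing edge",
    ipaddress.ip_network("10.20.3.0/24"): "new edge under test",
}


def best_route(destination, routes=ROUTES):
    """Return the (prefix, label) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, label) for net, label in routes.items() if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda item: item[0].prefixlen)


def next_hop_label(destination):
    """Convenience wrapper returning only the label of the chosen route."""
    route = best_route(destination)
    return route[1] if route else None
```

An address inside the /24 resolves to the edge under test, while an address elsewhere in the covering /22 still resolves to the existing edge, which is what makes the test behave like a greenfield network.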

By the end of May we were fairly sure which approach we were going to take.

We were going to start with VyOS:

  • an open-source (with commercial support) project forked from Vyatta, which has a long legacy of routing behind it
  • an active community to which we could contribute our energy, rather than building something in a silo
  • a Debian-derived distribution, so building upon our familiarity with that flavour of Linux
  • recently changed to using FRRouting as the BGP/OSPF/etc implementation
  • support for SaltStack, fitting in perfectly with our existing automation tooling

We would use this platform as a foundation on which to build systems that could:

  • run salt-minion on the VyOS router
  • pull NIC/address information from Netbox so that our network was built from the documentation (i.e. a single source of truth)
  • take routers’ logical configuration (BGP peerings, etc) from file-based pillar data rather than Netbox’s configuration contexts, as these files could then live under our normal distributed version control system
  • make a new configuration for the router, programme it, and commit it if successful
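The generate, programme, and commit-if-successful flow can be sketched roughly as follows. This is an illustration, not the actual tooling: the pillar layout, the function names, and the `apply`/`commit`/`discard` callables are all hypothetical, the neighbour addresses and remote AS numbers are documentation values, and the `set protocols bgp ...` lines assume VyOS 1.2-style syntax.

```python
def render_bgp_commands(pillar):
    """Turn pillar-style BGP data into an ordered list of VyOS-style 'set' commands."""
    local_as = pillar["bgp"]["local_as"]
    commands = []
    # Sort neighbours so the generated configuration is stable between runs,
    # which keeps version-control diffs small.
    for address, peer in sorted(pillar["bgp"]["neighbors"].items()):
        base = f"set protocols bgp {local_as} neighbor {address}"
        commands.append(f"{base} remote-as {peer['remote_as']}")
        if "description" in peer:
            commands.append(f"{base} description '{peer['description']}'")
    return commands


def deploy(commands, apply, commit, discard):
    """Apply every command, then commit only if all of them were accepted.

    apply/commit/discard are hypothetical callables standing in for
    whatever session the real tooling holds with the router.
    """
    try:
        for command in commands:
            apply(command)
    except Exception:
        discard()  # leave the running configuration untouched
        return False
    commit()
    return True


# Hypothetical pillar data: documentation addresses and AS numbers.
EXAMPLE_PILLAR = {
    "bgp": {
        "local_as": 41495,
        "neighbors": {
            "192.0.2.1": {"remote_as": 64500, "description": "example-transit"},
            "192.0.2.9": {"remote_as": 64501},
        },
    }
}
```

Keeping rendering separate from deployment means the generated commands can be reviewed or diffed before anything touches a router.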

We had a view to open-sourcing this project and contributing something back to the community, something which:

  • showed how to do IRR-based prefix-list generation
  • deployed RPKI in production
  • improved the routing security of our network, our customers’ networks, and by extension that of our peers and upstreams
  • helped other network operators achieve the same level of confidence that we hoped automation would bring to our network
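For readers unfamiliar with RPKI validation, the decision a validating router makes for each route can be sketched as below. This follows the origin validation procedure described in RFC 6811 in simplified form (AS 0 handling is omitted). The `ROA` tuple and `validate_origin` function are names invented for this example, and in a real deployment the router's BGP daemon performs this check using data from a validator rather than code like this.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    """A simplified Route Origin Authorisation (illustrative structure only)."""
    prefix: str
    max_length: int
    origin_as: int


def validate_origin(route_prefix, origin_as, roas):
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA covers the route if the route sits inside the ROA prefix.
        # Check the IP version first: subnet_of() raises on mixed versions.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches when the origin AS agrees and the route
        # is no more specific than the ROA's maximum length allows.
        if roa.origin_as == origin_as and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    # Covered by no ROA at all: not-found.
    return "invalid" if covered else "not-found"
```

Filtering RPKI-invalid routes then amounts to rejecting anything classified as "invalid" while still accepting "not-found", since much of the routing table has no ROA at all.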

In the future roadmap of VyOS we could see:

  • XDP acceleration
  • VXLAN improvements, the possibility of BGP EVPN and BGP unnumbered thanks to FRR
  • potentially VRF and MPLS support

Summer 2019: Development and Testing

During the early part of the summer we spun up virtual machines, tested performance, and improved our tooling. We knew there was a large amount of work in getting ourselves ready for the changeover, and that we could not afford to be under-prepared.

August 2019: Anticipatory Maintenance

Before beginning our router upgrade, we had a significant piece of work we needed to carry out first. Our core hosting switches, distributed across Reynolds House and Williams House, were running fairly old firmware by this point. The system image was over two years old, and we had started to see some deficiencies in its behaviour. In particular, our performance testing of VyOS had begun to highlight some oddities in the control plane of our layer-2 hosting network. For example, LAGs were occasionally dropping in and out of the active state, and we were seeing traffic on the LAG members’ ports instead:

fs110.m.faelix.net#show mac address-table address AA00.0021.ED06

Aging time is 3600 Sec

Vlan     Mac Address           Type        Port
-------- --------------------- ----------- ---------------------
1321     AA00.0021.ED06        Dynamic     Gi1/0/15


fs110.m.faelix.net#show mac address-table address AA00.0021.ED06

Aging time is 3600 Sec

Vlan     Mac Address           Type        Port
-------- --------------------- ----------- ---------------------
1321     AA00.0021.ED06        Dynamic     Po3

We had also employed a new senior network engineer at the start of August, adding to FAELIX’s combined knowledge and experience. As a team we pored over the switch vendor’s release notes to determine whether a new version would address the issues we were seeing. In the end we were unanimous: a firmware update (and mandatory reboot) was the only sensible way forwards. The frequency of the problems was increasing, though, and we were concerned by the vendor’s notes that this was a resource exhaustion issue. Given the risk was switches rebooting at an unpredictable moment, and potentially behaving erratically or malfunctioning entirely during a high-demand period, we decided to schedule urgent maintenance for 19th-20th August to update the firmware across the switch estate.

As we had feared, this maintenance did not go smoothly. The switches, already struggling for memory resources, did not all apply the update the first time. Some crashed while uploading the software. In some cases it took multiple attempts to get the switch to write the software to the device’s flash storage. In one case, the switch returned an error four times during the loading of the new image, but had actually succeeded on one of those attempts, as show switch reported that the new version was ready. Rebooting the switches into the new software caused a “split brain” across the estate, and LAGs split across members went haywire. We needed to physically disconnect power from one of the devices to rectify this situation.

But in the end, the layer-2 work was completed within the maintenance. The hosting network stabilised, and we could go back to business as usual.

September 2019: Deploying the New BGP Edge

We had taken an architectural decision early on to insert our new BGP edge in front of our older routers, and to keep our hosted services behind MikroTik devices for the time being. This gave us the flexibility of our automated VyOS routers at AS41495’s edge, handling peering and transit. And it meant that intra-customer traffic — between e.g. virtual machines and leased lines — would mostly traverse the MikroTik network behind it.

Our first night’s maintenance went very well, and at around six o’clock in the morning on 25th September we celebrated that FAELIX was now validating RPKI.

Our second night’s work was a cable-tidying and -documentation exercise, announced as a maintenance window just in case something untoward were to happen. And, unfortunately, it did. During the follow-on work, a loop was inadvertently created in the management network at Reynolds House. We believe this was caused by a damaged cable which had not been disconnected, and was not creating a link until it was disturbed.

Sep 26 00:00:31 fs110.m.faelix.net-1 TRAPMGR[trapTask]: traputil.c(721) 165059 %% NOTE Link Up: Gi1/0/2
Sep 26 00:00:31 10.13.0.104 interface,info ether07-mgmt link up (speed 1G, full duplex)
Sep 26 00:00:33 lucky.tunnel.cat route,ospf,info OSPFv3 neighbor 46.227.200.1: state change from 2-Way to Init
Sep 26 00:00:33 misty.tunnel.cat route,ospf,info OSPFv3 neighbor 46.227.200.1: state change from Full to Init
Sep 26 00:00:47 misty.tunnel.cat interface,warning ether05-fs112 excessive broadcasts/multicasts, probably a loop
Sep 26 00:06:33 10.13.0.100 route,ospf,error Discarding packet: locally originated
Sep 26 00:14:45 yoyo snmptt[14274]: .1.3.6.1.6.3.1.1.5.3 Normal "Status Events" 149.6.10.132 - A linkDown trap signifies that the SNMP entity, acting in 3 1 2
Sep 26 00:15:22 yoyo snmptt[14274]: .1.3.6.1.6.3.1.1.5.3 Normal "Status Events" 149.6.10.132 - A linkDown trap signifies that the SNMP entity, acting in 4 1 2
Sep 26 00:15:33 10.13.0.104 interface,info ether07-mgmt link up (speed 1G, full duplex)
Sep 26 00:15:34 yoyo snmptt[14274]: .1.3.6.1.6.3.1.1.5.3 Normal "Status Events" 149.6.10.132 - A linkDown trap signifies that the SNMP entity, acting in 5 1 2

The result was a storm of traffic hitting the control plane of our new VyOS routers, which caused numerous issues. Our team brought everything back to normal and continued to monitor the situation for a while before leaving site at the end of the maintenance window.

This had identified a weakness in our VyOS routers, which we addressed in their configurations shortly after: for example, BGP control plane protection was added to the configuration template generation.

We do regret the issues that happened on the second night — not because we feel we made any serious mistakes, but because they somewhat soured the positive outcome of the previous night’s work. But on the whole, we viewed the deployment as a success that would deliver benefits to our customers for a long time to come:

  • automated and templated deployments reduce the risk of fat-finger errors
  • configuration is kept under version control
  • prefix-list generation and filtering of RPKI-invalid routes to maximise BGP security

Celebrating Success and Sharing Learning

This journey has been a fairly long one, with some unfortunate bumps along the way, but the destination has been worth the effort (and in some cases, pain). While the goal has been to improve customer services with a very technical deployment, along the way we have learned a lot, made new relationships with suppliers and partners and projects, forged friendships, and inspired others to think about their network architecture.

We gave a presentation about this journey at NetMcr, and it was well received. Some network operators have been motivated to look at their own tooling, deploy RPKI validation, or even consider the same stack and approach that we chose. And we hope to open-source our tools soon to help them on their journeys.