Exploring Telekom Malaysia’s RPKI deployment

By on 16 Jan 2024

Category: Tech matters

Adapted from Omar Elsharawy's original at Unsplash.

Today, the acronyms RPKI and ROA are nothing new to the network engineer. But back when I first learned about them during APNIC 46 in 2018 (where I was a Fellow), those two acronyms were rarely mentioned in Malaysia.

As time passed, I became more aware of news and APNIC Blog posts about route hijacks, even from the top-tier providers.

Figure 1 — Route Origin Validation (ROV) in Malaysia over time.

This rang alarm bells for me. Malaysia needed RPKI. As doctors always say, ‘prevention is better than cure’.

Figure 2 — The components of RPKI.

A little bit about Telekom Malaysia

Founded in 1984, Telekom Malaysia (TM) started as the national telecommunications company for fixed line, radio, and television broadcasting services. It is now the economy’s largest provider of broadband services, data, fixed line, pay television, and network services. TM ventured into the LTE space with the launch of TMgo, its 4G offering. TM’s 850 MHz service was rebranded as ‘Unifi Mobile’ in January 2018.

TM now has more than 28,000 employees and a market capitalization of more than RM 25B.

How we began our journey

The objective was clear. We needed to install RPKI validators into our BGP systems and ensure our ROAs were updated. Knowing what to configure and how to do it wasn’t enough to modify the router configuration. We didn’t want to be in deep water without a lifebelt.

We planned and organized our engagement with the senior levels of the organization. Decision makers aren’t necessarily network engineers; some are mid-tier managers. Thus, the approach was to show them a proper flow in a comprehensible network diagram. There is a reason why the idiom ‘a picture paints a thousand words’ exists. Though I am no Picasso, I did my best to depict what I wanted to deliver. The proposal received the attention and consideration it needed to proceed.

Figure 3 — How the ‘protection mechanism’ to help drop invalid routes works.

Calling the shots

That was the easy part. This is where the heavy lifting began. First, I identified team members from the development team, the network operations team, and some representatives from the lab team. All team members needed to share the same view from the beginning. It was also important not to forget the vendors on whose equipment we would later enable RPKI validation. Multiple vendors supply our provider edge (PE) router nodes, and we needed to know what challenges we might face with each of them.

Now let’s ‘break a leg’! (Why ‘leg’? It’s because we need our hands and fingers to configure things and troubleshoot later!).

In the lab, we updated the software on all nodes, since we noticed that certain vendors could not activate validation on their older versions. Choosing which validators to use was a bit tricky, since there are multiple options with multiple installation methods. I decided it would be wise to learn from the experiences of others.

Our main reference was the APNIC Technical team who were able to advise which validators others are using and how they approached validation. The keyword here is ‘ask’. There are several experienced engineers out there who are open to giving help and advice. Just ask!

Figure 4 — Deployment timeline.

Testing — Lab setup and simulation

So, I have the planned sketch ready up until dropping the invalids. At this stage, I wished I could borrow Thanos’ Infinity Gauntlet, just for a few minutes. But, one can only dream, and luckily, I am surrounded by a team of lifesavers.

I wrote down the Acceptance Test Procedure (ATP) for what needed to be tested and validated. My goal was to gather all information on how the lab model ran and assess the load of the router (CPU and memory utilization) to ensure that the test validation wouldn’t harm the traffic running behind it. Three validators had been set up and there were a few problems, such as:

  • Lost connection to the logical switch since we were using Virtual Machines (VMs).
  • Somebody mistakenly downgraded the PE OS causing the config to be stale since that version doesn’t support RPKI.

There were also other minor problems. Despite all the challenges, I still sleep soundly at night and have time to watch movies. I’d rather break things in the lab than bump into them later in production.

Table 2 shows some of the test cases validated covering multiple vendors and the results of the testing.

Test case                                  | Vendor A | Vendor B | Vendor C | Vendor D
1. Dual peer validator                     | OK       | OK       | OK       | OK
2. BGP route status                        | OK       | OK       | OK       | OK
3. Drop Invalid                            | OK       | OK       | OK       | OK
4. Add comm for Unknown route              | OK       | OK       | OK       | OK
5. Modify local pref for Unknown route     | OK       | OK       | OK       | OK
6. Whitelist                               | OK       | NA       | NA       | OK
7. Validator 1 down                        | OK       | OK       | OK       | OK
8. Validator 2 down while 1 still down     | OK       | OK       | OK       | OK
9. Validator up at the same time           | OK       | OK       | OK       | OK
10. Route status when both validators fail | OK       | OK       | OK       | OK
Table 2 — Some of the test cases being validated and the status of each vendor’s ability.
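
Test cases 3 to 6 in Table 2 describe a per-route import decision. The Python sketch below models that decision in a vendor-neutral way. It is only an illustration: the `Route` class, the community value `64496:999` and the local preference of 90 are hypothetical placeholders, not TM’s production policy or any vendor’s syntax.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical placeholder values, chosen only for illustration.
UNKNOWN_COMMUNITY = "64496:999"  # uses a documentation ASN
UNKNOWN_LOCAL_PREF = 90          # any value below the default of 100


@dataclass
class Route:
    prefix: str
    state: str  # "valid", "unknown" or "invalid"
    local_pref: int = 100
    communities: Set[str] = field(default_factory=set)


def apply_import_policy(route: Route,
                        whitelist: frozenset = frozenset()) -> Optional[Route]:
    """Return the route to install, or None if it should be dropped."""
    if route.state == "invalid":
        # Test cases 3 and 6: drop invalids unless explicitly whitelisted.
        return route if route.prefix in whitelist else None
    if route.state == "unknown":
        # Test cases 4 and 5: tag the route and change its local preference.
        route.communities.add(UNKNOWN_COMMUNITY)
        route.local_pref = UNKNOWN_LOCAL_PREF
    return route
```

Routes in the Valid state pass through unchanged, so only the Unknown and Invalid branches need explicit handling.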
Figure 5 — Sample of the CPU and memory utilization which we monitored.

As Optimus Prime once said, ‘roll out!’

With all the lab results looking good, it was time to bring it into production, but rushing to install everything is never a good option. I wanted to first deploy a pilot of five nodes for each vendor and monitor their performance. A proper Method of Procedure (MoP) document was written to ensure the team stayed on track with which node to install first and how many nodes per night.

Controlling the number of nodes in deployment did come at a cost. During validation, we bumped into two issues. First, we saw that introducing a new configuration triggered the other running routes in the router to refresh. Internet traffic was okay, but for tunnelled traffic running over Generic Routing Encapsulation (GRE) or IPsec tunnels, the refresh had the potential to cause the tunnel to flap and trigger unnecessary alarms. For this situation, we could skip the command line options in question and still meet the objective, thus minimizing the required configuration change.

Second, we found that one vendor’s PE would send a route refresh message to the route reflector each time it received a new state and updated its ROA database from the validators. Those route refresh messages caused the route reflector to resend the full Internet prefix set to the node, consuming CPU unnecessarily at the route reflector. Upon confirmation of the incident, I decided to remove the validation config from those PEs and ask the vendor for a fix. We later received patches that solved the issue.

Validation-state                | RFC 8097
origin-validation-state-invalid | 0x4300:0.0.0.0:2
origin-validation-state-unknown | 0x4300:0.0.0.0:1
origin-validation-state-valid   | 0x4300:0.0.0.0:0
Table 3 — Community tagged sample for the ROA validation.
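
The values in Table 3 map onto the 8-octet extended community defined in RFC 8097: a type octet of 0x43, a sub-type octet of 0x00, five reserved octets set to zero, and a final octet carrying the state (0 for valid, 1 for not found or unknown, 2 for invalid). The sketch below packs and unpacks that layout in Python. The function names are made up for this example, and `to_text` simply mirrors the notation used in Table 3.

```python
import struct

# Validation state values from RFC 8097, named as in Table 3.
STATES = {0: "valid", 1: "unknown", 2: "invalid"}

# Type 0x43, sub-type 0x00, five reserved (zero) octets, one state octet.
_LAYOUT = "!BB5xB"


def encode_ov_state(state: int) -> bytes:
    """Pack a validation state into the 8-octet extended community."""
    if state not in STATES:
        raise ValueError(f"unsupported validation state: {state}")
    return struct.pack(_LAYOUT, 0x43, 0x00, state)


def decode_ov_state(community: bytes) -> str:
    """Return the state name carried in an 8-octet extended community."""
    if len(community) != 8:
        raise ValueError("an extended community is exactly 8 octets")
    type_, subtype, state = struct.unpack(_LAYOUT, community)
    if (type_, subtype) != (0x43, 0x00) or state not in STATES:
        raise ValueError("not a recognised origin validation state community")
    return STATES[state]


def to_text(state: int) -> str:
    """Render the state in the notation shown in Table 3."""
    return f"0x4300:0.0.0.0:{state}"
```

For example, `encode_ov_state(2).hex()` gives `4300000000000002`, which corresponds to the `0x4300:0.0.0.0:2` entry for invalid routes.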

Working with multiple vendors’ equipment means each has its own configuration syntax, not to mention its own verification commands. We stuck to the goal of determining whether the router drops ‘Invalid’ routes when necessary. Where we found a lack of support for some functions, as long as it wasn’t a major issue or a blocker, we let it be. Table 4 shows the different default values that we saw, which we standardized wherever possible. We recorded the numbers in the MoP document for reference.

Validator timer     | Vendor A      | Vendor B    | Vendor C            | Vendor D      | All nodes
refresh-time (s)    | 300 (5m)      | 300 (5m)    | 1,800 (30m)         | 300           | 600 (10m)
hold-time (s)       | 600 (10m)     | 600 (10m)   | 1,800×3 (90m) Fix   | 600           | 1,200 (20m)
record-lifetime (s) | 3,600 (60m)   | 3,600 (60m) | 3,600 (60m)         | = hold-time   | 3,600 (60m)
preference (s)      | 1..200 > best | NA          | NA                  | 1..10 < best  |
white-list invalid  | YES           | YES         | NA                  | NA            |
Table 4 — Timer and preference used, standardized across vendors.

Challenges

So, we’d finished preparing our nodes for validation, all policies were in place, all sessions to our multiple validators were up and running…

Let’s drop the invalids!

Wait, are you sure? What happens if legitimate routes are dropped because the ‘Invalid’ state was caused by an honest mistake the IP holder made while updating their ROA?

We needed to look at this from another perspective. We collected all the invalids received from all our upstream and peering relationships. We segregated all those IPs and identified how much traffic was flowing from them. We then filtered for those whose traffic exceeded 100 Mbps, which left about 12 IP addresses, and explored why they were being flagged as invalid.
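
The filtering step can be sketched in a few lines of Python. The prefixes and traffic figures below are invented documentation values, and `heavy_invalids` is a hypothetical helper, not the tooling we actually used.

```python
from typing import Dict, List, Tuple

THRESHOLD_BPS = 100_000_000  # 100 Mbps, the cut-off described above


def heavy_invalids(traffic_bps: Dict[str, int],
                   threshold: int = THRESHOLD_BPS) -> List[Tuple[str, int]]:
    """Return (prefix, bit/s) pairs above the threshold, busiest first."""
    heavy = [(prefix, bps) for prefix, bps in traffic_bps.items()
             if bps > threshold]
    return sorted(heavy, key=lambda item: item[1], reverse=True)


# Hypothetical sample: invalid prefixes and their measured traffic in bit/s.
sample = {
    "192.0.2.0/24": 250_000_000,
    "198.51.100.0/24": 40_000_000,
    "203.0.113.0/24": 120_000_000,
}

for prefix, bps in heavy_invalids(sample):
    print(f"{prefix}: {bps / 1_000_000:.0f} Mbps")
```

Sorting by traffic puts the prefixes whose loss would hurt most at the top of the list, which is a natural order for contacting their holders.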

For eight prefixes, the ROA covered up to a /22 but the advertisement was actually a /24. We contacted the holders and asked them to fix their ROAs before TM dropped the invalids, and we dropped the route that was obviously invalid because it was originated by a different ASN.
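
Both failure modes follow from the origin validation rules in RFC 6811: a route is Valid only if some covering ROA matches its origin ASN and the announced prefix is no longer than that ROA’s maximum length. If covering ROAs exist but none match, the route is Invalid. If no ROA covers it, the state is NotFound (Unknown). Below is a minimal Python sketch of that logic. The `Roa` class and the sample prefixes and ASNs are illustrative assumptions, using benchmarking and documentation ranges.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Iterable


@dataclass(frozen=True)
class Roa:
    prefix: str      # prefix authorized by the ROA
    max_length: int  # longest prefix length the holder allows
    asn: int         # authorized origin ASN


def validate(route_prefix: str, origin_asn: int, roas: Iterable[Roa]) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # A ROA is relevant only if it covers the announced prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


# A /24 announced under a ROA that only allows up to /22 is invalid.
too_specific = validate("198.18.1.0/24", 64496, [Roa("198.18.0.0/22", 22, 64496)])
# Raising the ROA's maximum length to 24 makes the same route valid.
fixed = validate("198.18.1.0/24", 64496, [Roa("198.18.0.0/22", 24, 64496)])
# A different origin ASN is invalid regardless of prefix length.
wrong_origin = validate("198.18.0.0/22", 64511, [Roa("198.18.0.0/22", 22, 64496)])
```

The first case is what the eight prefixes ran into, and it is fixed on the holder’s side by updating the ROA. The last case cannot be fixed by a maximum length change, which is why a mismatched origin is treated as obviously invalid.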

Figure 6 — Sample of the output where we validate the traffic consumption of the Invalid IPs.

Let’s spread the news

As per the plan, we were then validating the final MoP for dropping invalids and were seeking to share the news. We were meant to have a slot during APNIC 56 in Kyoto, but I was not able to make it (oh gosh, that made me remember Mount Fuji. The last time I saw it was while watching Pacific Rim Uprising. Even the alien wanted to go there! Ok, let’s get back on track).

We began by updating our email signature for the team that communicates with our upstream and peering partners. For example:

“TM AS4788 had recently installed RPKI Validators and will drop “Invalid” routes by November 2023. Please update your ROA accordingly.”

An email notification template was also prepared and broadcast. We also planned to place the same remarks in RADb, PeeringDB, and our whois contact.

Figure 7 — Example from AS4788 RADb page.
Figure 8 — Example from AS4788 PeeringDB page.
Figure 9 — Example from AS4788’s whois page.

This blog post is another way we are sharing this news.

Good stories finish early

Although we had some delay in deploying RPKI, we got there in the end.

Was it difficult to deploy? No, despite my team and I never having done this before. We did have hiccups, but they didn’t stop us. We worked our way through with the support of others: APNIC, our vendors, and our upstream partner. As already mentioned, if you need help, just ask.

If I could turn back the clock, would I do it differently? No, this was the right path for us. For others planning to deploy, go for it.

To finish, I would like to quote Helen Keller: “Alone we can do so little; together we can do so much”.

Check out the RPKI @ APNIC portal for more information on RPKI, useful deployment case studies, how-to posts, and links to hands-on APNIC Academy lessons and labs.
