I recently started investing some time in the ‘cool kid stuff’: VMware Enterprise PKS and Kubernetes. I started by reading some blogs, watching some videos online and reading the official documentation. After a while, I started deploying PKS in my lab and tried to integrate it with NSX-T. The blogs I was using did not talk about dynamic routing at all; they only used static routes. Even the Pivotal documentation set does not cover BGP setup in much detail, except for Multi-Tenant PKS deployments. I’m not talking about other dynamic routing protocols because NSX-T, at least in the version used here, only supports BGP.
Most customers who are deploying or considering PKS run dynamic routing protocols in their environments. Therefore, I’m writing this blog to share how I did my BGP setup within NSX-T and PKS, specifically the route redistribution part.
I’m not going to dig into BGP neighbor settings, AS numbers and so forth, since they don’t really matter here. It is the route redistribution part that matters.
My PKS subnetting looks like this:
PKS Floating/Load Balancer Pool: 172.20.16.0/24
PKS Nodes: 10.20.0.0/16
PKS Pods: 10.30.0.0/16
As you might know, the PKS subnetting is defined in NSX-T prior to the PKS installation. You have to create two IP blocks, one for the PKS Nodes and one for the PKS Pods, and an IP pool that is used for VIP addresses on an NSX-T load balancer.
When PKS is deployed and a Kubernetes cluster is created, it will configure tier-1 routers in NSX-T automagically. The subnets used for the PKS Nodes (aka Master and Worker VMs) are advertised by the tier-1 towards your tier-0 router, which is peering with your physical infrastructure using BGP. The IP addresses used for the Load Balancer are advertised to the tier-0 router as well. The PKS Pods subnets are advertised too, but we don’t really need them in the routing table of our physical network devices. You only need to advertise them into the routing table of the physical network devices when you use routed pods.
Usually, you will provide a number of CIDRs to the customer’s network team so they can update their ingress route-maps accordingly. The problem with this is that PKS carves out a /24 subnet for each Kubernetes cluster from the IP block for the PKS Nodes, and single IP addresses for the Load Balancer VIPs. Those are announced as /24 and /32 routes, so your routing table will look like this:
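Before creating the IP blocks and the IP pool in NSX-T, it can help to sanity-check the plan. The following is a minimal Python sketch, using only the standard ipaddress module and the three ranges listed above, that checks none of them overlap. The names plan and overlapping_pairs are made up for this illustration and are not part of NSX-T or PKS.

```python
import ipaddress
from itertools import combinations

# The three ranges from the subnet plan in this post.
plan = {
    "PKS Floating/Load Balancer Pool": ipaddress.ip_network("172.20.16.0/24"),
    "PKS Nodes": ipaddress.ip_network("10.20.0.0/16"),
    "PKS Pods": ipaddress.ip_network("10.30.0.0/16"),
}


def overlapping_pairs(networks):
    """Return the names of every pair of networks that overlap."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# An empty list means the plan has no overlapping ranges.
print(overlapping_pairs(plan))
```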
B>* 10.20.0.0/24 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 10.30.0.0/24 [20/0] via 192.168.77.1, eth1, 00:00:00
B>* 10.30.1.0/24 [20/0] via 192.168.77.1, eth1, 00:00:00
B>* 10.30.2.0/24 [20/0] via 192.168.77.1, eth1, 00:00:00
B>* 10.30.3.0/24 [20/0] via 192.168.77.1, eth1, 00:00:00
B>* 10.30.4.0/24 [20/0] via 192.168.77.1, eth1, 00:00:00
B>* 10.30.5.0/24 [20/0] via 192.168.77.1, eth1, 00:00:00
B>* 172.20.16.10/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.11/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.12/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.13/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.14/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.15/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.16/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.17/32 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.18/32 [20/0] via 192.168.77.1, eth1, 00:00:42
If the networking team updated their route-maps with the subnets you configured upfront in NSX-T as IP blocks (the /16 and /24 CIDRs), the more specific /24 and /32 routes shown above will not match those entries and will not be accepted into the routing table. You also don’t want to define every /24 and /32 in the route-maps, because that is too much work and you don’t want that many entries in your routing table either.
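To make the mismatch concrete, here is a small Python sketch. It assumes the physical router’s route-map only accepts a route whose prefix is exactly one of the provided CIDRs; how a real device matches depends on how its prefix lists are written, so treat exact_match as an assumption. The helper names are hypothetical.

```python
import ipaddress

# CIDRs handed to the network team (from the subnet plan).
provided = [ipaddress.ip_network(c) for c in ("172.20.16.0/24", "10.20.0.0/16")]

# A few of the routes the tier-0 actually announces (from the table above).
advertised = [
    ipaddress.ip_network(c)
    for c in ("10.20.0.0/24", "172.20.16.10/32", "172.20.16.11/32")
]


def exact_match(route, allowed):
    """Assumed route-map behaviour: the prefix must equal an allowed CIDR."""
    return route in allowed


def covered(route, allowed):
    """True when the route falls inside one of the allowed CIDRs."""
    return any(route.subnet_of(block) for block in allowed)


for route in advertised:
    # Every route is inside a provided CIDR, yet none is an exact match.
    print(route, "exact:", exact_match(route, provided), "covered:", covered(route, provided))
```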
So how to solve this?
Navigate to the IP prefix lists in the routing configuration of your tier-0 router. Create a new IP prefix list called ‘Allowed Prefixes’ and add the CIDRs for the PKS Floating/Load Balancer Pool and the PKS Nodes to the prefix list. Make sure you set both ge and le to 24 for the PKS Nodes subnet, since its subnets are advertised as /24 by the tier-0 router. Set ge and le to 32 for the Load Balancer pool, since the VIPs are advertised as single /32 IP addresses. Set the action to permit for all networks in this prefix list. As you can see, the PKS Pods subnet is not added to this prefix list; this is not required because I am not running routed pods.
The next step is to create the ‘Denied Prefixes’ list. This is fairly easy, since we only have to define everything that is left: 0.0.0.0/0. Set the action to deny, obviously.
The next step is to create a route map and add the prefix lists in the right sequence. Navigate to the route maps in the routing configuration of your tier-0 router. Make sure that the Allowed Prefixes list is processed first and its action is set to permit. The Denied Prefixes list must be set to deny.
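The effect of ge and le can be sketched in a few lines of Python. This follows the conventional prefix-list semantics (the route must fall inside the configured network and its prefix length must lie between ge and le); I am assuming NSX-T behaves the same way, and the PrefixEntry class and is_allowed function are made up for this illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class PrefixEntry:
    network: ipaddress.IPv4Network
    ge: int  # minimum prefix length that matches
    le: int  # maximum prefix length that matches

    def matches(self, route):
        """Route must sit inside the network and have a length in [ge, le]."""
        return route.subnet_of(self.network) and self.ge <= route.prefixlen <= self.le


# The 'Allowed Prefixes' list as described in this post.
ALLOWED_PREFIXES = [
    PrefixEntry(ipaddress.ip_network("10.20.0.0/16"), ge=24, le=24),    # PKS Nodes
    PrefixEntry(ipaddress.ip_network("172.20.16.0/24"), ge=32, le=32),  # LB pool
]


def is_allowed(route):
    route = ipaddress.ip_network(route)
    return any(entry.matches(route) for entry in ALLOWED_PREFIXES)


for candidate in ("10.20.0.0/24", "172.20.16.10/32", "10.30.1.0/24", "10.20.0.0/16"):
    print(candidate, is_allowed(candidate))
```

In this model the node /24 and the VIP /32 match, while the pod subnet and the bare /16 do not, which is why the ge and le values have to follow the prefix lengths that are actually advertised.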
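Why the sequence matters can be shown with a simplified first-match model of a route map in Python. This is a sketch under assumptions: the route map is evaluated top down and stops at the first matching entry, and 0.0.0.0/0 in the deny list is modelled as ‘everything that is left’, which is how it is used in this post. All function names are hypothetical.

```python
import ipaddress

NODES = ipaddress.ip_network("10.20.0.0/16")
LB_POOL = ipaddress.ip_network("172.20.16.0/24")


def in_allowed_prefixes(route):
    # Node subnets are matched as /24, load balancer VIPs as /32.
    return (route.subnet_of(NODES) and route.prefixlen == 24) or (
        route.subnet_of(LB_POOL) and route.prefixlen == 32
    )


def in_denied_prefixes(route):
    # Assumption: 0.0.0.0/0 acts as a catch-all for every remaining route.
    return True


def evaluate(route_map, route):
    """Return the action of the first entry whose prefix list matches."""
    route = ipaddress.ip_network(route)
    for matches, action in route_map:
        if matches(route):
            return action
    return "deny"  # assumed implicit deny when nothing matches


correct_order = [(in_allowed_prefixes, "permit"), (in_denied_prefixes, "deny")]
wrong_order = list(reversed(correct_order))

print(evaluate(correct_order, "10.20.0.0/24"))  # permit
print(evaluate(correct_order, "10.30.1.0/24"))  # deny
print(evaluate(wrong_order, "10.20.0.0/24"))    # deny: the catch-all wins
```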
Before implementing the route-map, we first need to configure route aggregation on our tier-0 router. This needs to be done in order to summarize the subnets that are being advertised as /24 and /32 into the /16 and /24 of our initially defined CIDRs. Navigate to the BGP configuration of your tier-0 router and click on edit. Add the Load Balancer and PKS Nodes subnets to the route aggregation list and save. From now on, the subnets are summarized when they are redistributed to BGP neighbors.
The last step is to activate the route-map we created. Navigate to the Route Redistribution section in the routing configuration of your tier-0 router and select the route-map we created earlier. Also make a note of the sources that I used for my route redistribution.
After saving the config, the BGP neighbor that is peering with your tier-0 router should have only two entries in its routing table, as you can see in the screenshot below. The ‘192.168.77.1’ in this case is my tier-0 router.
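The summarization itself can be illustrated with a short Python sketch. It assumes the aggregation suppresses the more specific routes and announces only the summary, which matches the end result described in this post; the summarize function is a hypothetical helper, not an NSX-T API.

```python
import ipaddress

# The two CIDRs added to the route aggregation list.
aggregates = [
    ipaddress.ip_network("172.20.16.0/24"),  # Load Balancer pool
    ipaddress.ip_network("10.20.0.0/16"),    # PKS Nodes
]

# Routes permitted by the route-map: one node /24 and nine VIP /32s,
# taken from the routing table shown earlier.
permitted = [ipaddress.ip_network("10.20.0.0/24")] + [
    ipaddress.ip_network(f"172.20.16.{host}/32") for host in range(10, 19)
]


def summarize(routes, aggregates):
    """Replace each route by its covering aggregate, if there is one."""
    announced = set()
    for route in routes:
        parent = next((agg for agg in aggregates if route.subnet_of(agg)), None)
        announced.add(parent if parent is not None else route)
    return sorted(announced)


summary = summarize(permitted, aggregates)
print(len(permitted), "routes in,", len(summary), "routes out")
print([str(net) for net in summary])
```

Ten specific routes collapse into the two CIDRs that were handed to the network team in the first place.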
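If you prefer to verify this from a text dump rather than by eye, a rough Python sketch like the one below can help. It assumes the neighbor prints routes in the same ‘B>* prefix [distance/metric] via next-hop, ...’ format as the table earlier in this post; the regular expression and function names are my own and only cover that format. The sample input reuses two lines from the table before the change, so both are flagged; after the change the list should come back empty.

```python
import ipaddress
import re

# Matches lines such as:
# B>* 10.20.0.0/24 [20/0] via 192.168.77.1, eth1, 00:00:42
ROUTE_LINE = re.compile(r"^B>\*\s+(\S+)\s+\[\d+/\d+\]\s+via\s+(\S+),")

EXPECTED = {
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("172.20.16.0/24"),
}


def bgp_prefixes(table_text, next_hop):
    """Collect the BGP prefixes learned from the given next hop."""
    prefixes = []
    for line in table_text.splitlines():
        match = ROUTE_LINE.match(line.strip())
        if match and match.group(2) == next_hop:
            prefixes.append(ipaddress.ip_network(match.group(1)))
    return prefixes


def unexpected_prefixes(table_text, next_hop, expected=EXPECTED):
    """Return every learned prefix that is not one of the expected summaries."""
    return [p for p in bgp_prefixes(table_text, next_hop) if p not in expected]


before_change = """\
B>* 10.20.0.0/24 [20/0] via 192.168.77.1, eth1, 00:00:42
B>* 172.20.16.10/32 [20/0] via 192.168.77.1, eth1, 00:00:42
"""

print(unexpected_prefixes(before_change, "192.168.77.1"))
```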
sources:
vmware.com
pivotal.io
versions:
Enterprise PKS 1.5
NSX-T 2.4.2
If you have any questions or remarks, feel free to reach out!