Demystifying OSPF on a vPC

A common question among network engineers managing the Cisco Nexus platform is how to form routing adjacencies with hosts across a Virtual Port Channel (vPC) peer link. Back in the day, when engineers wanted a multi-chassis port-channel to increase redundancy and resiliency, you had to use a Nexus 5500 or 7000 with 2200 Fabric Extenders hanging off of them. The basic idea was to use the 7000s/5500s at the data center distribution layer and the 2200s at the top of rack, then run redundant connections from the FEXs to each core. The problem with this architecture is the lack of support for route update traffic across a vPC peer link. vPC is a layer-2 virtualization technology, and Cisco hasn’t supported routing traffic across these links. That is, until recently.

New code makes this feature easy, but if you haven’t, or can’t, upgrade beyond the early 7-oh train, you’ll need this information to achieve your goal in the data center. There are some fundamentals to clear up before jumping into the configuration. Virtual Port Channels require two components in the physical topology, beyond the pair of Nexus switches themselves: the vPC peer link and the vPC peer-keepalive link. Without these two connections configured on the Nexus core, vPCs do not happen. The vPC peer link is the more important of the two within the vPC configuration. It “fools” connected devices into seeing the switch pair as a single logical switch with one control plane. This link acts as the transport for Bridge Protocol Data Units (BPDUs), Link Aggregation Control Protocol (LACP) packets, MAC address synchronization between the aggregation groups, and IGMP synchronization when snooping is enabled.

If the Nexus pair runs HSRP on its SVIs, for example, the peer link carries the HSRP frames between the switches. Each VLAN ID must be configured on both switches and allowed across the peer link, assuming VLAN pruning is configured on the dot1q trunks. The sample configuration below shows the vPC peer link on two Nexus 9372s. The example topology has a single Cisco ISR4321 router connected to LAB-CS1. While this is simple, it illustrates the problem we’re solving.
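Here is a minimal sketch of that peer link configuration on LAB-CS1 (the domain ID, port-channel number, and member interfaces are assumptions; LAB-CS2 mirrors it):

  feature vpc
  feature lacp
  !
  vpc domain 10
  !
  interface port-channel10
    description vPC peer link to LAB-CS2
    switchport mode trunk
    switchport trunk allowed vlan 1,192
    spanning-tree port type network
    vpc peer-link
  !
  interface Ethernet1/49-50
    description vPC peer link member ports
    switchport mode trunk
    channel-group 10 mode active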

The peer-keepalive is a routed link, or path, used to detect and resolve a dual-active condition. While a vPC pair is actively forwarding traffic, only one switch can be the master, the one “ring” to rule the system. If both switches believe they are active, the peer-keepalive is the link used to resolve the problem. Think of it as the Vanilla Ice of vPC circuits: if there’s a problem, it’ll solve it. When configuring this link, it’s best to use a dedicated vPC-management VRF (Virtual Routing and Forwarding instance) across the dedicated management interface of the Nexus switch. Additionally, this link does not need to be directly connected between the two switches, but if you attach it to a network switch, be certain to place it on its own VLAN to isolate the management traffic.
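A sketch of the keepalive piece, using the built-in management VRF over the mgmt0 interface (the addresses and domain ID are assumptions; mgmt0 lives in the management VRF by default):

  vpc domain 10
    peer-keepalive destination 192.168.100.2 source 192.168.100.1 vrf management
  !
  interface mgmt0
    ip address 192.168.100.1/24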

This brings us to the heart of the vPC peer link routing problem. What happens when an upstream device needs to exchange route information? What if that device needs to form an adjacency with every routing host on the broadcast domain? This is where you need to be careful with your architecture.

The Scenario

vPC/OSPF Lab Topology

In this example topology you have a Cisco ISR4321 running OSPF upstream of the spine Nexus 9372 switches. The spine uses an SVI (VLAN192, 192.168.1.0/29) as the egress zone to the ISR4321. If you configure just the basics of the vPC, SVI, and OSPF, you will see the OSPF adjacency traffic forwarded along the vPC peer link, and the routing protocol will fail on LAB-CS2. If the active path is on LAB-CS1, OSPF attempts to form the adjacency on both LAB-CS1 and LAB-CS2, but gets stuck on LAB-CS2: one side nails up while the other flaps or remains in EXSTART. Convergence is also slower, because during an upstream device failure the OSPF process has to finish the neighbor adjacency process, exchange routes, and only then forward traffic. The ISR4321 connection to LAB-CS1 is known as an “orphan port” because it isn’t multi-homed between the chassis. If you are having trouble forming routing adjacencies in this topology, you will see something like this.
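An illustrative mock-up of the stuck neighbor on LAB-CS2 (exact formatting varies by NX-OS release):

  LAB-CS2# show ip ospf neighbors
   OSPF Process ID 1 VRF default
   Total number of neighbors: 1
   Neighbor ID     Pri State            Up Time  Address         Interface
   192.168.1.1       1 EXSTART/BDR      00:00:27 192.168.1.1     Vlan192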

Cisco did not support this topology prior to NX-OS 7.0(3)I5, so there are only limited workarounds for this scenario. If your organization is like the majority of enterprises, you will find yourself in this kind of configuration eventually. Hopefully you can fix it before a) it’s in production, or b) the network is down. So what do you do when you need the interface redundancy and resiliency but also need layer-3 reachability?

Option One – Prune Your VLANs

If your upstream devices are peering to an SVI on the switch, your best option is to:

  • Create another ISL trunk between the core switches
  • Prune the L3 Egress SVI from the vPC peer link
  • Allow the VLAN across the new non-vPC ISL trunk

This should make sense if you remember that vPC is a layer-2 virtualization technology. Pruning the layer-3 transit VLAN from the vPC peer link and moving it to another trunk gives the upstream device a path over which to form its routing protocol adjacencies. The sample configuration below shows the change on the two switches with the vPC peer link; VLAN192 is the egress VLAN that uplinks to the HA pair of Internet-facing firewalls. To address the architectural limitation, add another link to carry the layer-2 traffic between the switch spines, prune the appropriate VLAN from the peer link, and allow it on the new ISL. That VLAN will then be carried across the new port-channel and OSPF will nail up properly.
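A sketch of those three steps on one spine (the port-channel and interface numbers are assumptions; apply the equivalent on both switches):

  ! New non-vPC layer-2 transit ISL
  interface port-channel20
    description non-vPC L2 transit ISL to peer spine
    switchport mode trunk
    switchport trunk allowed vlan 192
  !
  interface Ethernet1/47-48
    switchport mode trunk
    channel-group 20 mode active
  !
  ! Prune the L3 egress VLAN from the vPC peer link
  interface port-channel10
    switchport trunk allowed vlan remove 192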

After you create the new ISL and prune the VLAN on the spine switches, there is one more step to get OSPF to nail up. There is a subtle limitation in the Nexus 9000 platform that causes some strange results. If you just prune the VLAN from the peer link and add it to the new ISL, you may still see one of the upstream adjacencies stuck in EXSTART. And if you don’t, it’s only a matter of time before you start chasing outage-ghosts as routes fail for some mysterious reason. The reason is that the Nexus 9000s share the SVI MAC addresses across the vPC peer link. This issue is well beyond the scope of this article, but if you want to read about it, see Cisco’s documentation, the NX-OS Interfaces Configuration Guide and its supported topologies for the Nexus 9000 platform. To resolve the MAC address issue, you need to add a static MAC address on the SVIs you pruned from the peer link. This essentially tells NX-OS not to punt the local-to-switch ARP requests to the CPU, so one chassis stops answering for the other. Strange, eh? To have each switch answer its own routing protocol traffic, simply add a static MAC to the SVI. In this scenario I set the first 16 bits to all zeros, the second 16 bits to the VLAN ID, and the last 16 bits to identify which switch the MAC is assigned to. This makes the interface easily identifiable when troubleshooting in the future.
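Following that convention for VLAN192, the SVIs would look something like this (the exact addresses are just the convention described above):

  ! LAB-CS1
  interface Vlan192
    mac-address 0000.0192.0001

  ! LAB-CS2
  interface Vlan192
    mac-address 0000.0192.0002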

At this point the layer-3 transit VLAN is removed from the peer link, added to a new dedicated non-vPC transit ISL, and given a statically assigned MAC address on the SVI. Once that happens, you should see OSPF move into full adjacency almost immediately. It’s like Dijkstra is saying, “Where have you been all my life?! Want to go out for a drink or something?”

If you need to do this in a production environment, be prepared for some service interruption. Changing the MAC address of the SVI causes the switch to relearn the MAC table for that VLAN, and when you prune a VLAN from one trunk and add it to another, there will be a brief outage while you type the commands and STP recalculates the layer-2 path. Running R-PVST (Rapid Per-VLAN Spanning Tree) limits the interruption, and you should only drop a couple of frames. A lot of moving parts are involved, which is why I suggested completing this work before production! I had to make this change on a couple of production switches a few weeks ago. With the likelihood of only a couple of lost frames, I believed it would be fine. Unfortunately, even though I dropped only one packet and OSPF formed up in the topology, I still hit an HSRP issue with my upstream device. It took about 15 minutes to address, which was certainly long enough to trip the polling alerts! Crow doesn’t taste good, even with a little salt.

Option Two – Layer-3 ISL Switch Link

But what if you have a WAN connection on one switch and a Metro connection on the other? What if both use /30s as their transit subnets? How do these attached devices exchange routes? The way this looks in the configuration is pretty straightforward, as shown below. As a side-bar, I like to build all three of these links when I stand up new Nexus boxes: a vPC peer link, a non-vPC layer-2 “transit” link, and a layer-3 “transit” link. Then, when I create a new SVI that requires a routing protocol, my script just adds the SVI to the proper ISL trunk and sets a static MAC address on the new SVI.
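A sketch of that routed transit link on LAB-CS1 (the interface, subnet, OSPF process tag, and area are assumptions; LAB-CS2 mirrors it with 10.255.255.2/30):

  feature ospf
  !
  router ospf 1
  !
  interface Ethernet1/46
    description L3 transit ISL to LAB-CS2
    no switchport
    ip address 10.255.255.1/30
    ip router ospf 1 area 0.0.0.0
    no shutdown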

With a routed link, I don’t have to jump through any of the other hoops to engineer the traffic flows. I carve out a /30 between the routed devices on each switch, then let the layer-3 “transit” ISL exchange and forward routes. If you have different point-to-point links carrying the same subnets, known by all of the switches between locations, it’s possible that traffic entering switch 1 and destined for switch 2 will travel through a different office and come back to the destination switch: an extended trombone network, if you will, based on OSPF’s shortest-path calculations. Think of it like this: from a remote office, traffic enters LAB-CS1 destined for another remote office connected to LAB-CS2. Without the layer-3 ISL between LAB-CS1 and LAB-CS2, that traffic flow has to go through LAB-CS3 and LAB-CS4 to get there. Adding the layer-3 ISL fixes the problem because the next hop is only one switch away.

Once this part of the configuration is complete, OSPF will have a full topology within your enterprise network.

Option Three – The Easy Solution

After all of this, some of you may be saying to yourselves, “There must be an easier way.” In fact, there is. Passing routing protocols across the vPC has been needed for quite some time. Arista Networks’ version of the Virtual Port Channel, known as MLAG (Multi-Chassis Link Aggregation), had this functionality almost from day one. My guess is you wouldn’t be reading this article if you had their equipment. If you want the easy solution, you need to install NX-OS version 7.0(3)I5(2) or later. This version of code includes a command, issued within the vPC domain, that enables routing across the peer link. No new cabling. No new port-channels. No new pruning. In fact, it makes you less dependent on the physical connectivity of your hardware. Simply issue the command layer3 peer-router inside the vPC domain on both vPC peers, and you’re off to the races! Starting from the faulty configuration described in the scenario, the fix looks like this:
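The change itself is two lines inside the vPC domain (the domain ID is an assumption; Cisco’s guidelines call for peer-gateway to be enabled along with it):

  ! On both vPC peers
  vpc domain 10
    peer-gateway
    layer3 peer-router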

After adding the command to the vPC domain, just like with the options above, OSPF nails up across the switches.


Conclusion

Whenever you’re building something new, it’s extremely important to read through the related vendor documentation. The vendor has normally done extensive testing, and there really isn’t a reason to reinvent the wheel, especially if you don’t know the limitations of a wheel with no hub or spokes. Unfortunately, most of us have grown up with the instructions-are-for-suckers mindset, and that really needs to change. Quickly. Networks and applications are too important and complex to integrate a half-cocked theory based on a Sybex CCNA book. Those books give you the broad understanding; you have to look at the vendor documents to get the specifics of the platform limitations you’re supporting. I hope this helped the community of networkers!

Brian Gleason is a full-time Lead Network Engineer for an Austin, TX company and is currently pursuing the Cisco Certified Internetwork Expert, Data Center certification. He also teaches firearms in his spare time after being a husband to his wonderful wife and father to his three awesome kids. Brian was also selected as a delegate to Network Field Day 20 held in San Jose, CA.
