Redesigning the Data Center – BGP

BGP Community Assignment

Austin – 11111:1
Los Angeles – 11111:2
Dublin – 11111:3
London – 11111:4

Configuration Files 

A couple of months ago I wrote an article describing a project that integrated EIGRP in an attempt to overcome OSPF shortcomings in a network topology. I never felt comfortable introducing a proprietary protocol, but I really thought I had no choice based on certain technical and business factors. I’ve had some time to revisit the design, and wouldn’t you know it, I was able to accomplish all the project goals using BGP. I have to thank a friend of mine, Brian Gannon, for seeing the network a little differently and making a simple suggestion that changed everything. Brian even put in double duty and prepared some configs to aid the process! I hadn’t settled on BGP originally because I couldn’t get rid of a routing loop in the core (again, that was the problem EIGRP fixed). Brian recommended the simple step of adding BGP communities and making traffic engineering decisions based on that value. Now we’re off to the races! There’s something great about the IT community! Most are interested in the success of others and are eager to help.

A quick reminder of the design criteria from the previous article. If we can’t meet these goals, the project’s success may be in doubt.

  • Improve network convergence time.
  • Improve Mean-Time-To-Repair (MTTR).
  • Limit use of static routes.
  • Additional funding for hardware is not available.
  • Reply traffic will follow the same path as the source request. Asymmetric routing will be avoided.
  • LAN traffic will follow local high-speed links before failing over to low-speed MPLS.
  • MPLS entry points will prefer local destination addresses (e.g. Austin, TX MPLS router will be primary for Austin subnets, London, UK MPLS router will be primary for London subnets, etc.)
  • Redundant MPLS entry points will be preferred by order of in-country networks followed by closest out of country, and furthest out of country (e.g. Los Angeles is secondary, Dublin is tertiary, and London is last resort for Austin subnets).
Redesigning with BGP Lab Topology

Let’s use the same network topology from the previous article. As a review, Austin and Los Angeles are connected via a 10Gbps metro-Ethernet circuit, Dublin and London via a 10Gbps metro-Ethernet circuit, and Austin to Dublin and Los Angeles to London via 1Gbps point-to-point Ethernet service. The idea is that one company has acquired another, and you need to introduce the circuits and routing so that static routes and OSPF (a big fat area 0.0.0.0) are replaced. While I didn’t mention it in the previous article, the configurations focus on Cisco Nexus/IOS, but they can definitely be modified for Arista, Juniper, or Cumulus/whitebox. It’s all syntax…mostly.

Core Routing

The BGP design in this scenario uses three main tools to engineer the traffic around the horn: BGP communities, AS path prepending, and local preference. If you’re new to BGP, you absolutely have to know how BGP selects the best route. Bookmark Cisco’s “BGP Best Path Selection Algorithm” and refer to it often until it’s memorized! Once you do that, print out this regex chart for reference, and read INE’s blog on BGP regular expressions.
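As a taste of why the regex reading matters, here’s a small illustration (the access-list number, route-map name, and AS number are hypothetical, not from the original configs): an as-path access-list using the regex `_64512$` matches only routes originated in AS 64512, and a route-map can then act on that match.

```
! Hypothetical sketch: match only routes originated in AS 64512
! (_ anchors the AS boundary, $ anchors the end of the path)
ip as-path access-list 10 permit _64512$
!
route-map FROM-64512 permit 10
 match as-path 10
```

The same regex syntax shows up everywhere in BGP policy work, so the chart pays for itself quickly.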

When you get into BGP traffic engineering, there is no way around route-maps, prefix-lists, and regular expressions. You have to recognize that to control outbound path selection, you have to control inbound routes and vice versa: where your traffic exits is determined by the prefixes you accept inbound, and where it enters is determined by what you advertise outbound.

We’re going to use eBGP with communities to make more granular path selection. We’ll make our communities regional (see the sidebar) and the cores will attach the number to all outbound advertisements. Let’s first look at how the communities are applied.

These are your community definitions, and they will be called by the route-maps we’ll see in just a moment. One thing that may bite you here: when you write your route-maps to match a community list, the match statement actually calls a named list, not the value in the BGP update/announcement. For example, if you tell your route-map to match community-list 65100:1 and expect that it will match the AS:nn value, you won’t see the desired result; the route-map will be looking for a named list called “65100:1” to match against. In the below code, your route-map will need to match a named list containing the 11111:nn value to move traffic like you want.
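Since the original snippet isn’t reproduced in this copy, here’s a minimal sketch of what those definitions might look like (IOS-style syntax; the list names are hypothetical, and the 11111:n values follow the regional community scheme used in this design):

```
! Named community lists -- route-maps match the NAME, not the raw value
ip community-list standard ATX-ROUTES permit 11111:1
ip community-list standard LAX-ROUTES permit 11111:2
ip community-list standard DUB-ROUTES permit 11111:3
ip community-list standard LON-ROUTES permit 11111:4
!
! Correct usage: the match references the named list
route-map COMMUNITY-POL permit 10
 match community LAX-ROUTES
```

On NX-OS the same idea applies; the key point is that `match community` takes the list name, never the AS:nn value itself.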

The route-maps are where you set your desired policy to engineer traffic. Anyone can stand up BGP and exchange routes, but it takes a special breed to make that traffic flow the proper direction! In this snippet I named the route-map COMMUNITY-POL so everyone reading the configuration knows this “policy” is based on the “community” advertisement received or transmitted, depending on the route-map direction. This community policy is created on the ATX-CS1 switch and matches *:2 to prepend a longer AS path.

I’d very much recommend using the same policy names for all your route-maps, community-lists, and prefix-lists on all devices. Use a descriptive name so that as you move into automation and orchestration technologies you’ll already have a standard, and support technicians won’t have to decipher route-map application. If you redistribute OSPF to BGP, name your route-map and associated prefix-lists “OSPF-TO-BGP”, for example. What you see in the COMMUNITY-POL route-map is the policy applied to the advertisement based on the BGP community. In this case, the community 11111:2 will have its AS path prepended with 64512 64512; depending on the BGP prefixes, this may make those routes less desirable for the upstream device. The other matches allow that traffic to be installed in the RIB. You have to think about how your route-maps should be applied to the BGP neighbors on your switch or router so the upstream device knows the preferred path.
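The original ATX-CS1 snippet isn’t preserved here, but based on the description (match the 11111:2 community, prepend 64512 64512), it would look roughly like this sketch; the community-list name, neighbor address, and remote AS are hypothetical placeholders:

```
ip community-list standard LAX-ROUTES permit 11111:2
!
route-map COMMUNITY-POL permit 10
 match community LAX-ROUTES
 set as-path prepend 64512 64512
!
! A trailing permit lets the remaining prefixes into the RIB unmodified
route-map COMMUNITY-POL permit 20
!
router bgp 64512
 neighbor 192.0.2.2 remote-as 64514
 neighbor 192.0.2.2 route-map COMMUNITY-POL out
```

On NX-OS the `route-map … out` statement lives under the neighbor’s `address-family ipv4 unicast` stanza rather than directly on the neighbor line.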

What should be apparent is that the route-map COMMUNITY-POL is applied to the transatlantic interface. Why? If you think about sources in Dublin connecting to Los Angeles, the prepend will force that SYN traffic through London before crossing the transatlantic link. With this applied policy we can keep the SYN initially moving north-south before flowing east-west. It may not matter in your environment, but it was a simple design consideration for this write up.

The southern Los Angeles to London leg changes only slightly. Remember that in BGP, to control egress traffic you have to control the ingress advertisement (AS prepending), or use one of two other options: set a weight (Cisco proprietary) or modify the local-preference value. What’s the difference? The weight attribute is only considered within the Cisco BGP stack and is locally significant to the router. The local-preference value will be considered by all routers in the same AS (iBGP peers) and is not proprietary. Just a comment, but if you’re using BGP, don’t lock yourself into a limiting value like weight unless you absolutely have to.

On the southern leg we’ll leverage the local-preference value to fix the asymmetric routing issue. Remember the requirement that the reply follow the same path as the SYN traffic. Given the AS-path prepending coupled with the local-preference setting, we can still take the shortest paths to the destination addresses based on the current network topology.

Notice that on the Los Angeles southern leg the COMMUNITY-POL is applied outbound to the London switch and the LOCAL-PREF is attached to the inbound side from Austin. It looks for the community value and sets the higher preference to return the traffic. I can’t caution you enough: remember the BGP path selection criteria; this particular configuration method may not work for your environment, but it does for mine! The old “works for me” adage!
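A sketch of what that LOCAL-PREF policy could look like on the Los Angeles switch (the default local-preference is 100, so 200 wins; the community-list name, neighbor address, and AS numbers here are hypothetical):

```
ip community-list standard ATX-ROUTES permit 11111:1
!
route-map LOCAL-PREF permit 10
 match community ATX-ROUTES
 set local-preference 200
route-map LOCAL-PREF permit 20
!
router bgp 64513
 neighbor 192.0.2.9 remote-as 64512
 neighbor 192.0.2.9 route-map LOCAL-PREF in
```

Because local-preference is evaluated before AS path length in the best-path algorithm, the matched routes win regardless of path length, which is exactly what pins the return traffic to the desired leg.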

I won’t go too far with the MPLS devices and how they announce the routes. It should be pretty clear that they match the community and then prepend the AS path in the proper order. If remote sites need resources in Austin, they should hit Austin, Los Angeles, Dublin, and London in that order. Setting the AS path on the announcement creates the proper path selection based on rule 4 of the selection criteria. The snippet below shows the community-list and the policy applied to the announcements.
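Since that snippet isn’t preserved in this copy, here is a hedged reconstruction of the idea on the Los Angeles MPLS router: Austin-tagged routes get one prepend so Austin’s own entry point stays first, and the Dublin and London routers would use two and three prepends respectively (names, addresses, and AS numbers hypothetical):

```
ip community-list standard ATX-ROUTES permit 11111:1
!
route-map MPLS-ANNOUNCE permit 10
 match community ATX-ROUTES
 set as-path prepend 64513
route-map MPLS-ANNOUNCE permit 20
!
router bgp 64513
 neighbor 192.0.2.17 remote-as 65000
 neighbor 192.0.2.17 route-map MPLS-ANNOUNCE out
```

Each additional prepend pushes that entry point one step down the preference order at the remote sites, which is what builds the Austin → Los Angeles → Dublin → London ordering.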

Verify It

After all is said and done, when you look at the traffic flow between the “corners” of the network you’ll see that it moves exactly like you want and is predictable. The SYN follows the proper AS path and the reply from the south follows the local-preference. If you’ve struggled with fixing your architecture, honestly, it’s pretty much the best feeling to see the flows move as expected!

In the output above you’ll see that Austin to London (LON) selects the best path through Los Angeles based on the AS path length. It’s pretty short. The best path from London is chosen through Los Angeles based on the local-preference value even though the two path lengths are equal. This does mean that for London to reach Austin, traffic will always flow through Los Angeles unless there is a link failure. If this were a production environment you’d likely see high latency on that choice. The easy option would be to apply the local-preference on the best interface.

What about the MPLS network that you have? Again, matching the community for as-path prepending makes it all better. One issue is that there are some BGP selection changes if you’re using AIGP or a few best-path configuration options. I’d strongly recommend reading the links covering MED and AIGP from the BGP selection criteria document and verify against your design goals.

In this topology the MPLS core has a few options to route to the Austin and Los Angeles subnets. The remote end will forward to the MPLS core because it has a single connection. The MPLS device will then forward to the destination based on the shortest AS path for ingress to the data centers. In the view of the remote office, the direct path to Austin (10.0.2.1) is the best followed by Los Angeles (10.0.2.5) and Dublin (10.0.2.13).
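The captured output didn’t survive in this copy, so here is a mocked-up, annotated illustration of what the remote office’s table might show (the prefix, AS numbers, and formatting are illustrative; only the next-hop ordering comes from the lab description):

```
remote-rtr# show ip bgp 172.16.10.0
BGP routing table entry for 172.16.10.0/24
Paths: (3 available, best #1)
  64512
    10.0.2.1 from 10.0.2.1        ! Austin -- shortest AS path, best
  65000 64513 64512
    10.0.2.5 from 10.0.2.5        ! Los Angeles -- one prepend longer
  65000 64514 64514 64512
    10.0.2.13 from 10.0.2.13      ! Dublin -- longer still
```

The ordering is driven entirely by AS path length (rule 4 of the selection criteria), which is why the prepend policy on the MPLS routers does all the work here.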

Let’s Talk Convergence

We’ve actually hit all the requirements for this project, but I’ve not mentioned convergence time. If you accept the default BGP timers, you could be waiting upwards of 180 seconds for route convergence after a failure, especially at the MPLS remote sites that can’t detect a failure outside their own interface. If your network is large with a lot of links in a mesh, that time could be much longer. Bidirectional Forwarding Detection (BFD) will speed the reconvergence time for supported routing protocols, including BGP. When you’re working with BGP in the data center, BFD is critical to your deployment. Within the metro-E or MPLS connections of an enterprise, it’s not “required” per se, but certainly has the potential to make you a hero! Indeed, there’s no reason not to run it as a default in your routing protocol build.
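The BFD configuration itself didn’t survive in this copy, so here is a sketch of the equivalent NX-OS commands (interface and neighbor address are hypothetical) using the timers discussed below: a 300ms interval, 250ms minimum receive time, and a multiplier of 3.

```
feature bfd
!
interface Ethernet1/1
 bfd interval 300 min_rx 250 multiplier 3
 no ip redirects          ! NX-OS requires redirects disabled on BFD interfaces
!
router bgp 64512
 neighbor 192.0.2.2
  bfd
```

The detection time is the negotiated interval times the multiplier, so 3 × 300ms gives the 900ms failure detection described below; on classic IOS the neighbor command is `neighbor 192.0.2.2 fall-over bfd` instead.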

In the BFD configuration above, the update interval is 300ms, the minimum receive time is 250ms, and the multiplier is 3, so a neighbor is declared down after three missed intervals. That works out to 900ms of downtime before routes are converged. This may or may not be acceptable for your particular deployment, but suffice it to say, most TCP-based applications will recover without issue. Mileage may vary, so do your homework for your environment and adjust the timers appropriately.

Conclusion

Honestly, I’m glad I had another engineer that could say, “Wait, what if…?” The technical community is incredibly important and it doesn’t matter how long you’ve been in this business, or what you think you know. Another set of eyes is a good thing especially when that other set of eyes is interested in project success. All egos checked at the door.

This design has some great benefits. The company isn’t locked into a specific vendor, can add new technologies without wholesale changes in network architecture, and gains scalability not available in other routing protocols. As a network engineer, the exciting part of this design is the stability it will bring and all the other technologies you can now support: things like micro-segmentation, VXLAN, NSX, and EVPN. You’ve helped the business keep printing money and gained some skills that many enterprise engineers lack.

High-fives!

Brian Gleason is a full-time Lead Network Engineer for an Austin, TX company and is currently pursuing the Cisco Certified Internetwork Expert, Data Center certification. He also teaches firearms in his spare time after being a husband to his wonderful wife and father to his three awesome kids. Brian was selected as a delegate to Network Field Day 20 held in San Jose, CA.