I’m a big fan of planning. I live by the clock and the calendar, so a good plan before I start the day does wonders for my sense of accomplishment. Whether you’re planning a trip to the grocery store or a global network design, the benefits are enormous!
I learned this habit years ago, way back in the days of Exchange 5.0, when I was a wet-behind-the-ears systems administrator. I remember running a project at Cannondale to upgrade from Exchange 5.0 to 5.5. If any of you have been in that position, you’ll remember the struggle! We’ve probably discussed it in therapy. It was about an hour before go time, and in my youthful ignorance I expected it’d be next-next-finish. When my boss came in and said, “Hey, think we can keep one of these servers up for the sales folks and have everyone else use the new box?” I eagerly agreed. #BiggestITMistakes.
The fallout was immense. Not only had I not checked the backup tapes before I started, but the plan had to change at the last minute. As a result, half of the company lost email, and it was unrecoverable. From the CEO to the Marketing Director to the bicycle engineers, everyone lost mail, and I knew I would be fired.
That was back in the year 2000, and thankfully all I took was some good-natured ribbing. For months, Joe Montgomery (the CEO) would ask me if I’d “lost anything lately.” I’m now more confident in my technical ability and my standing within an organization, and I have no problem telling someone in authority that I’m not interested in deviating from the plan. If there’s a reason, then cancel the change window and we’ll discuss it. I’m not reliving that old experience. Heck, even my wife will slide in the occasional comment with a wistful smile.
These days I plan right down to the CLI changes. Whatever the project is, diagrams get updated, backup configs are taken, before/after comparisons are gathered, and a schedule of events is written down. I’ll take the time to talk through even simple changes with my teammates so everyone knows what is about to happen. I work in a global organization, and I don’t want my buddy in the UK to walk into a disaster unexpectedly. Planning on the fly, during execution, is the prime ingredient in calamity soup.
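To give you a sense of what “before/after comparisons” means in practice, here’s a minimal sketch of the kind of snapshot I’d capture on a Cisco IOS device before a change. Log the terminal session to a file, re-run the same commands after the change, and diff the two:

```
! Pre-change snapshot: run with session logging enabled,
! then repeat post-change and diff the output files.
terminal length 0
show running-config
show ip interface brief
show ip route summary
show ip ospf neighbor
show ip bgp summary
```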
Here’s an example of where this worked well. Last week our “diverse” circuits to the UK were cut due to a car crash on Long Island, NY. Apparently, and this is no joke, both of our circuits (Austin to Edinburgh and San Antonio to London) landed on the same telephone pole. We’re still in the middle of our own network transformation, and all of the routes between the countries are learned via static routes redistributed into the campus OSPF environment. Suddenly the company’s national regions were dark. KABLOOEY!
To recover connectivity, there were a few things I had to do. Since we’re in the middle of overhauling our routing infrastructure, we have static routes galore! At the MPLS ingress/egress to our data centers, we advertise specific routes and then redistribute them from BGP into OSPF. To move data between the data centers in the US and UK, static routes are redistributed into OSPF, and we’re only able to use one circuit at a time (for various reasons I won’t discuss).
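To make that concrete, here’s a minimal sketch of that kind of setup on a Cisco IOS box. The AS number, OSPF process ID, prefixes, next hop, and route-map name are all invented for illustration; the point is the pattern, not our actual config:

```
! Static route toward the far-side data center, pinned to the one
! usable circuit (hypothetical prefix and next hop).
ip route 10.20.0.0 255.255.0.0 192.0.2.1

! Only the data center prefixes are allowed from BGP into OSPF.
ip prefix-list DC-PREFIXES seq 5 permit 10.20.0.0/16 le 24

route-map BGP-TO-OSPF permit 10
 match ip address prefix-list DC-PREFIXES

router ospf 100
 ! Both the statics and the filtered MPLS-learned BGP routes
 ! leak into the campus OSPF environment.
 redistribute static subnets
 redistribute bgp 65001 subnets route-map BGP-TO-OSPF
```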
I quickly removed the static routes from the core and verified our MPLS redistribution was working properly. I also removed the route-map from the MPLS routers and just redistributed everything; administrative distance between BGP and OSPF would let the IGP win route selection. Approximately 90% of the network’s routes were recovered within 90 minutes. The rest were taken care of the next day as the user barometer came in screaming. We then waited out the weekend for the local utility company on Long Island to repair the pole(s) and for our telcos to repair their fiber. In the meantime, I began the plan to roll back to our Ethernet circuits…well, that and fighting daily performance fires. In case you didn’t realize, MPLS isn’t the lowest-latency transport for WAN connectivity.
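Continuing the hypothetical config from above, the emergency change amounted to something like this (again, placeholder values, not our production numbers):

```
configure terminal
 ! 1. Pull the static route from the core so it stops masking the
 !    dynamically learned paths (a static's AD of 1 beats everything).
 no ip route 10.20.0.0 255.255.0.0 192.0.2.1
 ! 2. On the MPLS edge, drop the route-map filter and redistribute
 !    everything BGP has learned. In IOS the clean way is to remove
 !    the whole redistribute statement and re-add it without the map.
 router ospf 100
  no redistribute bgp 65001
  redistribute bgp 65001 subnets
end

! 3. Confirm the dynamically learned route has taken over.
show ip route 10.20.0.0 255.255.0.0
```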
There were a couple of things I did to make my plan to revert to the primary circuits more successful. Hopefully this will help any of you who find yourselves in a similar situation.
- Make a backup of your configurations before you need them. During this recovery, I used the last known good configuration for my environment and compared it to what was in production. Remember, in a disaster, things will change because something in the plan is always missed. If you’re not writing down what you’re doing to repair the damage in the short term, it’ll be extremely difficult to get back to a normal state. Therefore, a clean and recent backup is critical (see the archive sketch after this list).
- For Cisco shops (and some others), write the config ahead of time so you can copy/paste it when you execute your plan. If you script what you’re about to do, execution is quick, and any issues that arise leave you more of your change window for troubleshooting. It takes pressure off and has the added benefit of making you a hero when you close your change 45 minutes early. There’s a sample change script after this list.
- It’s much easier to think about changes when you’re not under the gun to bring up connectivity. Junior engineers will often smash the keyboard for the quick fix, but that causes future issues. If you take the time to consider the ramifications of your actions, you may be able to bring up connectivity without needing a future outage to fix what you did.
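On the backup point: if you’re on Cisco IOS, the built-in archive feature can take the snapshots for you. A minimal sketch, assuming a reachable TFTP server (the address and path here are made up):

```
! Snapshot the running config to a TFTP server on every
! 'write memory' and once a day (1440 minutes). $h and $t
! expand to the hostname and a timestamp.
archive
 path tftp://192.0.2.10/backups/$h-$t
 write-memory
 time-period 1440
```

When it’s time to compare, `show archive config differences` will diff a saved snapshot against what’s running in production.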
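And on the copy/paste point, here’s the shape of a change script I’d write, with hypothetical prefixes and next hops standing in for the real ones. The rollback section is the part that saves you at 2 a.m.:

```
! --- CHANGE: move traffic back to the primary circuit ---
configure terminal
 no ip route 10.20.0.0 255.255.0.0 192.0.2.1
 ip route 10.20.0.0 255.255.0.0 198.51.100.1
end

! --- VERIFY ---
show ip route 10.20.0.0 255.255.0.0
ping 10.20.1.1 source Loopback0

! --- ROLLBACK: paste only if verification fails ---
configure terminal
 no ip route 10.20.0.0 255.255.0.0 198.51.100.1
 ip route 10.20.0.0 255.255.0.0 192.0.2.1
end
```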
Just to finish the story from last week’s outage: my execution script was well written because I worked hard at making it correct. I had traffic moved from the trans-Atlantic MPLS to the 1 Gbps trans-Atlantic DCIs in about 20 minutes without dropping a packet. I then took the rest of my change window to verify connectivity to my remote offices, collect that data, and communicate to colleagues exactly what was done. It helped those in the other time zone know what they were getting into.
It’s difficult to remember, and it sucks, but for network and infrastructure engineers there is no praise when things go correctly. All of your users expect email and dial tone on a Monday morning. If that doesn’t happen, you’re the goat. If it does, you’re just doing your job, no matter what happened behind the scenes. The real win is that you don’t walk into a firestorm and the on-call alerts stay quiet.
A good plan with solid execution is one of the best things you can do for your career and sanity.