Don’t Trust, and Definitely Verify

This weekend I decided to update the firmware on some Dell Layer 3 switches (PowerConnect 6248).  I only did it at one site as a test before updating the firmware on our production switches.  Before touching those, I updated the firmware on a spare switch so I’d be prepared in case one of the switches didn’t upgrade properly (it’s happened a couple of times before).  That went fine.  I came in on Saturday afternoon, updated the stack, and everything seemed to go smoothly (tip: make sure to turn off HA on your ESXi clusters before you do a network update).  After the update I tested internal email, external email, internal networking, internet access, and so on.
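
On the HA tip: if you’d rather script that step than click through the vSphere Client, something along the lines of the pyVmomi sketch below should do it. This is only an illustration, not what we run here; the vCenter hostname, credentials, and certificate handling are placeholders, and you’d run it again with enabled=True once the maintenance window is over.

    # Rough sketch: disable vSphere HA on every cluster before switch
    # maintenance; run again with enabled=True afterwards. The host, user,
    # and password below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    def set_ha(cluster, enabled):
        """Reconfigure a cluster's HA setting and wait for the task to finish."""
        spec = vim.cluster.ConfigSpecEx(
            dasConfig=vim.cluster.DasConfigInfo(enabled=enabled))
        WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))

    ctx = ssl._create_unverified_context()  # lab shortcut; verify certs for real
    si = SmartConnect(host="vcenter.example.local", user="admin",
                      pwd="changeme", sslContext=ctx)
    try:
        # Assumes datacenters sit directly under the root folder.
        for dc in si.RetrieveContent().rootFolder.childEntity:
            for entity in dc.hostFolder.childEntity:
                if isinstance(entity, vim.ClusterComputeResource):
                    print(f"Disabling HA on {entity.name}")
                    set_ha(entity, enabled=False)
    finally:
        Disconnect(si)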

The one thing I didn’t test was the direct connection to our production site.  Lo and behold, I got a call at 11pm and found out that our support team was unable to connect to the production site.  The production site itself was up, and everyone on the outside could get to it (I had made sure to test the bread and butter of the operation…no SLA issues), but I had made the mistake of not testing the direct connection.  The configs were exactly the same as before the update.  I spent two hours on the phone with Dell and they said everything looked good; they had no idea what was causing the issue and insisted it was a routing problem.

At 1am the support person I was speaking with let me know that the night shift was basically a break/fix team and wouldn’t be able to help me any further.  He would be sure to put it in the queue for the day shift, who are more versed in how firmware updates can cause issues.  The problem needed to be solved immediately, though.  At 1:30am I met my boss at work (a good half-hour drive for both of us).  We immediately saw that the port used for our direct connection was down; for some reason it was the only port that was down.  We did some troubleshooting and realized the config for that particular port was wrong.  Okay, great…we could just change the config.  I changed it to what it should be, applied the changes, and all of a sudden the whole stack reloaded and set the port back to the incorrect settings.  Maybe it was a fluke, so I tried again, and again the whole stack reloaded.

A few minutes later, I rolled back the update and everything lit up (in a good way); all connections were working properly.  Of course, I regretted not just driving in at 11pm and doing exactly that, but who would have thought the update would leave every single thing working except that one connection?

My point…make sure you test every possible scenario.  Make sure that there will be support available, if possible, before you do such an upgrade (although it wasn’t even a full version upgrade).  Make sure your boss will still like you after you’ve made him come into the office at 2am.  And, perhaps most important, when you are purchasing equipment, make sure someone will be able to support you when you run into buggy software on a weekend or at night…you know, the times when you would actually be doing network upgrades.
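
And on the testing point, it doesn’t take much to script the checklist.  The sketch below is only an illustration (every hostname and port in it is a placeholder, not our real environment), but the idea is that the direct connection you never think about gets a line in the list right next to email and internet, so it can’t be skipped at 11pm.

    # Sketch of a post-change connectivity checklist; all hosts/ports are
    # placeholders. Run it after every change and eyeball the FAILED lines.
    import socket

    CHECKS = [
        ("internal mail", "mail.corp.example", 25),
        ("external mail relay", "smtp.example.net", 25),
        ("internet", "www.google.com", 443),
        ("direct connection to production site", "prod-gw.example", 22),
    ]

    def tcp_check(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for name, host, port in CHECKS:
        result = "OK" if tcp_check(host, port) else "FAILED"
        print(f"{name:40s} {host}:{port:<5} {result}")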


  1. #1 by ITguy on October 19, 2011 - 12:11 am

    Agree with you 100%.  The skeleton crew is useless, and unfortunately off-hours are really the only times one would do such updates.  The 6248 switches are extremely buggy, and every time I install an update to resolve a known issue with a documented resolution, something else breaks.

    Too bad, because they definitely scream performance-wise when they’re up and running.
