We reported last week that a monitoring tool was responsible for last week’s blackout in the Southwest, but as it turns out, the initial report wasn’t entirely accurate: the blackout happened when utility company employees took a piece of equipment called a series capacitor offline.
The LA Times reported that officials can normally take down a series capacitor, which is the size of a small car, and have backup systems take over, but for some reason the backup systems failed. Officials are still trying to learn why.
The article reports that the system actually worked as planned for about 10 minutes, at which point power started to go out, first in Yuma, AZ, the site of the substation where the work was being done, and then very quickly all the way to San Diego.
Interestingly, the article goes on to say the US grid has five nines (99.999 percent) reliability, but as we wrote the other day in How Much Down Time is Acceptable?, that’s of little solace to the folks who are without power, and we’ve heard many reports of such incidents over the years. And if you have power, you don’t really think about it.
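To put five nines in perspective, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not how the grid's figure was measured; the function name `allowed_downtime_minutes` and the 365-day year are assumptions made for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year implied by an availability figure."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

# Compare three, four, and five nines side by side.
for pct in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999 percent, that works out to a little over five minutes a year. It is an average, though, so it says nothing about how that downtime is spread out or who bears it.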
There are lessons for monitoring professionals to take away from an incident like a blackout. This network, although somewhat different from the computer variety, has a lot of the same systems in place for monitoring and failover.
Many companies, in fact, rely on high availability (HA) systems to keep their mission-critical systems up and running, even when something catastrophic happens. First of all, this incident shows that having HA systems in place doesn’t mean they will always behave as planned or that you can rely on them completely.
Secondly, you can see how one move, such as taking down a server (or in the case of the blackout, a series capacitor), can have unintended consequences when systems don’t work as planned.
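One way to act on this lesson is to confirm that a backup is healthy before taking the primary down. The following is a minimal sketch, not any real HA tool's API: `Node`, `take_offline`, and the `healthy` flag are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    """Hypothetical stand-in for a server (or any component) that has backups."""
    name: str
    healthy: bool

def take_offline(primary: Node, backups: List[Node]) -> List[str]:
    """Take the primary offline only if at least one backup reports healthy.

    Returns the names of the healthy backups; raises RuntimeError otherwise.
    """
    ready = [b.name for b in backups if b.healthy]
    if not ready:
        raise RuntimeError(
            f"refusing to take {primary.name} offline: no healthy backup"
        )
    primary.healthy = False  # the primary is now down for maintenance
    return ready
```

A check like this only tells you what the backup reports about itself, which is why it is worth exercising failover for real from time to time.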
This incident also shows that no matter how carefully you monitor your systems and no matter how many contingency plans you put in place, you can still encounter situations that spiral out of your control.
In this case, utility officials in Arizona thought they were doing routine maintenance, and it turned out they put millions of people in the dark. Even today, they are struggling to figure out how it happened and how their systems failed.
This doesn’t mean you shouldn’t plan carefully and try to put backup systems in place. Of course you should and must, but know that sometimes, no matter how careful you are, things will happen that you didn’t expect, and that means you have to be ready for anything.