On the Northeast Blackout

After a month of inactivity, it is only fitting that this site comes back online after the biggest power failure in North American history. Actually, in terms of the total number of megawatts, I imagine this must have been the biggest power failure in the history of the world. Over the past 24 hours, midtown Manhattan – where all power was restored only at 8pm on Friday night – began to resemble a scene from 28 Days Later. I was there the day before, at 4:15pm, when the lights went out. Instantly, word spread that the UN – only a few blocks away – had also lost power. This made people understandably uneasy. Some mobile phones were working while others were not. Those with working phones started sharing the news that the entire city, as well as neighboring states, had lost power too. Since it was daytime, there was no sense of panic, although it was around then that traffic seemed to stop moving at the Midtown Tunnel. As darkness fell, people congregated at friends' apartments, took walks with flashlights, and seemed to make the most of it.

The next day, I noticed people starting to buy up the unrefrigerated inventory of local delis. Power had been restored north of 40th Street, and there was a 30-minute lineup at the McDonald's that extended well beyond the restaurant entrance. Because all power in the city was restored before darkness fell again, the uneasiness people were beginning to feel started to dissipate as everyone realized the situation had gone from a potential crisis to a mere inconvenience.

A High-Profile Software Failure
It is remarkable that even now authorities seem to have no idea how this could have happened. Explanations in the media have been very simplistic – that the failure of one station instantly offloaded demand to another station which was unable to compensate, so it too failed, and the sequence was repeated across the Northeastern United States, Ontario and Quebec. You hear that story and think "Gee, why didn't anyone think of that?" Of course, they did, and electric companies have a century of experience distributing power loads amongst disparate facilities very successfully.
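To make the simplified media account concrete, here is a toy sketch of that story in Python. Everything in it is an assumption made for illustration: the function name simulate_cascade, the capacity and load figures, and above all the rule that a failed station's load is split evenly among the survivors. Real grids follow power-flow physics and protective relays, not an even split, so this models the newspaper version of events and nothing more.

```python
def simulate_cascade(capacities, loads, first_failure):
    """Toy cascade: a failed station's load is split evenly among survivors.

    capacities, loads: per-station numbers in arbitrary units (hypothetical).
    first_failure: index of the station that goes down first.
    Returns the station indices in the order they failed.
    """
    loads = list(loads)                  # copy so the caller's list is untouched
    alive = [True] * len(capacities)
    failed_order = []
    pending = [first_failure]            # stations waiting to be shut down

    while pending:
        station = pending.pop(0)
        if not alive[station]:
            continue
        alive[station] = False
        failed_order.append(station)

        survivors = [i for i, up in enumerate(alive) if up]
        if not survivors:
            break                        # nobody left to pick up the load

        # Assumption: demand is offloaded evenly to every surviving station.
        share = loads[station] / len(survivors)
        loads[station] = 0.0
        for i in survivors:
            loads[i] += share

        # Any survivor now over its capacity is queued to fail as well.
        overloaded = [i for i in survivors
                      if loads[i] > capacities[i] and i not in pending]
        pending.extend(overloaded)

    return failed_order


# Three stations running at 80% of capacity: losing one overloads the rest.
tight = simulate_cascade([100, 100, 100], [80, 80, 80], first_failure=0)

# The same demand with generous headroom: the failure stays contained.
roomy = simulate_cascade([200, 200, 200], [80, 80, 80], first_failure=0)
```

In the first call every station ends up failing; in the second only the original one does. The difference is nothing but spare capacity, which is exactly the kind of margin operators have planned around for decades, and why the simple story cannot be the whole explanation.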

Many are speculating on whether this could have been the work of a malicious hacker or terrorist group. While such a theory drastically overestimates the hacking capabilities of terrorists as well as the maliciousness of hackers, it isn't impossible. But it seems unlikely, considering the number of systems to which such a group would need undetected access and the sophistication of the hypothetical exploit. If this is a hack, insiders with very high levels of access would need to be involved. In the current political climate, that seems very unlikely, and it seems unlikelier still that anyone would go to the trouble, succeed, and then not claim responsibility immediately.

So what went wrong? While the original cause of one plant going down could have been anything, the successive cascading blackouts point directly to a computer software failure. The particular timing and manner of the initial fault created a set of circumstances that the software designers had not anticipated. Safety is the paramount concern at power plants – especially nuclear facilities – so an automatic shutdown can be triggered by anything the software identifies as an unsafe scenario. The trick is to correctly identify every problem that does not require a shutdown and to implement a solution instantaneously.

The power grid failure highlights not only the frailty of the national infrastructure, but also the fact that we're at the mercy of our computer software for sustaining our way of life. Movies often present science fiction scenarios of machines running amok and turning against humankind. In the real world, while the software controlling our essential services could not (really) act maliciously, its failure can sometimes have the same effect as an AI robot trying to harm us.

Just as each generation of software is more powerful than the last, it is also more complex. The number of unknown scenarios for which the systems are untested will only increase. At the same time, knowledge of the fundamental vulnerabilities of computer systems is the only real defense against both malicious attacks and unintentional failures. I only hope that the government does not create a bill requiring the software industry to take on financial responsibility for the effects of its products failing. Software, and software development, are getting better. I think a high-profile failure like this will only serve to reinforce the importance of well-designed and thoroughly tested software in mission-critical systems.

© Creative Commons License 02/8/2010 14:16:30