American Airlines (AA) lost tens of millions of dollars in one day recently. Still have trouble communicating to decision-makers that IT (information technology) is a serious matter? Remind them of what happened to AA.
It’s not just AA, of course. Technical choices and management that sound like matters purely for specialists end up affecting hundreds of millions of government documents, over a hundred thousand customers of the Royal Bank of Scotland, or millions of users of tech powerhouse Google.
AA’s episode on 16 April 2013 resulted in cancellation of what was variously reported as between 978 and 1950 flights, representing apparent receipts of at least $20 million, based on a daily run-rate at the operating company in excess of $50 million.
Where did all that loss (conceivably over $100 million in total, once all impacts are summed) originate? We might never know. AA originally blamed the independent Sabre computerized reservation system. A few hours later, AA apologized for what it admitted was an error, and Sabre issued a press release stating that it was “operating as normal” on its side. Within days, AA Chief Executive Officer Thomas Horton identified the problem as a network outage that was understood internally and “… will not recur”.
It’s hard to confirm details from the outside. AA operations have been strained in recent years, from anecdotal drama that interferes with flight-crew co-operation to more systemic difficulties in employee relations that lead to poor on-time performance. Clint Boulton almost certainly has it right when he emphasizes that real-world software quality is not about one application showing the right colors for its display; the more important point is that enterprise-class programs inevitably co-ordinate and interact with many different systems. A small error response to a Sabre request, if mishandled, easily propagates to the extent that numerous physical components (airplanes, luggage, crews, and more) are all effectively lost or crippled.
This is why “Real User Monitoring” consistently heralds the importance of automated application performance management (APM) and allied technologies. Although an application might begin its life in a well-defined and well-understood role, if it’s successful and useful it inevitably grows beyond the point that any one expert or even team can keep up with all its connections and responsibilities. Only automated monitoring techniques can cope with and mitigate the cascades of errors that, in cases such as AA’s, disrupt thousands of lives and dissipate millions of dollars. Even at companies smaller than AA, unplanned outages cost thousands of dollars every minute. Too often, “disaster recovery” at best accounts for only “static” assets and doesn’t properly account for all the systemic relations on which mission-critical workflows rely. While APM itself isn’t a perfected service, it’s the best handle IT has on management of crucial operations.
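To make the idea of automated monitoring a little more concrete, here is a minimal sketch in Python. It is not how AA, Sabre, or any APM product works; the class name `DependencyMonitor`, the window size, and the error-rate threshold are all assumptions chosen for illustration. It shows only the core notion: record the outcome of each call to an upstream system and flag that system once recent failures pass a threshold, before the errors cascade further.

```python
from collections import deque


class DependencyMonitor:
    """Hypothetical monitor: tracks recent call outcomes for one upstream system."""

    def __init__(self, name, window=20, max_error_rate=0.25):
        self.name = name
        self.max_error_rate = max_error_rate
        # Keep only the most recent `window` outcomes (True = success).
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        """Record whether one call to the dependency succeeded."""
        self.outcomes.append(bool(ok))

    def error_rate(self):
        """Fraction of failed calls within the current window."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def is_unhealthy(self):
        """True once the recent error rate exceeds the threshold."""
        return self.error_rate() > self.max_error_rate


def unhealthy_dependencies(monitors):
    """Return the names of all dependencies currently over their threshold."""
    return [m.name for m in monitors if m.is_unhealthy()]


if __name__ == "__main__":
    # Illustrative values only: two failures out of four recent calls.
    reservations = DependencyMonitor("reservations", window=4, max_error_rate=0.25)
    for ok in (True, False, False, True):
        reservations.record(ok)
    print(unhealthy_dependencies([reservations]))
```

A real deployment would feed such a check from actual request outcomes across every connected system and route the result to an alerting channel; the point of the sketch is simply that the watching is done by software rather than by an expert who happens to remember each connection.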