As a typical red-blooded, sports-loving American male, I proudly attest to my obsession with the NFL and, with it, fantasy football. Like millions of others, I satisfy my fantasy needs through Yahoo’s Fantasy Sports platform. Much to our collective chagrin, a few weeks back Yahoo’s service unexpectedly crashed just as games were kicking off. This massive failure not only frustrated users but also cost some of them money (gambling). Here is the email I received from Yahoo in response to the incident:
At Yahoo!, we have giant machines called “filers” that process a lot of the real-time data and stats for us and for you. We do millions of calculations every hour for our games, and normally our machines can handle this with no problem. Recently, we discovered a hardware issue in one of the filers that caused the other one to overload. We replaced some hardware, re-configured the setup, and did some testing. However this Sunday – at approximately 12:15 p.m. Eastern – the new configuration failed. This created an overload on storage capacity and took the Fantasy part of our site down.
We had dozens of engineers from various teams working together to try to determine the cause and fix it. One option was to fall back on another data center, but that would have meant shutting the game down and losing scoring data. We wanted to avoid that at all costs. Ultimately, we were able to move our mobile apps to a back-up data center, free up storage to get the PC version of the game working, and get the mobile apps up in a “read-only” state – meaning you could see scores and data, but you still couldn’t set lineups and interact.
We spent Sunday night and most of Monday looking at dozens of potential causes. Monday afternoon, we stress-tested our system. Everything seemed to be in working order, so we turned on all our mobile app functions in time for Monday Night Football. Everything performed as expected and continues to do so. We’ll have all hands on deck this coming Sunday to closely monitor performance and ensure we can respond quickly in case of any abnormal activity.
This outage serves as a perfect use case for application performance management. Most large-scale failures can be mitigated quickly and effectively with the right monitoring tools. It looks like the Yahoo team made some crucial mistakes throughout the process, which made things worse. Let’s take a quick look at some strategies Yahoo could have employed to avoid disaster:
1) Start testing early and test progressively – Although the timetable for Yahoo was tight, it seems they could have benefited from a few extra rounds of performance testing. The earlier you test, the more time you have to prepare for nightmare outage scenarios. You will also gain a clearer picture of your true peak performance data. Furthermore, had Yahoo tested in production, the issue would likely have become apparent much faster.
2) Know your infrastructure – Yahoo should have asked themselves the following questions: What are the most important transactions? How will we find performance issues in these mission-critical applications? How do we see where issues are at peak load? The easy answer would be to have full end-to-end monitoring both during testing phases and in live production. IT resources need to be efficient and responsive, and workloads need to be flexible enough to be moved around as needed. It seems Yahoo failed to fully anticipate the extra capacity needed if the new configuration failed.
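To make the capacity question concrete, here is a minimal Python sketch of a single-failure headroom check, the kind of question the "one filer overloaded the other" story raises. The function names, capacity units, and numbers are all hypothetical and purely for illustration; they are not Yahoo's figures or any real tool's API.

```python
def capacity_after_failure(node_capacities):
    """Capacity left if the single largest node fails (worst case)."""
    if not node_capacities:
        return 0
    return sum(node_capacities) - max(node_capacities)


def survives_single_failure(node_capacities, peak_load):
    """True if the remaining nodes can still carry the expected peak load."""
    return peak_load <= capacity_after_failure(node_capacities)


# Hypothetical example: capacities and load in arbitrary "requests per second".
two_filers = [100, 100]
three_filers = [100, 100, 100]
game_day_peak = 120

print(survives_single_failure(two_filers, game_day_peak))    # False
print(survives_single_failure(three_filers, game_day_peak))  # True
```

With only two nodes, any peak above a single node's capacity means one failure takes the whole service down. That is a simplified picture, but it is the question worth answering before kickoff rather than after.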
If Yahoo had practiced the first two strategies, they almost certainly would have avoided disaster altogether. However, let’s play devil’s advocate and say Yahoo still experienced an outage after heeding our advice. This next step would have shortened the time to resolution and gotten Yahoo’s systems back up and running significantly faster.
3) Ensure monitoring is in place – Monitoring transactions end-to-end helps ensure SLA compliance by giving you the most accurate performance data. Having monitoring capabilities in production is paramount during peak traffic. Furthermore, if your monitoring tool has auto-detection features to map your dependencies, you will save countless hours finding the root cause of a problem. Otherwise, you may be searching for a needle in a haystack, which is what Yahoo ended up doing: “We spent Sunday night and most of Monday looking at dozens of potential causes.” When monitoring in production, use real-time data to ascertain the experience of your end users. Synthetic tools will not give you enough visibility when you experience traffic spikes.
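As a rough illustration of why real end-user data matters for SLA checks, the sketch below compares a 95th-percentile response time against a threshold. The sample values, the 500 ms threshold, and the function names are assumptions made up for this example, not measurements from Yahoo or the output of any particular monitoring product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def meets_sla(samples_ms, threshold_ms, pct=95):
    """True if the chosen percentile of response times is within the threshold."""
    return percentile(samples_ms, pct) <= threshold_ms


# Hypothetical response times (ms) from real user requests during a spike.
real_user_ms = [120, 130, 110, 150, 140, 135, 125, 145, 115, 2400]

print(percentile(real_user_ms, 50))   # 130: the typical user looks fine
print(percentile(real_user_ms, 95))   # 2400: the slowest users do not
print(meets_sla(real_user_ms, 500))   # False
```

The median here looks healthy while the tail is badly degraded, which is the sort of problem a handful of scripted checks can miss during a traffic spike.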
Although software and hardware failures are unavoidable, there are plenty of strategies and tools that give IT an extra layer of insurance. By failing to adhere to common testing and monitoring strategies, both Yahoo and their end users lost out. Although mistakes happen, the cost of public outcry and end-user frustration far outweighs the resources spent on preventing these failures in the first place. This is certainly a lesson Yahoo should never forget.
What are your thoughts on the Yahoo Fantasy Football outage? Do you think Yahoo deserves full blame or should we cut them some slack?