Netflix, perhaps more than any other web service, must keep running even under enormous traffic. To that end, it uses Amazon Web Services for elastic capacity, but it also tests its systems regularly with a set of tools it refers to as monkeys. According to Wired, it intends to open source the whole kit and caboodle over the next year.
In a post last July on the Netflix blog, the company discussed the idea behind the monkey tools. As a cloud-based vendor, Netflix understands that the cloud provides redundancy and fault tolerance, but it also wants to be able to survive any failure, no matter how obscure or unlikely it might be.
To test for that “once in a blue-moon failure” (as Netflix referred to it), it created a tool it called the Chaos Monkey, which causes random system failures to see how well Netflix can handle the situation.
Think of it as a fire drill for your website.
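To make the idea concrete, here is a minimal sketch of a Chaos Monkey-style drill. It is not Netflix's code: the `Instance` class, the `chaos_monkey` and `service_is_up` functions, and the `web-N` names are all hypothetical, and the "servers" are plain in-memory objects rather than real cloud instances. It only illustrates the principle of knocking out a random machine and then checking whether the service still stands.

```python
import random


class Instance:
    """Hypothetical stand-in for a running server; not a real cloud API."""

    def __init__(self, name):
        self.name = name
        self.alive = True


def chaos_monkey(instances, rng):
    """Kill one randomly chosen live instance and return it (None if none are left)."""
    live = [i for i in instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def service_is_up(instances, min_healthy=1):
    """Assume the service survives while at least `min_healthy` instances are alive."""
    return sum(i.alive for i in instances) >= min_healthy


if __name__ == "__main__":
    rng = random.Random()  # pass a seed here for a repeatable drill
    fleet = [Instance(f"web-{n}") for n in range(3)]
    victim = chaos_monkey(fleet, rng)
    print(f"terminated {victim.name}; service up: {service_is_up(fleet)}")
```

In a real drill, the interesting part is what happens after the kill: whether traffic reroutes, whether a replacement comes up and whether anyone outside the operations team notices.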
It didn’t stop there, though. It created what it called a “simian army” of monkey tools, including a Janitor Monkey to clean up junk files, a Security Monkey to monitor for security vulnerabilities and a Dr. Monkey to run health checks on the system. It even has a Chaos Gorilla to simulate what could happen if an entire Amazon zone goes down (and that has happened). You get the idea. There are others, and you can review the whole list by following the link in the second paragraph.
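The Chaos Gorilla idea can be sketched the same way. The example below is an illustration under stated assumptions, not Netflix's implementation: the fleet is a hypothetical list of `(instance name, zone)` pairs, the zone labels are made up, and `chaos_gorilla` and `zones_that_break_service` are names invented for this sketch. It asks a simple question: if any single zone disappeared, would enough instances remain?

```python
def chaos_gorilla(fleet, zone):
    """Return the fleet as it would look if every instance in `zone` vanished."""
    return [(name, z) for name, z in fleet if z != zone]


def zones_that_break_service(fleet, min_healthy):
    """List the zones whose loss would leave fewer than `min_healthy` instances."""
    zones = sorted({z for _, z in fleet})
    return [z for z in zones if len(chaos_gorilla(fleet, z)) < min_healthy]


if __name__ == "__main__":
    # Hypothetical fleet: three instances crowded into one zone, one in another.
    fleet = [
        ("web-0", "zone-a"),
        ("web-1", "zone-a"),
        ("web-2", "zone-a"),
        ("web-3", "zone-b"),
    ]
    print(zones_that_break_service(fleet, min_healthy=2))  # prints ['zone-a']
```

Here the lopsided layout is the problem: losing the crowded zone leaves a single instance, which is below the assumed minimum of two.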
Why would Netflix open source these tools? According to the Wired story, it’s because it wants to be seen as a leader in the cloud, to make this type of testing more standardized and to attract the best and the brightest. If you’re making cool tools and open sourcing them, that makes you look like a good place to work.
For IT pros, Netflix’s generosity could well be your gain. If you could take these tools and use them on your system, you might avoid problems like the ones faced by Target or the 1940 Census data site, whose launches were marred when too much traffic brought them down.
If you’re like most site operators, you do some rudimentary testing based on your best guess of the worst-case traffic scenarios. Then you monitor the traffic and look for issues as they develop over time. Netflix is a bit more proactive than that.
As Netflix makes these tools available, you would be foolish not to take advantage of its largesse. The monkey toolkit may be designed to work with Amazon Web Services instances, but because it will be open source, there may be room to adapt it to other environments.
Whatever the case, you owe it to yourself and to your company to at least check out these tools and give them a run on your system. If you can proactively test the system to prevent worst-case outages, why wouldn’t you?
Photo by thegarethwiscombe on Flickr. Used under the Creative Commons License.