A recent conference for open source software developers included a full-fledged subconference that dealt primarily with the issues of big data: OSCON Data. And no wonder–the work of managing millions and billions of data records is performed mostly with open source software tools.
Big data is an inevitable outgrowth of Software as a Service (SaaS). As more and more SaaS services are launched, the data being exchanged expands exponentially. A February IDC report, “The Big Deal About Big Data,” estimates that data use in the next decade will amount to 35.2 zettabytes. One zettabyte, by the way, equals 1,099,511,627,776 gigabytes.
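The arithmetic behind that conversion is easy to check, assuming binary units (a zettabyte as 2^70 bytes, a gigabyte as 2^30 bytes):

```python
# Binary-unit conversion: 1 ZB = 2**70 bytes, 1 GB = 2**30 bytes
ZB = 2**70
GB = 2**30

gigabytes_per_zettabyte = ZB // GB  # 2**40
print(gigabytes_per_zettabyte)      # 1099511627776
```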
In other words, a lot.
Traditional relational databases are often not up to the task of managing big data. Instead, non-relational databases–a class often referred to as NoSQL databases–are deployed. It’s not that NoSQL databases can’t have a Structured Query Language (SQL); it’s that they don’t use a standard query language, so they can use whatever they want–including SQL.
Other differentiators of NoSQL databases are that they all have their own APIs and, somewhat alarmingly to anyone unfamiliar with NoSQL concepts, “they are not the most scalable, simple, or flexible databases around.” None of these features initially makes for a compelling selling point: varied query languages and APIs, and a frank admission that non-relational databases may not be the best at all of the qualities typically associated with a “good” database. But there’s something else going on here that quickly makes you realize where the benefits of non-relational databases truly lie.
The key is that very notion of being the best at everything–that’s not what non-relational databases are about.
To understand this statement, one must step back and see the broader set of principles that dictates the infrastructure of relational databases: ACID (Atomic, Consistent, Isolated, and Durable), core properties that must apply to all data within a relational database. Data is broken down into atomic values (name, address_1, city…) while remaining consistent across the database, isolated from other transactions until the current transaction is finished, and durable in the sense that the data should never be lost.
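A minimal sketch of atomicity and consistency in practice, using SQLite and a hypothetical accounts table: both updates inside the transaction take effect together, or neither does.

```python
import sqlite3

# Hypothetical schema for illustration: two account balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 "
                     "WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 "
                     "WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure the rollback leaves the database consistent

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 70, 'bob': 80}
```

If either UPDATE raised an error, the `with conn:` block would roll both back, and the total money in the system would remain unchanged–the consistency guarantee that relational databases are built around.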
The infrastructure of a relational database is well suited to meet these criteria: data is held in tables connected through relational algebra. But in a non-relational, NoSQL database, one or more of these ACID criteria are dropped.
One of the biggest objections organizations might have against NoSQL databases hinges on this approach. They aren’t willing to make a move to NoSQL because they can’t give up ACID. Particularly the “C,” because not having data consistency is a terrifying prospect for any company dealing with financial transactions. Which is just about everyone.
Yet non-relational databases are being used by firms like Amazon and Google every day, with great success. Amazon, in particular, needs to track millions of transactions on any given day–so how do they get away with inconsistent data?
The simple truth is, they almost have to. The alternative would be a relational database that could never keep up with the speed and scale necessary to make a company like Amazon work as it does now. Recall that non-relational databases are structured to sacrifice some aspect of ACID to gain something in return. In Amazon’s case, its proprietary non-relational Dynamo database applies an “eventually consistent” approach to its data in order to gain speed and uptime for the system when a database server somewhere goes down.
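A toy sketch of what “eventually consistent” means–this is not Amazon’s actual Dynamo design, just the general pattern: a write is accepted by one replica immediately and propagates to the others later, so a read from another replica can briefly return a stale value.

```python
# Toy eventual-consistency model: one replica accepts the write,
# the rest catch up asynchronously via a pending-update queue.
class Replica:
    def __init__(self):
        self.data = {}

replicas = [Replica(), Replica(), Replica()]
pending = []  # updates not yet propagated to every replica

def write(key, value):
    replicas[0].data[key] = value  # accepted immediately by one node
    pending.append((key, value))   # others receive it later

def propagate():
    for key, value in pending:
        for r in replicas:
            r.data[key] = value
    pending.clear()

write("price:book-123", 19.99)
print(replicas[2].data.get("price:book-123"))  # None -- stale read
propagate()
print(replicas[2].data.get("price:book-123"))  # 19.99 -- now consistent
```

The window between `write` and `propagate` is exactly where the system trades consistency for availability: the write never blocks waiting for every replica, so the service stays fast and up even when a node is unreachable.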
Here’s an example of how this works for Amazon and similar web services. Books, for example, have a variety of datapoints within the system for which Amazon tolerates inconsistency: price, ratings, or location. If a server is down, or a change has been made that hasn’t yet propagated to all the data containers in the Amazon system, there is a chance that one of these datapoints will be inconsistent when a customer comes in to buy that book.
Amazon’s goal is that such inconsistencies will be eventually resolved, with “eventually” being right before the customer is formally billed for the book. Sometimes customers will see the inconsistency and can still continue with the sale: subtle changes in shipping dates or ratings for the book. Sometimes, the inconsistency isn’t resolvable in time: a book sold out just as the buy now link was clicked by another customer, for instance. In this case, customers may see a “sorry, there’s a problem with this order” screen and be directed to reserve the book instead. Or, if the inconsistency lasts a really long time, an e-mail is sent to the customer apologizing for the error and offering alternatives to rectify the situation.
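The “resolve it right before billing” step can be sketched as a final, authoritative re-check at checkout. Everything here is hypothetical (the `checkout` function and its order/stock shapes are invented for illustration), but it captures the pattern: bill only if the re-check passes, otherwise fall back to the “reserve” path.

```python
# Hypothetical checkout sketch: re-validate stock against the
# authoritative source right before billing, and fall back to a
# reservation if the item sold out while the cart was open.
def checkout(order, stock):
    available = stock.get(order["item"], 0)  # authoritative re-check
    if available >= order["qty"]:
        stock[order["item"]] = available - order["qty"]
        return "billed"
    return "reserved"  # the "sorry, there's a problem" path

stock = {"book-123": 1}
print(checkout({"item": "book-123", "qty": 1}, stock))  # billed
print(checkout({"item": "book-123", "qty": 1}, stock))  # reserved
```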
Saying “we’re sorry,” ultimately, is cheaper than no sale at all.
Think about all of the complexities of this type of transaction, and then try to apply application performance monitoring (APM). How would APM look in an environment where the business transaction itself has a small but non-zero probability of failing?
In part two of this article, we’ll examine the challenges of monitoring application performance in a big data world.