The Hadoop ecosystem has a well-known hole: there is no good open-source tool for near-real-time analytics on Big Data. For example, suppose you have 500 milliseconds to calculate the average temperature reading from a set of sensors, based on some ad hoc user-specified criteria such as location, time and sensor type. Pre-aggregation is not feasible, because you cannot anticipate what search criteria the user will specify. Hive is too slow, even when it runs on Tez in Hadoop 2.0. Cloudera’s Impala may be fast enough, but it makes so many assumptions about what it can fit in memory that it is notoriously unreliable.
Amazon Redshift is meant to fill that hole by handling analytic workloads on large-scale datasets in near real time. Redshift is built on top of the ParAccel analytic database, designed by a brilliant serial entrepreneur (and ex-colleague of mine at Applix) named Barry Zane. ParAccel was based on the following principles:
- Start with Postgres and add analytic extensions. By using Postgres as the foundation, ParAccel benefited from a well-developed ecosystem of tools, e.g., for ETL and query optimization. ParAccel then extended SQL with mathematical, statistical and data mining functions, as well as a language for creating user-defined functions.
- Support columnar orientation for tables. In row format, all the columns for a given row are stored consecutively on disk. This is great for transaction processing, since updating a single row requires only one I/O operation. But it is not so great for analytics, which typically processes a single column value for all rows. In row orientation, this requires one I/O operation for every row. In columnar orientation, on the other hand, all values for a given column are written consecutively to disk. This is not so great for transaction processing, since updating a row requires one I/O operation for every column in the row. But it is ideal for analytical processing, since a typical analysis can be satisfied via a single read of all values for a single column.
- Base the design on Massively Parallel Processing (MPP) with a shared-nothing architecture. The leading cause of poor database performance is contention over shared resources, such as table rows or memory. An MPP architecture avoids these bottlenecks by sharding the data over multiple servers, so that each server has all the resources and information it needs in order to perform most queries. Other databases such as Teradata and Greenplum are based on MPP, but ParAccel succeeded in making its MPP more robust than most of the others.
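
To make the last two principles concrete, here is a toy Python sketch of the sensor example from the opening paragraph. It is only an illustration under my own assumptions, not how ParAccel or Redshift is implemented: the `ColumnarShard` and `Cluster` classes, the choice of `sensor_id` as the distribution key, and the modulo-based sharding are all hypothetical. Each shard keeps its slice of the table as one list per column and computes a partial sum and count on its own; a coordinator then combines the partials into the average.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ColumnarShard:
    """One node's private slice of a sensor table, stored one list per column."""
    location: List[str] = field(default_factory=list)
    sensor_type: List[str] = field(default_factory=list)
    temperature: List[float] = field(default_factory=list)

    def append(self, location: str, sensor_type: str, temperature: float) -> None:
        # A row insert touches every column, which is why columnar
        # storage is a poor fit for transaction processing.
        self.location.append(location)
        self.sensor_type.append(sensor_type)
        self.temperature.append(temperature)

    def partial_avg(self, location: str, sensor_type: str) -> Tuple[float, int]:
        """Scan only the columns the query needs and return (sum, count)."""
        total, count = 0.0, 0
        for loc, kind, temp in zip(self.location, self.sensor_type, self.temperature):
            if loc == location and kind == sensor_type:
                total += temp
                count += 1
        return total, count


class Cluster:
    """Hypothetical shared-nothing cluster: no state is shared between shards."""

    def __init__(self, n_shards: int) -> None:
        self.shards = [ColumnarShard() for _ in range(n_shards)]

    def insert(self, sensor_id: int, location: str, sensor_type: str,
               temperature: float) -> None:
        # Assumed distribution key: sensor_id modulo the number of shards.
        shard = self.shards[sensor_id % len(self.shards)]
        shard.append(location, sensor_type, temperature)

    def average_temperature(self, location: str, sensor_type: str) -> Optional[float]:
        # Each shard works independently; only small partials are combined.
        partials = [s.partial_avg(location, sensor_type) for s in self.shards]
        total = sum(p[0] for p in partials)
        count = sum(p[1] for p in partials)
        return total / count if count else None


if __name__ == "__main__":
    cluster = Cluster(n_shards=3)
    readings = [
        (1, "boston", "outdoor", 10.0),
        (2, "boston", "outdoor", 20.0),
        (3, "boston", "indoor", 22.0),
        (4, "austin", "outdoor", 30.0),
        (5, "boston", "outdoor", 30.0),
    ]
    for reading in readings:
        cluster.insert(*reading)
    print(cluster.average_temperature("boston", "outdoor"))  # prints 20.0
```

The point of the sketch is that the ad hoc filter never requires pre-aggregation: every shard scans only the columns named in the query, and the only data that crosses shard boundaries is one (sum, count) pair per shard.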
Amazon took ParAccel and adapted it to the cloud by adding features such as elasticity, so that the database cluster automatically scales up to handle demand bursts. Amazon then slapped on a very aggressive price tag of only $1K per terabyte per year, a fraction of the total cost of ownership for alternatives such as Teradata, Netezza and, yes, even Hadoop. The total managed offering was re-branded as Amazon Redshift.
The result is a game-changer for data warehouses. Suddenly Big Data Analytics is available to small and mid-sized organizations that cannot afford to purchase and manage any of the alternatives. It cannot do everything the alternatives can do (e.g., it cannot handle petabytes of data like Hadoop), but it completely undercuts the alternative solutions by trading away generality for price. It can also serve as a complementary technology to Hadoop by filling in Hadoop’s missing capability for near-real-time analytics on Big Data.
And yet, I am hesitant to recommend Amazon Redshift to Ness’s Big Data customers unless there is no good open-source alternative. The reason: I am afraid of vendor lock-in. No one is better than Amazon at making vendor lock-in feel so good. The price is great, and you’ll never have to worry about provisioning, database maintenance or version upgrades.
But let’s be clear: choosing Redshift locks you in to Amazon, because once you develop your queries using Redshift’s dialect of SQL, you will find it impractical to switch to anything else. Yes, the terms offered by Amazon are extremely reasonable, but high tech is replete with stories about victims of captive pricing, who are forced to pay whatever updated price the vendor demands, because there is no alternative.
I once experienced firsthand the dangers of captive pricing. My start-up company needed a map widget for our web site, and we could choose either Google’s free widget or a competitor’s commercial widget. Free seemed like a good price, so we went with Google. Just as the web site started to grow in popularity, we received an email from Google telling us that, effective immediately, their map service would no longer be free, but would cost twice as much as the competing service. We had no choice but to pay the fee because the alternative would have been to shutter the web site just as it was gaining popularity.
I’ve never forgotten that feeling of having been taken in by attractive pricing, then discovering I had no choice but to pay a much higher price. So, when analyzing a customer’s problem, I always try to find a reasonable open-source alternative to Amazon Redshift. Sometimes, when there is no good alternative, e.g., when the use case demands near-real-time analytics, I recommend Redshift, because it is a fine product with a fine price. But I also tell the customer that there are no guarantees that the pricing terms will remain so attractive down the road.