When NoSQL databases first appeared, in response to the need to process high volumes of unstructured data, the “o” in NoSQL was small, and it meant that these systems had “No” SQL support – SQL was meant for traditional structured relational databases, not these new schema-less contraptions. But, developers know and love SQL, so these products soon developed SQL-like façades, and the “O” in NOSQL became capitalized – NOSQL came to mean “Not Only” SQL. These SQL façades, such as Phoenix for HBase and CQL for Cassandra, vastly accelerated the adoption of NOSQL databases by lowering the barrier to entry: developers could access the database via a language they already knew – a subset of good old SQL. But, a lower barrier to entry also creates a new problem – programmers have no idea what is going on behind the SQL façade and, as a result, write programs that are wildly inefficient, far less efficient than the equivalent program in a traditional relational database.
Let’s take Cassandra as an example. In the old days, before Cassandra’s SQL façade, every Cassandra programmer knew that a Cassandra row key is different from a column, because a column is stored as a <name, value> pair, while the row key is stored as a nameless value. And, every Cassandra programmer knew that the only thing Cassandra sorted was the column names within a row – not the row keys, and not the column values. So, if you needed to retrieve sorted data, you had to place all the data in one row, and “promote” the data to be sorted so it became part of the column name. There was elegance in the simplicity of this model, and Cassandra programmers became experts at concocting column names that sorted all the desired data within a given row in just the right order. Want to sort some columns (e.g., events) by time, and sort other columns (e.g., names) alphabetically? No problem – just encode some of the column names with timestamps and some with text, and Cassandra will automatically sort them as separate groups.
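The name-encoding trick can be sketched in a few lines. This is an illustration of the sorting behavior, not Cassandra’s actual storage code; the row contents and the “event:”/“name:” prefixes are made up for the example.

```python
# Illustration only (not real Cassandra code): in the old model, Cassandra
# sorts the columns within a row by their *names*. By encoding a sortable
# prefix into each column name, the programmer controls the order in which
# the columns come back.
row = {
    "event:2015-01-12T08:30:00": "login",
    "name:alice": "Alice A.",
    "event:2015-01-10T14:00:00": "signup",
    "name:bob": "Bob B.",
}

# ISO-8601 timestamps sort lexicographically in chronological order, so a
# plain byte-wise sort of the column names yields all events in time order,
# followed by all names in alphabetical order - two separately sorted groups.
ordered = sorted(row)
print(ordered)
# -> ['event:2015-01-10T14:00:00', 'event:2015-01-12T08:30:00',
#     'name:alice', 'name:bob']
```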
Then, along came CQL, an SQL-like façade that makes Cassandra look like a relational database. The row key looks just like any other column. And, rather than requiring the user to manually concoct column keys, CQL lets the user define one set of hierarchical clustering columns whose values get “promoted” into the column keys automatically. So, instead of having different sorting strategies for different parts of the row (e.g., some by time and some alphabetically), CQL enforces a single sorting strategy for all columns in the row.
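The “promotion” CQL performs can be sketched as follows. This is a simplified model of how a CQL table maps onto the underlying wide row, under an assumed (hypothetical) table with one partition key and one clustering column; the real on-disk layout has more moving parts.

```python
# Illustration only: CQL "promotes" the clustering-column values of each CQL
# row into the stored column key, so every column in a partition sorts by
# the single clustering order declared in the table definition.
#
# Hypothetical CQL table:
#   CREATE TABLE events (
#       sensor_id  text,
#       event_time text,
#       reading    double,
#       PRIMARY KEY (sensor_id, event_time)
#   );
# sensor_id is the partition (row) key; event_time is the clustering column.

# Three CQL rows for one partition, inserted out of order:
cql_rows = [
    ("sensor-1", "2015-01-12T08:30", 21.5),
    ("sensor-1", "2015-01-10T14:00", 19.8),
    ("sensor-1", "2015-01-11T09:15", 20.1),
]

# Behind the facade, each non-key column of each CQL row becomes one stored
# column whose key begins with the clustering value, all inside one wide row:
storage_row = {
    (event_time, "reading"): reading
    for (_partition, event_time, reading) in cql_rows
}

# Cassandra keeps these composite column keys sorted, so reading the
# partition back yields the CQL rows in clustering order - one sorting
# strategy for the whole row, fixed by the table definition.
print(sorted(storage_row))
```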
Nowadays, when I interview candidates who know Cassandra, I am amazed at how little they understand about what goes on behind the CQL façade. For example, few CQL programmers can explain why it is a bad idea to select a range of row keys, e.g., from January 10 to January 12, 2015. (Remember what I said about row keys being stored unsorted? To find the selected range, Cassandra will have to read every row in the table, which could take hours.) They have been fooled by the façade into thinking they are using a full relational database, and are disappointed to discover that their queries run much slower on Cassandra than they did on Oracle or SQL Server.
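Why the range query degenerates into a full scan can be sketched as well. With Cassandra’s default partitioner, rows are stored in the order of a hash (token) of the row key, not the key itself; the sketch below uses MD5 as a stand-in for the real token function, and the date-string keys are invented for the example.

```python
import hashlib

# Illustration: with the default partitioner, Cassandra stores rows in the
# order of a *hash* (token) of the row key, not the key itself.
keys = ["2015-01-09", "2015-01-10", "2015-01-11", "2015-01-12", "2015-01-13"]

def token(key):
    # Stand-in for Cassandra's real token function; any hash makes the point.
    return hashlib.md5(key.encode()).hexdigest()

# Physical storage order = token order, which scatters consecutive dates:
physical_order = sorted(keys, key=token)
print(physical_order)

# A query like "WHERE key >= '2015-01-10' AND key <= '2015-01-12'" cannot
# seek to one contiguous slice; the matching rows can only be found by
# scanning every row and testing the predicate against each one:
matches = [k for k in physical_order if "2015-01-10" <= k <= "2015-01-12"]
print(matches)
```

On a five-row toy table the scan is invisible; on a billion-row table, it is the difference between milliseconds and hours.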
Don’t get me wrong. I am not proposing that we abandon the SQL façades for NOSQL databases – they make maintenance and programming dramatically simpler. But, don’t be fooled by the façade – always make sure you understand how the data is organized behind the façade, and make sure your queries are accessing that data efficiently. Otherwise, you are in for nasty performance surprises as your NOSQL database reads billions of rows from disk to answer what appears to you to be a simple query.