Background and Need
Enterprises are filled with structured and semi-structured data generated by all sorts of transactional applications. Much of that data is increasingly finding its way into Hadoop clusters for analytics because of Hadoop’s flexibility and the inexpensive, linear scalability of its compute and storage. To extract insights from this data, enterprises often build data lakes consisting of structured, unstructured and semi-structured data, with SQL still being one of the most favored option for querying it. A large majority of global data and analytics technology decision-makers, end-users, and business users are interested in, planning, implementing, or expanding SQL on Big Data.
SQL is 42 years old, first invented by IBM in 1974, and Hadoop is a new 11-year old kid making a lot of noise since its emergence from Yahoo in 2005. Hadoop has data (Hadoop being the de facto standard in Big Data, to the point where it is being used interchangeably with Big Data) and SQL has the popularity and a well-known way to query it. For the majority of enterprises with a global data mix who are looking at implementing Big Data or expanding on Big Data, there is the rising ask of robust SQL on Big Data. This has become a must for the following reasons:
- SQL being the universal connector for many BI tools and technologies
- Abundant SQL skills can be leveraged to mitigate some of the Hadoop skill gaps
- Hadoop offers more data and more analytics at a very low cost
Key factors to consider when evaluating SQL on Big Data, which should be ranked as per the organization’s business context and needs:
- ANSI compatibility varies widely in SQL engines: It is important to understand the SQL engine’s current level of support for a specific ANSI SQL standard. Pure SQL Hadoop engines tend to have the biggest gaps because they were developed from scratch in the last few years and are still implementing a road map to eventually become more ANSI SQL compliant.
- Joins: Executing joins on big tables quickly and efficiently is not always easy, especially if the SQL on Big Data engine has no idea where the data is stored. An inefficient style of join can result in massive IO and massive shuffling of huge data sets being moved between nodes, resulting in poor performance.
- Data Federation: Not all data is stored in Hadoop. Most enterprise data is still in relational databases. SQL on Big Data engines should be able to support distributed joins on data stored in all kinds of data sources. This sometimes is also referred to as push down capability, and being able to do it means there is no need to replicate the data.
- Performance: It depends on a number of factors. Understanding the specific nature of the SQL use cases is extremely important for selecting the right technology. Benchmarks such as TPC-DS (Transaction Processing Council for Decision Support) attempt to mirror real-world workloads to measure and compare performance between the engines, but vendors often don’t follow the TPC procedures or fully audit the results. One has to carefully evaluate an SQL engine’s claims with regard to scalability, performance under load and concurrency as some of the key performance consideration.
What’s Available today
There are dozens of SQL on Big Data solutions that can offer something for everyone. The first SQL on Big Data was Apache Hive. In the last few years, many others have joined the ranks, which include CitusDB, Cloudera Impala, Cascading Lingual, Hadapt, InfiniDB, JethroData, Apache Drill, Pivotal HawQ, Apache Phoenix, Apache Spark SQL, Apache Drill, Presto, Actian Vortex, Teradata QueryGrid, Microsoft PolyBase, IBM Big SQL, and HP Vertica for SQL on Hadoop.
SQL on Big Data is essential for Data Lakes, as enterprises are building data lakes full of structured, semi-structured, and unstructured data. SQL for Hadoop is getting good, good enough for Data Warehouse use cases. SQL-for-Hadoop, especially the commercial SQL engines from the database vendors, is American National Standards Institute (ANSI) compatible and fast. They are getting so good that, for some use cases, they are starting to give traditional data warehouse appliances a run for their money. Some of the best SQL-for-Hadoop solutions come from those vendors.
It would not surprise me if every organization that uses Hadoop eventually deploys a SQL-on-Hadoop engine (or maybe even a few). As organizations compare and evaluate the available technologies, assessing the engine’s capabilities for the key factors listed in this article would be a great starting point.