
Data engineers are tracking the latest trends around the Vs of Big Data (Volume, Variety, Velocity and others). Most initiatives lead to a higher ingested Volume of enterprise data and a wider Variety of enterprise or cloud source systems, while the demand for Velocity stays constant. Data engineers and data analysts are also looking to migrate data warehousing to the cloud to increase performance and lower costs.
Big Data architects, meanwhile, are looking for new intelligent solutions to govern the data swamp and, in the end, to create robust security models that protect data and manage “data lakes” in full compliance. The rules of the game always push return on investment to the limit, so most of the time organizations need to strike a balance between open source technologies and proprietary/commercial ones.
From an engineering point of view, a big data discussion always starts with the cluster type. In the beginning most clusters were built “on premises”, then the industry moved to the “public cloud”, and nowadays we take full advantage of “hybrid” deployments.
Anything related to provisioning, dynamic commissioning and decommissioning is already offered as a service, whether Infrastructure as a Service (IaaS) or Platform as a Service (PaaS). Amazon EC2 (Elastic Compute Cloud) provides a fully customizable, integrated and scalable environment, but also leaves room for open source platforms like Cloudera, Hortonworks and MapR.
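To make the point concrete, here is a minimal sketch (assuming Python with boto3; the AMI ID, instance type and region are placeholders, not values from this post) of how commissioning and decommissioning a compute node boils down to a couple of API calls:

```python
# Hypothetical sketch: provisioning a single EC2 instance with boto3.
# The AMI ID, instance type and region below are illustrative placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0abcdef1234567890",  # placeholder AMI
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched instance {instance_id}")

# Decommissioning is just as programmatic:
# ec2.terminate_instances(InstanceIds=[instance_id])
```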
From a financial point of view, Spot Instances combined with the Amazon EMR (Elastic MapReduce) service definitely raise the bar in terms of capacity planning. IaaS and PaaS are already mature enough to offer solid support for the transition from capital expenditure to operational expenditure. In the same context there are a few big questions: “Which solution is the most cost effective, in terms of licensing costs or support costs?” I would respond that it is a combination of both.
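As an illustration of the Spot plus EMR combination, below is a hedged boto3 sketch; the cluster name, release label, instance counts and IAM roles are illustrative assumptions, not values from the presentation:

```python
# Hypothetical sketch: an EMR cluster whose core nodes bid on Spot capacity.
# Names, counts, release label and roles are assumptions for illustration only.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster = emr.run_job_flow(
    Name="cost-aware-analytics",          # placeholder cluster name
    ReleaseLabel="emr-6.10.0",            # illustrative EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "Market": "SPOT", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", cluster["JobFlowId"])
```

The interesting part is the `Market: "SPOT"` setting on the core group: the same capacity-planning decision that used to be a hardware purchase becomes a per-cluster parameter.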
From a strategic point of view, any deep integration with a big data platform strongly affects “independence of work”. Amazon’s Big Data platform heavily integrates open source technologies, an area where Amazon is itself a big contributor, but it also offers innovative services of its own. Examples include data persistence services such as S3, Glacier and EBS alongside Hadoop’s HDFS.
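To show how little application code has to change between HDFS and S3 persistence, here is a small PySpark sketch; the bucket name and paths are placeholders, and it assumes the s3a connector is available on the cluster:

```python
# Hypothetical sketch: the same DataFrame persisted to HDFS or to S3;
# only the URI scheme changes. Bucket and path names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persistence-demo").getOrCreate()
df = spark.range(1_000_000)  # toy dataset

# On-premises Hadoop cluster: write to HDFS
df.write.mode("overwrite").parquet("hdfs:///data/demo/events")

# AWS: the identical call against S3 (s3a connector assumed to be configured)
df.write.mode("overwrite").parquet("s3a://my-example-bucket/data/demo/events")
```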
In addition to this brief blog introduction, I am also providing a video presentation from the Ness TechDays virtual conference, which consists of two distinct parts:
- The first part reviews where enterprise systems end and big data solutions begin.
- The second part is a comprehensive comparison between Apache open source projects and Amazon AWS. This snapshot of the currently valuable technologies in the Big Data ecosystem is meant to shorten the time needed for architectural decisions.
The comparative approach covers architectural aspects such as cost model, performance, availability, scalability and elasticity for analytics and data warehousing, outlining available AWS services and their open source alternatives.
The final goal of the presentation is to offer a reference for a typical transition of a software solution from “on premise infrastructure” to “hybrid cloud infrastructure.” View the full “Big Data Open Source Projects vs Amazon Web Services” presentation here.