Though Maturing, Hadoop Ecosystem Has Room to Grow


With Hortonworks’ recent announcement that it will offer Apache HAWQ as Hortonworks HDB, the ecosystem around Hortonworks and Hadoop continues to evolve. Now 10 years old, Hadoop has become a critical foundation for many Big Data initiatives at most global enterprises, even as its relevance is questioned from time to time in light of the latest innovations in Big Data technologies.

As the foundational technologies continue to improve, and innovations continue to add value or close gaps, customers and developers are often left to figure things out on their own. Such experimentation is expensive and leads to customer frustration. Hortonworks’ efforts to offer supported versions of complementary open source software can certainly help customers and developers. However, customers should weigh some critical aspects before committing to any of these offerings. Any technology a customer adopts requires considerable investment in training, skills acquisition, and tooling, as well as in the design, tuning, and operational effort needed to make it all work, not only for a small initial deployment but for future deployments spanning multiple projects and workloads. Only then does the choice of technology start paying returns on that investment. Most companies have to do this without armies of highly skilled, highly paid experts, who are hard to find in any case. The decision requires considerable deliberation.

Let us understand what the introduction of HDB as part of the Hortonworks stack means for enterprises in the context of the Big Data movement:

    • There is a need for SQL based access to Big Data. The HDB announcement validates the need for SQL based solutions that give rapid access to the data being locked up in Hadoop. As more enterprises invest in creating Enterprise Data Lakes and filling them with more data, it is important to find an easy way to get the data out of the lake and run the relevant queries and analysis to deliver value back to the business.


    • IT needs to play a bigger role in delivering value from Big Data. The announcement also highlights the need to enable IT organizations to play a bigger role in delivering value. Depending on the dynamics between organizations within an enterprise, ownership of and access to data can be tricky. Businesses may sponsor data scientists to derive benefits from advanced analytics, but IT organizations may not have access to the same investment to hire their own data scientists to support their applications.


    • Do more with Big Data to get more. Thus far the major focus has been on complex analytics and MapReduce algorithms that are essentially batch processing in nature. As analytics workloads arrive via either MapReduce jobs or SQL based solutions on top of Hadoop, IT is realizing that it can do more with Big Data: enable some operational workloads and potentially host less critical transactional workloads within the data lake. This realization is slowly sinking in as businesses demand more from their Big Data investments and IT grapples with how to meet those demands.


    • ANSI SQL is a must. As many open source projects backed by different vendors started offering SQL support, many developers were content with partial or limited SQL support while they tested what they could do with their Big Data investment. Initially, applications were fairly advanced and typically delivered by ISVs or solution providers. Now that more IT apps are leveraging Big Data and the Enterprise Data Lake, more complete ANSI SQL support is essential for modernizing or transforming existing SQL based apps so that they can use the consolidated enterprise data in the Data Lake.


    • Bring Analytics into the Database. The promise of Big Data is to uncover and leverage insights that serve customers better, deliver new services, and generate more revenue. However, the stars need to align before that promise is fulfilled. A key aspect of delivering new insights is applying advanced analytics to enterprise data held in a central Enterprise Data Lake. Data movement and duplication have become a major bottleneck to achieving that for many IT organizations. Also, with the advent of SQL engines on top of Big Data platforms, IT organizations and data scientists alike increasingly expect to deliver analytics from the database using SQL based queries, along with the ability to run complex algorithms via extensible User Defined Functions (UDFs).


    • Proprietary RDBMS vendors’ support for Hadoop falls short – Many proprietary RDBMS vendors have announced support for Hadoop, but mostly as a way to pull data out of Hadoop into their proprietary storage formats before running queries. This defeats the purpose of an enterprise data lake, and adds to the complexities and latencies associated with data movement and duplication at several layers. It can be a nightmare for IT organizations to keep all this data in sync.


    • Performance demands are growing. SQL engines are being put to the test as data volume, variety, and velocity continue to increase. Query complexity is growing along with them, both in the number of joins required and in the types of data (structured, semi-structured, and unstructured) being processed by Big Data apps.
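
As a rough illustration of the "Bring Analytics into the Database" point above, the sketch below registers a user defined function and calls it from a SQL query, so the computation runs where the data lives instead of after exporting rows. It uses Python's built-in sqlite3 module purely as a stand-in; it is not HDB, HAWQ, or EsgynDB, and those engines define UDFs through their own mechanisms. The loans table and the risk_score rule are hypothetical names made up for this example.

```python
import sqlite3


def risk_score(amount, late_payments):
    """Hypothetical scoring rule standing in for a more complex algorithm."""
    return round(amount * 0.001 + late_payments * 2.5, 2)


def score_in_database(rows):
    """Load rows into an in-memory table and score them with a SQL-callable UDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE loans (customer TEXT, amount REAL, late_payments INTEGER)"
    )
    conn.executemany("INSERT INTO loans VALUES (?, ?, ?)", rows)

    # Register the Python function so SQL can call it by name with 2 arguments.
    conn.create_function("risk_score", 2, risk_score)

    # The scoring and the ordering both happen inside the query;
    # no rows are pulled out first to be processed elsewhere.
    result = conn.execute(
        "SELECT customer, risk_score(amount, late_payments) AS score "
        "FROM loans ORDER BY score DESC"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("alice", 1000.0, 0), ("bob", 2000.0, 2), ("carol", 500.0, 1)]
    for customer, score in score_in_database(sample):
        print(customer, score)
```

The same pattern, a function registered once and then invoked from ordinary SQL, is what lets existing SQL skills and tools reach more advanced analytics without a separate data export step.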

The State of SQL Engines for Hadoop

Coming back to choosing HDB or Apache HAWQ versus other SQL databases for Big Data, it is important to note the problems customers are experiencing with the current solutions:

  • Improper memory management that hurts performance and destabilizes clusters, even for the simplest of queries
  • Inefficient use of parallelism, resulting in poor response times or throughput
  • Inability to join a large number of tables, which pushes developers to write complex and inefficient MapReduce jobs
  • Lack of sophisticated query optimization techniques, resulting in subpar query performance
  • Inability to support updates, especially while data is being read (only immutable data can be accessed)

Esgyn Core Tenets

With EsgynDB, we have the unfair advantage of a mature, 4th generation SQL engine that addresses the challenges listed above. Here are the fundamental tenets of EsgynDB:

    • Democratize Big Data and high-performance computing – While Big Data initiatives currently demand new skillsets, we believe that building, maintaining, and leveraging Big Data needs to be democratized by enabling enterprises to use their existing resources to get the benefit. Our open source contributions via Apache Trafodion put the power of a mature 4th generation SQL engine in the hands of the SQL developer community, so that existing SQL-based apps can be transformed to leverage the data in enterprise data lakes.


    • Accelerate Value from Enterprise Data Lake – As more customers deploy data lakes and wonder how to derive value from them, we believe SQL based solutions are the only answer to this quandary.


    • Bring all workloads to one single SQL Database – Esgyn is driving this transformation by addressing all workloads, including reads, writes, joins, operational queries, ACID transactions, and analytics. Most SQL engines on the market focus only on analytic workloads, leaving much of the value of Big Data untapped.


    • Minimize / eliminate data movement and duplication – Once the data lake is created and data is flowing in, there shouldn’t be a need to move data out of the lake for a specific app. EsgynDB reduces the need for data movement and duplication, which has been a major constraint for operational data stores and BI projects, though it may have played out well for proprietary RDBMS vendors. By hosting operational workloads on EsgynDB, you reduce data movement and duplication from proprietary operational systems. You also don’t need to aggregate or duplicate data out of the Data Lake, since EsgynDB allows fast processing of the data using SQL right on the same Big Data platform.


    • When it comes to SQL Engines, experience and maturity matter. While it is great to see considerable innovation in the Big Data space, Esgyn is proud of its Tandem and HP roots, where we spent the last couple of decades and hundreds of millions in investment perfecting a SQL engine that ran all types of workloads for global enterprises. Rome wasn’t built in a day!


    • SQL Engines need to optimize for different Big Data storage formats. In a proprietary RDBMS, the vendor optimizes the integration of the query engine with the storage engine to enable faster parallel processing and deliver superior functionality and performance. With open source Big Data solutions based on Hadoop or other storage formats, it is the job of the SQL query engine to integrate tightly with the storage engine and deliver on the performance promise. EsgynDB, with its unique pluggable data management framework, delivers on this promise through deep integration with HBase, Hive, and ORC, each optimized for the relevant data models, workloads, and performance characteristics.

Before You Choose

Before you decide on a specific SQL engine for Hadoop, ask the appropriate questions. Our team has put together two resources to that effect.

  • Download the checklist to compare SQL Engines
  • Download the SQL Engine Requirements paper to know what questions to ask your SQL on Hadoop vendor.

Join our Webinar on July 12 at 10AM Pacific


About the Author:

Rohit Jain is Esgyn's Chief Technology Officer. Rohit has worn many hats in his career, including solutions architect, database consultant, developer, development manager, and product manager. Prior to joining Esgyn, Rohit was a Chief Technologist at Hewlett-Packard for SeaQuest and Trafodion. In his 39 years in applications and databases, Rohit has driven pioneering efforts in Massively Parallel Processing and distributed computing solutions for both operational and analytical workloads.