At Esgyn we have seen many a customer wanting to transition applications from their Data Warehouse Appliance, such as Greenplum or Netezza, over to Hadoop. Many of these customers have already deployed Big Data applications on Hadoop, either on the Cloud, or on-prem with distributions from Cloudera or Hortonworks. Motivated by the:
- tremendous cost savings for these data warehouse applications, given the cost of specialized hardware, steep licensing and support fees, and cost of operating specialized environments
- ease of migration of the SQL application to a Hadoop based relational database engine
- consolidation of data resources on a single elastically scalable commodity platform, either on-prem or on the Cloud (AWS or Azure)
- ability to dynamically provision a deployment in a virtual environment on such a platform, with the ability to adjust resources dynamically based on usage and growth
- flexibility of leveraging the appropriate row and column store depending on the workload
- open format of these data stores so that data is not trapped in a proprietary format
- ecosystem that includes the same but even a richer set of BI and Analytics tools
- ability to integrate Big (unstructured) Data using a richer set of data models supporting documents, search, and graph, along with full-fledged ANSI SQL
- potential to move toward a streaming real-time paradigm
- ability to incorporate Machine Learning and deeper analytics as an adjunct to BI
it would almost be negligent not to consider this, now very enterprise ready, platform to replace your Data Warehouse Appliance.
Data Warehouses need to cater to high concurrency mixed workloads
Proceed with caution: You have to be careful about solutions that were designed for Big Data Analytics that now claim they can handle Data Warehouse workloads. Data Warehouse workloads can span a spectrum from operational to business intelligence (BI) to analytical queries. In that, many queries may require sub-second to seconds response times, seconds to minutes response times, and then minutes to hours response times. The times vary by the amount of data these queries will process, their ad hoc or scheduled nature, and their complexity. But in all cases data warehouse workloads have very high concurrency requirements with a large percentage, often upwards of 90% of the queries, being executed in seconds to minutes. Analytical workloads, on the other hand, have fewer concurrent queries, with a very large percentage of them executing in minutes to hours. A rough outline of the spectrum of these workloads is presented below. The bulk of a database warehouse workload will fall into the BI queries / reporting category.
So, why is this relevant? Because many of the products geared towards analytics are not designed for very high concurrency. High concurrency requires:
- queries to manage memory efficiently
- overflow to disk for big memory operations so that other queries are not impacted
- right-sizing the resources allocated for the execution of the query (degree of parallelism) instead of always using the entire cluster or instance to execute the query
- incremental resources allocated to it as it continues to need more resources
- and the ability to handle skew in joins as the workload is scaled out
Keyed access to data for operational workloads, in order to retrieve a small number of clustered rows, becomes very important, even if done in parallel, versus full scans of the table, in spite of how good the storage engine might be in pruning out the data not needed. Different join strategies are required depending on the nature of the workload. Much of this is discussed in an O’Reilly report In Search of Database Nirvana. But suffice is to say, that a database engine on a scale-out platform, such as Hadoop, needs the ability to handle a mixed set of highly concurrent workloads, with the ability to leverage the appropriate storage engine for that workload.
EsgynDB’s Data Warehousing heritage
Other than EsgynDB, you will be hard-pressed to find a database engine on Hadoop with the experience and maturity to address such a variety of workloads. Previous incarnations of EsgynDB have run data warehouses the size of 250TB with more than 250 concurrent queries running on a 64-node cluster, delivering 15,000 reports a day with an average elapsed time of less than 3 minutes. Alongside, there were 1,000 extracts running per day.
Separately, there was massive Extract, Load, and Transform processing that was run daily, aggregating the data for reporting – 6 hours of intense rollup processing across 50 subject areas, running more than 20,000 SQL statements, across 3,500 ELT jobs. ELT is a key workload for Hadoop based data warehousing, where customers want to leverage the scale-out power of Hadoop for ELT, instead of separate ETL platforms to do that same.
The numbers above are from some five years ago. They have increased substantially after that.
EsgynDB performance relative to Greenplum
But let us just take EsgynDB itself. One of our customers, one of the largest New York based global banks, motivated by the above proposition, had tried to transition from Greenplum to Hadoop using Hive. They were struggling to get the desired performance. They assessed EsgynDB against both Hive and Impala using a mixed workload that they ran on Greenplum, consisting of:
- Simple: Queries that did direct selects
- Medium: Queries with few joins and transformations
- High: Queries with multiple joins & transformations at high volumes
For the simple direct select queries, Trafodion tables based on HBase were used for low latency keyed access. These queries run across 20 threads for 20 minutes, tested concurrency performance, where EsgynDB demonstrated more than a 16x improvement over Greenplum. Hive MapReduce and even Hive using Spark, were substantially slower than even Greenplum.
For the medium queries, the customer ran two queries at concurrency levels of 1, 5, 10, and 20. The first query performed a four table join, including a nested join, producing 8 million rows of output, around 1.24GB, accessing a couple of tables that had around 40 million rows each. The second query did a 6-way join, including two nested joins, producing 1.2 million rows, or 0.83GB. Normalizing for differences in hardware used for Greenplum and EsgynDB, EsgynDB ran these queries 40% to 70% faster.
For the high queries, the customer ran a single query returning 14.7 million rows, or around 14GB, from tables ranging in size from 1.6 million to 14.5 million rows, with rank, case, cast, and distinct functions. Here EsgynDB really seemed constrained by the fact that it had half the memory and half the number of cores to run. Even after normalizing the results, although at 10 concurrency EsgynDB’s results were at par with Greenplum, at the other concurrency levels it was constrained enough to do 50-60% worse than Greenplum.
In other words, your mileage may vary based on your query mix. The customer ultimately chose to move forward with EsgynDB to replace all its Greenplum applications. They felt that it provided the performance they needed, given the constraints where they wanted to run multiple applications on EsgynDB as tenants on a large Hadoop cluster, with other tenants utilizing the same infrastructure.
If Hadoop is part of your enterprise data consolidation strategy, then you should consider EsgynDB as part of that strategy, in order to run all workloads across that platform, instead of using disparate technologies for your varying workloads. The latter can result in multifold skill and operational investments.
EsgynDB performance relative to other Hadoop query engines
Given a set of queries running against an existing non-Hadoop database, with a proprietary storage engine, that may have been tuned for that workload over time, a database like EsgynDB, running on Hadoop, against storage engines like Apache ORC or Apache Parquet, may do worse at some of the queries without some tuning. But then the question arises as to how does EsgynDB do against other Hadoop query engines?
The customer ran a number of benchmarks comparing EsgynDB to both Hive and Impala running on both Apache ORC and Apache Parquet, both of which are also supported by EsgynDB. Six queries that they ran in Hive, but could not run in Impala, were at least 3x faster in EsgynDB than Hive. The size of the Hive system was 2.76x the size of the EsgynDB system in terms of memory, which can make a substantial difference in such queries. The normalized execution times would be upwards of 8x faster.
But no one should ever assess a database engine comparing singleton runs of queries. No one runs one query at a time in production. As discussed earlier, for a data warehouse concurrency is of primary importance. When a couple of the queries were tested at 5, 10, 20, 30, 40, 50 concurrent users, the performance difference between Hive and EsgynDB at increasingly higher concurrencies were stark. The difference in the non-normalized execution times went from 1.5x at 5 concurrent users, to 2.7x at 50 concurrent users for one of the queries. And from 2x to 4.5x for the other query.
A similar test was done for two of the application queries that ran on Impala, against EsgynDB, at similar levels of concurrent users. One query was 2.2x to 2.8x on average faster than Impala at the six concurrency levels tested. The other query started at the 5 users concurrency level with Impala doing 30% faster than EsgynDB. But as the concurrency level was increased to 10 through 50, EsgynDB did up to 5x better at the 50-users concurrency level.
A key point to make here is that when you are at the lower concurrency level, a 2x difference is say a query running in 1 second vs 2 seconds, whereas at the 50 concurrency level the difference is the query running in 5 seconds vs 10 seconds. So, the gap increases as does the concurrency level, indicating that even at a constant 2x at all concurrency levels, the curve for EsgynDB is flatter than that for other databases, which is spiraling upwards. It supports the claim that EsgynDB is very good at managing resources efficiently, the higher the level of concurrency.
Again, on any single query the results of concurrency tests may vary as well. But the bottom line is that EsgynDB will do much better than either Hive or Impala at higher concurrency levels, due to the architectural maturity we discussed earlier, that is required to excel and use available resources optimally at higher levels of concurrency. And with tuning services from Esgyn, from one of the most experienced database experts in the industry, the performance should far exceed what you can get with any other SQL on Hadoop engine.
Ease of migration from Greenplum to EsgynDB
Migrating the application from Greenplum to EsgynDB required no major re-writes to the queries. It required minor changes for data type casts in joins/predicates/transformations. Other minor function and timestamp considerations had to be made. Minor modifications had to be made to the QlikView application. But the migration there as well was very modest in moving from Greenplum to EsgynDB.
In conclusion, as you formulate a data consolidation strategy on a scale-out platform like Hadoop, for the reasons outlined earlier, it behooves you to consider EsgynDB as the single database platform for all your database management needs, from transactional, to operational, to BI, to analytical workloads, all running concurrently, delivering to service level objectives outlined for each tenant running on the platform. No other such scale-out database solution will give you anything comparable.
Over and above what is discussed in this Blog, the cost of the solution could be one-fourth of what it is costing you now, further improving the price and performance ratio dramatically. Contact us to learn more at firstname.lastname@example.org