The best benchmark to assess the capabilities of a database for BI/Analytics workloads is the TPC-DS benchmark. While the deep integration with Apache ORC and performance tuning of EsgynDB for TPC-DS type workloads is still very early in its maturity cycle, the results are already very impressive.
While for the operational benchmarks there was really no established competitor to compare the performance against, for TPC-DS we used Hive, using the Tez engine against ORC to compare the performance against EsgynDB.
The benchmark tests were run at the 10TB TPC-DS scale. EsgynDB can run all 99 TPC-DS queries with minor modifications that were both ANSI and TPC-DS spec compliant. Hive was only able to run 65 queries.
When comparing the EsgynDB performance numbers to the 65 queries that could run on Hive, EsgynDB was 5 times faster than Hive. Even the total time it took EsgynDB to complete all 99 queries, was quite a bit less than it took Hive to complete 65 queries.
The other impressive story is these tests were run on 2 systems: an 8-node cluster and a 12-node cluster. Essentially for EsgynDB to scale linearly it should have run 1.5 times faster on the 12-node than on the 8-node cluster. EsgynDB ran 1.4 times faster, or at very near linear scale. That indicates that as you grow your cluster you will see a commensurate performance gain, even for BI /Analytics workloads which can be notorious in not being able to scale linearly as versus OLTP workloads.
Of course, no one really runs a single stream of queries in typical BI and Analytics environments. So, we wanted to start demonstrating comparable performance with multiple concurrent queries running. To start we picked some 20 ad-hoc queries that were the fastest set of queries that Hive could run. We then scaled these queries from 1 stream, or user, to 20 concurrent users in steps of 5. We followed the TPC-DS guidelines on sequencing these queries in different order across those concurrent streams. Variations in performance from one concurrent set to another can be attributed to that different mix each time.
EsgynDB was 3 times to 5 times faster than Hive on any one of those runs. The more impressive aspect is that the slope of the increase in response times as the concurrency increased was 2.8 times more for Hive. This indicates that Hive is using far more resources to deliver the same performance as concurrency is increased.
EsgynDB can deliver more performance and higher levels of concurrency with the same set of system resources.
So why did EsgynDB do so much better …
Here is a summary of the comparative / differentiating capabilities that worked for the various queries in the benchmark (we have eliminated the details on which queries benefited from which set of features for brevity), in no particular order.
• EsgynDB’ s long heritage and decades of investment has led to the support of a broad range of ANSI compliant SQL functionality, enabling all 99 TPC-DS queries to be executed with few minor syntax modifications that are allowed by the specification and are ANSI compliant.
• Shared nothing architecture allowing for transparent scalability with increasing cluster or data size. Few tuning changes were needed between 1TB and 10TB scale factors.
• Cost based, top down optimizer efficiently explores a large search space of plan options for complex queries. Several queries have more than a dozen joins and several others have large aggregations. An optimal execution plan is necessary to execute such complex queries efficiently. All queries were compiled without query specific hints or settings.
• Flexible rule based optimization that can detect standard schema patterns such as Normal Form, versus Star and Snowflake schemas. TPC-DS uses a snowflake schema, with a query set that has both typical star schema queries and 3rd Normal Form queries.
• Equal height histograms maintained on all columns, and even combination of key columns, in every table. This gives the optimizer an accurate view of the data distribution for each column, or combination of columns. TPC-DS date dimension spans decades of dates, while the fact tables have data only for a few years. EsgynDB estimates the cardinality properly for join and other operators, utilizing the histograms and the column statistics collected from ORC data partitions, to estimate the number of rows matching a join of the fact table with the date dimension, based on a date range.
• Execution based on a data flow architecture, with different operations in a query fragment being dynamically scheduled. Data is exchanged between operators in a fragment via queues of pointers, without copying data. Between fragments, data is exchanged using TCP over the cluster network.
• Overflow of intermediate data to disk only if there is excessive memory pressure and not enough available memory for the operation. Joins and aggregations of large amounts of data can overflow to disk during query processing. Other database engines fail if there is not enough memory available for big memory operations, such as for large hash joins or sorts, or choose to materialize intermediate results to disk, because they do not have a dynamic model to use memory as much as possible, and overflow only on memory pressure.
• Each query is divided into several execution fragments based on complexity and data exchange needs, with one or more fragments executing in separate processes. Multiple fragments execute simultaneously on a node. This allows all cores present in a node to be used.
• LLVM (Low Level Virtual Machine) based native expression evaluation allows CPU intensive queries to be executed efficiently.
• Data in both dimension and fact tables is stored in a compressed, columnar ORC format, with evaluation of simple predicates pushed to the ORC subsystem such that IO is minimized.
• Horizontal partitioning of ORC data on the date column allows many partitions of fact tables to be pruned reducing IO for several queries. EsgynDB supports both static (based on literals in the query) and dynamic (based on values available only during execution) partition pruning.
• A scan process is assigned a portion of an ORC file – a certain number of stripes – to read, based on ORC statistics, such that:
o Each parallel scan processes equal amounts of data, so they all finish scanning at about the same time
o The data that the process is reading is local to the node that the process is running in, so that there is no expensive inter-node data exchange
• Nested joins are ideal for a star schema query that can benefit from random IO into a fact table. Nested join in EsgynDB is found to be very effective when joining the relatively small DATE_DIM table with a large SALES table, especially when the SALES table is partitioned on SS_SOLD_DATE_SK, and when the local predicate applying to DATE_DIM selects noncontiguous ranges of dates (such as all Mondays in 1999, 2000 and 2001).
• Both matching partition and broadcast joins are used extensively by the EsgynDB optimizer in this benchmark, to improve join performance while minimizing network traffic.
• Partial aggregation is used to reduce dataflow over the network, by doing a partial GROUP BY with available data in each query fragment. A later fragment will consolidate the partial aggregations for the final grouping. Partial grouping is used extensively by EsgynDB in this benchmark.
• Both Hash and sort implementations of GROUP BY are available. Sort GROUP BY is more efficient for aggregation, if data is already sorted. It also avoids overflow to disk.
• Correlated subqueries are transformed by the compiler so that hash joins can be used to evaluate the relation between a subquery and the outer query. This is a powerful technique, since correlated subqueries, where the subquery is expensive to evaluate, cannot be executed otherwise in a reasonable time.
• Aggregates in a window function can be executed in parallel, with any partitioning or ordering expression.
• Flexible degree of parallelism that varies based on query complexity, and even within a query based on the volume of data flow through an operation and the complexity of the operation, enables adequate resources to be provided to intensive queries, while lighter queries execute fast with far less resources. Thereby the system is shared very efficiently between multiple concurrent users.
• Min/Max optimization reduces the amount of data accessed by the scan of a large table in a hash join, by sending the minimum and maximum values of the join column. These are determined during execution by scanning the smaller table in the join first. These min/max values are also used to dynamically prune whole partitions of the larger table, if the join column is a data partitioning column.
• Executor processes which perform large scans are placed in the same node of the cluster as most of the data that is fetched by those scan is resident. This colocation substantially reduces inter-node network traffic.
• For a LIMIT clause, the compiler tries to synthesize a predicate, using histogram statistics that tells the expected cardinality for each operator, to limit the number of rows processed by every operator in the query to the number of rows specified in the LIMIT clause. If this were not done, then essentially the entire query is executed and the LIMIT is applied only to the rows returned by the query. This capability can potentially reduce the data flowing through the query significantly.
• LIMIT clause with an ORDER BY uses a Top N sort, which retains only the top N rows always during the sort. Therefore, it does not overflow to disk when there are many rows to be sorted.
• Queries with EXISTS/IN or INTERSECT use a GROUP BY followed by a join, when the GROUP BY results in a significant reduction of data.
• Scanning ORC files to collect statistics and location data can be expensive when there are many data files for each table. This causes the compile time to increase. Sampling of files and disk based caching of collected metadata is used to reduce compile time.
• Parallel insert into Hive ORC tables using multiple executor processes allows several partitions to be optionally created and inserted into with one INSERT statement.
• Data skew comes in multiple forms. The most common is the “most popular values” data skew, which happens when certain values, in a column used for joins or aggregations, occur significantly more often than other values. The values in join columns can turn out to be very skewed (i.e., a large percentage of rows have the same value). A good hashing algorithm evenly distributes the values across nodes. But if there is a disproportionate number of rows with the same value in the hash columns, these rows will end up on the same node for processing. One, or a few nodes, will be overworked, while other nodes will have little to do. The skew buster feature recognizes situations where data is skewed for certain operations of a query and distributes skewed values differently from the rest of the values, to ensure that data is evenly distributed over all processing nodes. So as can be seen from the above set of benchmark results EsgynDB can perform well at a mix of Hybrid Transactional/Analytical Processing workloads very effectively, with architectural capabilities identified as to why that is. There is more future work and other technology integrations planned that will substantially kick this into another level of unparalleled performance