One of the great strengths of the Apache Hadoop™ ecosystem is that it glues together diverse technologies to solve an unlimited set of big data problems. Gluing things together well requires attention to ease of use and how fast and efficiently the pieces can exchange data.
EsgynDB™, a web-scale enterprise SQL-on-Apache Hadoop™ solution from Esgyn Corporation, now supports a tight integration with Apache ORC™ files. In this blog post, I’ll describe what benefits come from marrying EsgynDB and Optimized Row Columnar (ORC) format files, and then explore two important use cases this marriage addresses.
EsgynDB™ is a highly scalable SQL engine based on Apache Trafodion™ (incubating). The latter is a highly scalable enterprise-class database engine that was open-sourced by Hewlett-Packard in 2014. It has over 20 years of research and development behind it in both online transaction-processing and data warehouse markets, with very sophisticated query optimization and run-time technology.
All database engines rely on an underlying storage engine layer, where tables are stored. Since EsgynDB’s debut in 2015, it has used Apache HBase™ for its storage engine. Support for other storage engines was gradually added, for example Apache Hive™ text and sequence files. This support now includes ORC files as well.
The evolution of ORC is an interesting story. It is an outgrowth of the Hive project. Hive itself came about because of the realization that writing Java programs to execute MapReduce was a very inefficient way to perform ad hoc query processing – inefficient from a developer or user perspective. Hive implemented a subset of the SQL language, with its own twists that exposed details of the storage engine underneath. This approach gave users who were accustomed to dealing with those details a way to keep doing so, while eliminating much of the tedium of Java programming.
Originally, Hive used simple text files as its storage layer. It soon became apparent that improvements in performance could be had by creating more optimized kinds of file formats. One attempt resulted in Record Columnar files (RC Files). ORC was developed as an improvement on RC Files.
ORC is designed for analytic query processing. It is a “row columnar” store; that is, ORC files have a two-level organization. An ORC file is broken up into stripes, each containing a collection of rows. Within a stripe, columns are stored separately. Each column is compressed using a technique appropriate to its data type. For example, integer columns use run-length encoding, while string columns use a dictionary technique. An index structure permits skipping over values efficiently. Each stripe records the minimum and maximum values for each column, allowing stripes to be skipped entirely for some queries.
The library one uses to interface to ORC files supports predicate pushdown, which is a time-honored technique for efficient query processing: it moves the logic as close to the data as possible.
ORC also supports partitioning. It uses the same partitioning scheme as other Hive file formats: One defines one or more columns as “PARTITIONED BY” columns. A separate set of ORC files is created for each distinct combination of values of the partitioning columns. Those column values are encoded in the directory path of a given partition rather than stored in the files themselves.
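As a sketch, here is what a partitioned ORC table declaration looks like in Hive DDL (the table and column names are illustrative):

```sql
-- Each distinct sale_date value gets its own partition directory,
-- e.g. .../sales_orc/sale_date=2016-05-01/ , holding that
-- partition's ORC files.
CREATE TABLE sales_orc (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;
```

Note that sale_date does not appear in the column list proper; as a partitioning column, its value lives in the directory path, not in the ORC data files.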
These capabilities – columnar store, row striping, sophisticated compression, predicate pushdown and partitioning – make analytical query processing very efficient: A sophisticated query engine can read just the rows and columns needed and do so quickly.
EsgynDB is just such a query engine. EsgynDB descends from a long line of database engines with advanced parallel query support. EsgynDB already handles partitioning, for example with the HBase storage engine. With ORC support, code has been added to take advantage of the ORC partitioning structure to eliminate unnecessary partition access at both compile time and run time. Further, EsgynDB handles the remaining partitions in parallel. EsgynDB’s predecessor products supported storage engines with predicate pushdown, so adding predicate pushdown for ORC was an easy extension to EsgynDB’s architecture. EsgynDB already knows how to select only the columns that it needs, so it is easy to pass that information down to the ORC level, taking advantage of ORC’s columnar structure. Compression is something EsgynDB can take advantage of mostly for free; the only concern an ORC client may have is in computing the cost of an ORC access, which is affected by the compression.
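A single query can exercise all of these optimizations at once. In this illustrative example (table and column names are hypothetical, and assume a table partitioned by sale_date), the partition predicate, the column list, and the value predicate each map to a different ORC capability:

```sql
-- Partition elimination: only the sale_date = '2016-05-01'
--   partition's files are opened.
-- Columnar projection: only the customer_id and amount columns
--   are read from each stripe.
-- Predicate pushdown: the amount > 1000 test can be evaluated by
--   the ORC reader, skipping stripes whose max(amount) <= 1000.
SELECT customer_id, amount
FROM sales_orc
WHERE sale_date = '2016-05-01'
  AND amount > 1000;
```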
Put simply, EsgynDB and ORC files are a natural, easy fit with each other.
You might ask, why not just use the Hive query engine? The answer: You certainly can. However, the Hive engine comes nowhere near the maturity that the EsgynDB engine delivers. There are two aspects to this. One is that Hive is not a standard SQL implementation. Many aspects of its syntax are non-standard, exposing details that a sophisticated engine would hide from its users. This reduces portability. Its SQL support is also incomplete; it lacks many ANSI features. The other aspect is performance. Hive’s run-time engine and optimizer have great room for improvement.
To drive home these points: Esgyn recently ran an internal benchmark comparing its performance with Hive’s on ORC files. The benchmark used the TPC-DS database and query set. EsgynDB can run all 99 queries, with either no modifications or minor ones that are both ANSI compliant and permitted by the TPC-DS specification. Hive can run only 65 of the queries. Comparing performance of just those 65 queries, EsgynDB ran five times faster.
Now that I’ve explained some of the benefits of combining EsgynDB and ORC, I’d like to describe two use cases that this combination enables: Hybrid Transactional/Analytical Processing (HTAP) and Multi-Temperature Data.
According to Wikipedia, HTAP is the ability of a single database to perform both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). The idea is to support transaction workloads on the same platform as analytical processing. This can be very tricky. Some of the issues involve the storage engines, and some involve the database engine itself.
In the storage engine area, supporting these mixed workloads is a matter of picking the right tool for the job.
For transactional workloads, you need a storage engine highly optimized toward small accesses, a few rows at a time. You need to be able to insert, modify, or fetch a few rows out of one or more very large tables. This requires some sort of highly efficient index structure. You also need to be able to do this using ACID transactions. So, you need a key-value store with an integrated transaction engine. Apache Trafodion (which EsgynDB is based upon) provides this: Trafodion tables are stored in HBase, and the Trafodion transaction engine is integrated into the HBase Region Server processes via HBase co-processors.
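A minimal sketch of such a transactional workload, in Trafodion-style SQL (the table and column names are hypothetical): each statement touches only a few rows via the table’s key, and the whole unit commits atomically under the integrated transaction engine.

```sql
-- Classic OLTP pattern: transfer funds between two accounts as
-- one atomic unit. Either both updates commit, or neither does.
BEGIN WORK;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 42;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 99;
COMMIT WORK;
```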
For analytical workloads, you need a storage engine that can efficiently access massive amounts of data from large tables, on a row or column basis. As we already discussed, ORC provides this.
I should note that there have been attempts at a one-storage-engine-fits-all approach. These attempts so far have compromised efficiency on either the transactional or the analytical side, a best-of-neither result. Nevertheless, should a successful approach emerge, EsgynDB would be poised to take advantage of it.
In the area of the database engine itself, to support HTAP, one needs a query engine that supports fast paths when just a few rows are being directly accessed, and analytical paths that can process large numbers of rows efficiently. Moreover, one needs a query optimizer that selects the appropriate paths based on the specific characteristics of the query. This is something EsgynDB supports very well; it benefits from over 20 years of research and development on itself and its predecessor products in this regard.
So, one can store transactional data in HBase, and analytical data in ORC, and use EsgynDB to query and manage both. Moreover, EsgynDB can combine the two: A given query can reference both Trafodion (and native HBase) tables as well as ORC tables.
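As an illustrative sketch (table names are hypothetical, and the catalog notation assumes Trafodion-style naming, under which Hive tables are typically visible through a hive catalog), a single query can join an HBase-backed transactional table with a Hive ORC table:

```sql
-- Cross-engine join: orders is a Trafodion/HBase-backed table;
-- hive.hive.sales_history is a Hive ORC table exposed through the
-- hive catalog. The engine reads each side with the access method
-- suited to its storage engine.
SELECT o.order_id, o.status, h.amount
FROM orders o
JOIN hive.hive.sales_history h ON h.order_id = o.order_id;
```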
“Multi-temperature data” refers to the idea of a collection of data where some of the data is accessed frequently (“hot data”) while the rest of the data is accessed rarely (“cold data”). Typically, there is much more cold data than hot. An example might be transactional data accumulated over time. A SALES table might be very hot when it concerns sales made today, warm concerning sales over the last week or month, and cold when it concerns sales made over the last year or longer. Another example might be log data. Recent events are referred to frequently while older events are queried for forensic purposes or to spot long-term trends.
To support hot data, one needs a storage engine that tends to keep hot data cached in memory. HBase does this. To support cold data, one needs a storage engine that can keep large amounts of data organized by temperature (the “temperature” often being some sort of date field). To support both well, one needs an efficient mechanism to move cooling data from one storage engine to another, incrementally, and needs a temperature-aware query engine that accesses data in a manner optimized to its temperature.
HBase is a great storage engine to store hot data. HBase keeps the newest and the most frequently accessed data in memory, in its “memstore”. As data begins to cool, HBase writes it out as sorted log files, which are combined and compacted via background processing. As data cools further, one can take ranges of it via EsgynDB, moving it from HBase to ORC. This can be done using INSERT/SELECT statements (or variants such as UPSERT USING LOAD). The EsgynDB engine generates plans with a suitable degree of parallelism to move such data quickly and efficiently.
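The cooling-data movement described above might look like the following, sketched in Trafodion-style SQL (table names, the hive catalog notation, and the date arithmetic are illustrative assumptions, not a prescribed recipe):

```sql
-- Copy rows that have cooled past 30 days from the hot,
-- HBase-backed table into a Hive ORC table. The engine can
-- parallelize this INSERT/SELECT across partitions.
INSERT INTO hive.hive.sales_cold
  SELECT * FROM sales_hot
  WHERE sale_date < CURRENT_DATE - INTERVAL '30' DAY;

-- Then remove the migrated rows from the hot side.
DELETE FROM sales_hot
  WHERE sale_date < CURRENT_DATE - INTERVAL '30' DAY;
```

In practice one would run such a migration incrementally, on a recurring schedule, so that only a small range of rows moves at a time.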
From an application perspective, one can define a view within EsgynDB that incorporates both the hot and cold data. End users can simply query the data, without regard to which storage engine contains it.
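A hedged sketch of such a view (the table names and the hive catalog notation are hypothetical): the hot rows live in an HBase-backed table and the cold rows in a Hive ORC table, but applications see one logical table.

```sql
-- End users query all_sales; the engine routes each branch of the
-- UNION ALL to the appropriate storage engine, and partition and
-- predicate pushdown still apply on the ORC side.
CREATE VIEW all_sales AS
  SELECT * FROM sales_hot
  UNION ALL
  SELECT * FROM hive.hive.sales_cold;
```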
So much of the success of Apache Hadoop has come from the integration of the right tools for the right job. EsgynDB, with its tight integration with HBase and with ORC, has what it takes to become the premier query engine for transactional and analytical workloads in Apache Hadoop.
To learn more, please contact us at any time.