People often ask how Trafodion is different from other SQL-on-Hadoop solutions. In a previous blog “Operational workloads now on Hadoop”, I explained how Trafodion focuses on operational (OLTP and Operational Data Store) workloads, and not as much on Business Intelligence or Advanced Analytics workloads. In this blog, I cover how Trafodion is technically different from other SQL-on-Hadoop solutions, regardless of the workloads they target.
In this post, I discuss the four major ingredients in the recipe for a world-class database, and demonstrate how Trafodion measures up on those. You can then assess how other databases stack up against this world-class database.
Time, Money, & Talent
Take any world-class RDBMS out there that is worth its salt, such as Oracle, SQL Server, DB2, or Teradata. It took these companies decades of person-years, millions of dollars, and a boatload of database talent to build their database engines. Tandem-Compaq-HP invested 20+ years and upwards of $300 million in the IP that ultimately resulted in Trafodion. This is a world-class database engine that, in its previous incarnations as NonStop SQL/MX and Neoview, competed with the likes of Oracle and Teradata.
More importantly, it is very hard to find software engineers who really know how to build highly available, scalable database engines. Most of our database development team started working on databases in the 1980s and have lived and breathed databases since. We grew up on a massively parallel shared-nothing architecture with a single system image that spanned clusters. Parallelism, scale, and mission-critical high availability flow through our veins. We have 300+ years of database development under our belts. The database was built from the ground up as an MPP engine for OLTP in 1987, as NonStop SQL. Since then, we have rewritten and tuned it for both OLTP and BI workloads, as we demonstrated with Neoview. This team of developers really knows how to build rock-solid, enterprise-level MPP databases. The world of NoSQL and Hadoop only recently woke up to the benefits of running on a shared-nothing MPP architecture. It is amazing that we were able to convince HP to open source this incredible IP, giving Trafodion an unfair advantage over the competition. Trafodion is a helluva deal!
The time, talent, and money invested show up in the functionality of the product: its ANSI support, the ANSI and non-ANSI functions it provides, and its performance, scalability, concurrency, throughput, stability, high availability, and transactional capabilities, along with myriad other capabilities across a multitude of workloads. Little of this is visible from just comparing syntax support.
You want a solid team behind the product you are using.
World Class Query Optimizer
Any database worth its salt has to have a solid optimizer. Without one, the simplest of queries can sometimes chew up your entire cluster, let alone complex queries, or the 200-way joins this optimizer has seen and optimized to beat Teradata in a POC. Just building a good optimizer is not enough. It has to be exposed to all kinds of different enterprise-class workloads so that it can be tuned to handle the complexities and differences inherent in those workloads. There is no question that Trafodion has one of the best optimizers in the industry. We have attended conferences on optimizers, and it has been clear that this is the case.
I could talk all about Cascades, on which our optimizer is based: that it is a cost-based, rules-driven optimizer with a list of patents to its credit, and so on. More interesting are the capabilities we added over the years where the optimizer reduces the search space of plans, for instance by recognizing star joins. By doing so, it can immediately eliminate a number of plans and home in on doing a cross-product of the dimension tables and then a nested join into the fact table. This is just one example of what we call Large Scope rules.
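To make the star-join rule concrete, here is a minimal sketch of the plan shape such a rule homes in on. This is purely illustrative: the function name, the dictionary layout, and the cost heuristic (smallest dimensions first) are my assumptions, not Trafodion internals.

```python
# Hypothetical sketch of a "Large Scope" star-join rule; names and
# structures are illustrative, not Trafodion's actual plan nodes.

def plan_star_join(fact, dimensions):
    """Emit the plan shape a star-join rule targets directly:
    cross-product the (small) dimension tables, smallest first,
    then nested-join the result into the large fact table."""
    # Sorting smallest-first keeps the intermediate cross-product tiny.
    dims = sorted(dimensions, key=lambda d: d["rows"])
    plan = dims[0]["name"]
    for d in dims[1:]:
        plan = f"(cross_product {plan} {d['name']})"
    return f"(nested_join {plan} -> {fact['name']})"

fact = {"name": "sales", "rows": 1_000_000_000}
dims = [{"name": "store", "rows": 500},
        {"name": "product", "rows": 20_000},
        {"name": "date_dim", "rows": 3_650}]
print(plan_star_join(fact, dims))
# (nested_join (cross_product (cross_product store date_dim) product) -> sales)
```

By committing to this shape up front, the rule skips enumerating the many join orders that interleave the fact table early, which is where the search-space savings come from.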
The compiler considers various join strategies. Nested or nested-cache joins are favored for operational queries, while merge and hybrid hash joins are leveraged for larger, more complex reporting queries. It considers serial and parallel plans based on cardinalities. We gather histogram statistics to understand the cardinalities well enough to generate good plans. These are equal-height histograms, which tell you much more about skew than equal-width histograms do. We have a patented feature called Skew Buster that eliminates skew at any point in the execution tree. Skew kills clusters: a single skewed query can peg one node, thereby affecting all queries running on that cluster. The query takes as long as that one node takes to finish, while the rest of the nodes have completed their part and sit idle. Skew Buster eliminates that problem.
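The point about equal-height histograms is worth illustrating. In the toy sketch below (my own simplified construction, not Trafodion's statistics code), each interval holds roughly the same number of rows, so a heavily skewed value shows up as intervals whose low and high boundaries collapse onto that value, which an equal-width histogram would smear away.

```python
# Illustrative sketch (not Trafodion code): equal-height histograms make
# skew visible as degenerate single-value intervals.

def equal_height_histogram(values, num_intervals):
    """Split sorted values into intervals holding ~equal row counts;
    a skewed value produces intervals whose bounds collapse onto it."""
    vals = sorted(values)
    per_bucket = len(vals) // num_intervals
    boundaries = []
    for i in range(num_intervals):
        lo = vals[i * per_bucket]
        hi = vals[min((i + 1) * per_bucket, len(vals)) - 1]
        boundaries.append((lo, hi))
    return boundaries

# 80% of rows carry the skewed key 7; the rest spread over 1..1000.
data = [7] * 8000 + list(range(1, 1001)) * 2
hist = equal_height_histogram(data, 4)
print(hist)
# [(1, 7), (7, 7), (7, 7), (7, 1000)]
```

The two `(7, 7)` intervals tell the optimizer exactly which value is hot, which is precisely the signal a skew-aware strategy like Skew Buster needs.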
The optimizer can:
- Un-nest subqueries
- Convert correlated subqueries into joins
- Push predicates to the lowest point in the execution tree at which they can be evaluated (now touted as something new in the Hadoop world)
- Optimize inner, left, right, and full outer joins
- Use a patented Multi-Dimensional Access Method (the granddaddy of skip scan) when needed
- Do eager aggregations to reduce message traffic
- Do hash Group Bys to avoid sorts
- Consider a number of sort avoidance strategies
- Leverage patented query plan caching techniques to avoid recompiling the query
- Apply myriad other capabilities too numerous to enumerate
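As one example from the list above, predicate pushdown can be sketched as a rewrite on a toy plan tree. Everything here is hypothetical, including the operator names and the dictionary-based plan representation; it only shows the idea of moving a filter to the lowest node that produces its column.

```python
# Hypothetical sketch of predicate pushdown on a toy plan tree; the
# node shapes are my own, not the engine's actual operators.

def push_down(plan, predicate, column):
    """Push a single-column filter down to the scan that produces
    that column, so rows are eliminated before flowing upward."""
    if plan["op"] == "scan":
        if column in plan["columns"]:
            return {"op": "filter", "pred": predicate, "child": plan}
        return plan
    plan = dict(plan)  # rewrite interior nodes without mutating input
    plan["children"] = [push_down(c, predicate, column)
                        for c in plan["children"]]
    return plan

plan = {"op": "join", "children": [
    {"op": "scan", "table": "orders",   "columns": ["o_id", "o_date"]},
    {"op": "scan", "table": "lineitem", "columns": ["o_id", "qty"]}]}
pushed = push_down(plan, "o_date > '2015-01-01'", "o_date")
# The filter now sits directly above the 'orders' scan, not above the join.
```

The join now sees only the orders that survive the date filter, which is exactly why evaluating predicates as low as possible cuts the data flowing through the rest of the tree.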
The Trafodion optimizer and database engine has been exposed and tuned for varied enterprise class workloads. The way to prove an optimizer’s capabilities would be to provide the kinds of varied workloads that trip optimizers. I am not talking about the TPC benchmark queries, since all databases run those and therefore have tuned their optimizers to handle them. The Trafodion optimizer will hold up very well against any other world class RDBMS in generating efficient plans for a wide variety of workloads.
World Class Parallel Data Flow Execution Engine
Generating a good plan is one thing. Without a sophisticated and efficient execution engine backing it up, however, it does not matter how efficient the optimizer's plan is. We have had a parallel data flow execution engine right from the outset. You might have heard the Spark, Tez, and Flink people talk about the DAGs and data flow engines they have created in recent years. We have had this engine since the inception of NonStop SQL/MX in the late 1990s. The beauty of Trafodion's implementation is that, unlike MapReduce, data flows through in-memory queues between operators. Each operator runs independently in parallel, called operator parallelism, while the streaming of data from one operator to the next is called pipeline parallelism. This is in addition to the partitioned data parallelism that Trafodion supports, leveraging the salting capability that was added to distribute data, and therefore I/O, across the regions in a balanced way. No intermediate results are materialized, except for blocking operators like sort. Even then, data is not written out to disk if the sort can be done in memory. Data also hits the disk when a hash join has very large data sets to process and encounters memory pressure that forces it to overflow to disk. It does so gracefully, and manages memory efficiently.
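The queue-based data flow idea can be sketched in a few lines. This is a deliberately minimal illustration, not the real executor (which is native code, not Python threads): two operators run concurrently, connected by an in-memory queue, so the filter consumes rows as the scan produces them, with nothing materialized in between.

```python
# Minimal sketch of operator and pipeline parallelism using in-memory
# queues between operators, in the spirit of a data flow engine.
# Illustrative only; Trafodion's executor is not built on Python threads.
import queue
import threading

def scan(out_q):
    for row in range(10):          # produce rows one at a time
        out_q.put(row)
    out_q.put(None)                # end-of-data marker

def filter_even(in_q, out_q):
    while (row := in_q.get()) is not None:
        if row % 2 == 0:           # works on each row as it arrives;
            out_q.put(row)         # no intermediate result materialized
    out_q.put(None)

q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [threading.Thread(target=scan, args=(q1,)),
           threading.Thread(target=filter_even, args=(q1, q2))]
for t in threads:
    t.start()
while (row := q2.get()) is not None:   # master consumes the final queue
    results.append(row)
for t in threads:
    t.join()
print(results)   # [0, 2, 4, 6, 8]
```

Both operators are alive at once (operator parallelism), and rows stream through the queues as they are produced (pipeline parallelism), which is the contrast with MapReduce's materialize-then-read handoff.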
The executor can execute OLTP queries that access a single row via a key, or a group of rows clustered together in a single region, using a serial plan. It might even use a serial plan to access multiple regions across multiple region servers if the expected cardinality of the rows is small. Even when there is a large number of rows to be processed across regions, with predicates and/or aggregations pushed down to coprocessors, the master executor may simply talk in parallel to each region server to process the query. But in cases where there are multiple joins, potentially on non-collocated columns (i.e., the primary key of one or both of the tables being joined is not the same as the join key), and a large amount of data is expected to be processed, it starts up multiple Executor Server Processes (ESPs) to help execute the query in parallel. Each ESP works on a part of the data in the table, in parallel with the other ESPs. There could be multiple levels of ESPs involved, depending on the tables' clustering keys, the join columns, and the number of joins in the query.
In the process, it chooses among various strategies to exchange data between ESPs, called a shuffle in the industry. It can hash- or random-repartition, or broadcast the inner or outer table to perform the join, or even use a combination strategy, as in Skew Buster.
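The contrast between a plain hash shuffle and a skew-aware one can be sketched as follows. The function names and the round-robin-plus-broadcast treatment of hot keys are my assumptions about the general technique, not a description of Skew Buster's patented internals.

```python
# Illustrative sketch of repartitioning ("shuffle") between ESPs, plus
# the hybrid idea behind skew-aware shuffles: spread rows with a known
# hot key across all ESPs (the matching inner-side rows would then be
# broadcast), while other rows hash-partition normally. Hypothetical code.

def hash_repartition(rows, key, num_esps):
    """Plain hash shuffle: a hot key lands every row on one ESP."""
    parts = [[] for _ in range(num_esps)]
    for row in rows:
        parts[hash(row[key]) % num_esps].append(row)
    return parts

def skew_aware_repartition(rows, key, num_esps, hot_keys):
    """Rows with a hot key go round-robin to all ESPs; others hash."""
    parts = [[] for _ in range(num_esps)]
    rr = 0
    for row in rows:
        if row[key] in hot_keys:
            parts[rr % num_esps].append(row)
            rr += 1
        else:
            parts[hash(row[key]) % num_esps].append(row)
    return parts

rows = [{"k": 7}] * 8 + [{"k": 1}, {"k": 2}]
parts = skew_aware_repartition(rows, "k", 4, hot_keys={7})
# The 8 skewed rows land 2 per ESP instead of all 8 on one ESP.
```

With the plain hash shuffle, all eight `k = 7` rows would land on a single ESP, recreating exactly the one-pegged-node problem described above.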
The number of ESPs used to execute an operation, the degree of parallelism, is a function of the cardinality expected at each operation. The optimizer thus chooses the most frugal strategy, from a resources perspective, to execute each operation in the query. In fact, based on the overall cardinality expected, it may determine a maximum degree of parallelism for the entire query. This patented technology is called Adaptive Segmentation. It uses only the number of nodes required to execute the query, instead of using every node to execute every query and operation. This not only minimizes the memory resources used but also substantially improves concurrency, allowing more queries to run concurrently on the cluster. It also provides a higher level of resilience to failure: a node failure impacts a far smaller set of queries than if every parallel query were using all the nodes. In other words, it leverages your hardware investment as efficiently as it can.
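The core idea of sizing the degree of parallelism by cardinality can be sketched with a toy formula. The sizing heuristic and its parameters here are invented for illustration; Adaptive Segmentation's actual cost model is not public in this form.

```python
# Hypothetical sketch of cardinality-driven degree of parallelism, in
# the spirit of Adaptive Segmentation. The rows-per-ESP target is an
# invented tuning knob, not a Trafodion parameter.

def degree_of_parallelism(expected_rows, rows_per_esp, max_nodes):
    """Use just enough ESPs to keep each one's share near rows_per_esp,
    capped by the cluster size, and never fewer than one."""
    needed = -(-expected_rows // rows_per_esp)   # ceiling division
    return max(1, min(needed, max_nodes))

# On a 100-node cluster, small queries should not fan out to 100 ESPs.
assert degree_of_parallelism(50_000, 100_000, 100) == 1
assert degree_of_parallelism(5_000_000, 100_000, 100) == 50
assert degree_of_parallelism(500_000_000, 100_000, 100) == 100
```

A query sized at 50 ESPs leaves the other half of the cluster free for concurrent queries, and a failure among those 50 nodes leaves queries running entirely elsewhere untouched, which is the concurrency and resilience argument made above.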
The Trafodion engine does all of the above, while:
- Enforcing data types and referential, unique, and check constraints during updates to ensure the integrity of the data
- Enforcing Grant/Revoke security, so only authorized users can update or access data
- Executing fast paths for OLTP versus reporting workloads
- Pre-fetching data when large scans are detected, in order to increase the parallelism between data access and the processing of data by the engine
- Efficiently evaluating expressions using pcode and LLVM to speed-up processing
- Providing myriad other capabilities too numerous to enumerate
A world-class database engine needs a parallel infrastructure similar to what the ESPs described above provide. In this day and age, the lines between OLTP and reporting workloads are getting increasingly blurred, and most applications need the breadth of capabilities Trafodion has in order to handle a mixed set of workloads.
World Class Distributed Transaction Management System
The last ingredient, but not the least: Trafodion leverages a completely distributed transaction management system. OMID, and similar HBase transaction managers, have a single transaction manager process coordinating transactions, and all the data context related to those transactions, across the entire cluster. That is simply not a scalable architecture to implement on an elastically scalable Hadoop cluster. It means a huge amount of messaging traffic between the regions and this transaction manager, bottlenecked on the single node managing the entire context. A failure of the transaction manager, or of the node it runs on, affects every transaction on the cluster. This architecture also cannot scale to handle hundreds or thousands of nodes, or long-running updates.
On the other hand, we have leveraged our previous IP, along with some HBase code that was not being used, to implement a fully distributed, completely scalable transaction management system. There is a transaction manager on each node, thus distributing transaction coordination to the node initiating the transaction (we balance transaction workloads across nodes). On a 100-node cluster, there are 100 transaction managers handling the transaction coordination workload in parallel; a single transaction manager is not managing transactions across all 100 nodes. The context for each transaction is managed in the HBase coprocessor at the region level. Each region manages transactional updates and conflict resolution of its data locally. Transactional updates across all regions are not sent to one single transaction manager to collate and to resolve conflicts.
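Region-local conflict resolution can be illustrated with a toy first-committer-wins check. This sketch is my own simplification: the class, timestamps, and write-set logic are generic optimistic-concurrency ideas, not the actual coprocessor implementation.

```python
# Toy sketch of region-local conflict detection: each region tracks the
# keys recently committed transactions wrote, and a committing
# transaction conflicts only with overlapping writes in that region.
# Purely illustrative; the real HBase coprocessor logic is far richer.

class Region:
    def __init__(self):
        self.committed_writes = []   # list of (commit_ts, frozenset of keys)

    def try_commit(self, start_ts, commit_ts, write_set):
        """Abort if any transaction that committed after our start
        wrote a key we also wrote (first-committer-wins)."""
        for ts, keys in self.committed_writes:
            if ts > start_ts and keys & write_set:
                return False                      # conflict: abort locally
        self.committed_writes.append((commit_ts, frozenset(write_set)))
        return True

r = Region()
assert r.try_commit(start_ts=1, commit_ts=5, write_set={"a", "b"})
# Overlapping key "b" committed at ts=5, after our start ts=2 -> conflict:
assert not r.try_commit(start_ts=2, commit_ts=6, write_set={"b", "c"})
# Disjoint write sets commit fine:
assert r.try_commit(start_ts=2, commit_ts=7, write_set={"x"})
```

Because each region answers this question for its own keys only, no single process has to see every write set on the cluster, which is the scalability point being made above.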
So, Time, Money & Talent, Sophisticated Optimizer, Parallel Data Flow engine, and a Distributed transactional management are the initial ingredients to build a world class database. We then garnished it with a huge number of capabilities. We have turned on many of those in our initial releases focused on operational workloads. Now you will see us turn on many more as we move forward.
There is another ingredient in the recipe that can be relevant in some comparisons.
Trafodion is open source. There are many benefits to being open source, the biggest being community involvement in development. With participation from customers, and from vendors using Trafodion in their offerings, development is accelerated. Users can customize the solution for their own needs and don't have to wait on us, or the community, for capabilities they need. They are in charge of their own destiny. The community, made up of users, has a lot of say in the direction of the product.
Certainly pricing could also be a factor, since open source products follow the open source model for pricing, similar to, say, Cloudera and Hortonworks. There are no initial licensing fees. Esgyn will have an Enterprise version of Trafodion that is thoroughly tested for integration with various OSs and distros, for high availability, and for performance, along with general quality assurance. Fixes related to these will ultimately be contributed back to Trafodion. Esgyn will charge for support on a per-node basis, plus any services and training the customer needs.
Certainly, you can Google "benefits of open source" and assess for yourself what the advantages, or even disadvantages, of open source software may be for you.