Often, people ask us how Apache TrafodionTM (incubating) stacks up against databases like Oracle, IBM DB2, Microsoft SQL Server, Informix, MySQL, PostgreSQL, Teradata, and other relational databases on the one hand, and against SQL-on-Hadoop solutions like Impala, Tez, Hive, Drill, Presto, and others, on the other.
Apache Trafodion is a world class database and stacks up very well against the relational databases listed above. Each database has features and capabilities that another does not have. If you were moving from one database technology to another, then you would have the same challenges as you would moving to Trafodion, unless you religiously use only ANSI SQL capabilities. But, if you were developing a new application on Trafodion, then you would find that it supports all functionality typically needed of a database.
Compared to other SQL-on-Hadoop technologies Trafodion has a superior database engine, with a lot more features and capabilities. The only difference is in the kinds of workloads that Trafodion is currently optimized for. This is not a lack of capability. Instead, the current focus of the project using Apache HBaseTM is more suited for operational workloads. The same database technology, running on a different storage engine, is the power behind a huge EDW for HP that comprises of three 250TB systems running tens of thousands of ELT and reporting queries a day, hundreds of them running concurrently.
But there is no simple answer. I hope that this post can throw some light on why you would use Trafodion.
Workloads from OLTP to Analytics
Enterprise IT database deployments vary in the kinds of workloads they handle. One simplification has been to classify these workloads as Operational, Business Intelligence, and Advanced Analytics. Operational usually implies Online Transaction Processing (OLTP) or Operational Data Store (ODS). To simplify the terminology, though there are numerous exceptions to any such simplification, operational workloads help run the day to day operations of the company, while BI and Analytic workloads help drive more efficiency or improvements in the business towards more customer satisfaction, product and service improvements, higher revenue, better profit margins, more growth, amongst other things. These workloads have been hosted by different software and hardware architectures, because of the various differences between these workloads, and the optimizations possible for specific workloads. One can argue where the lines are drawn between these workloads, but the differences between them could probably be categorized as follows:
Given this spectrum of workloads the typical deployments we have seen in the past are as follows:
Most operational deployments are on SMP architectures with shared cache and disk. There are exceptions such as NonStop with its scalable MPP architecture. From a database perspective you typically see a relatively normalized data model deployed on Oracle or Oracle RAC, IBM DB2, Microsoft SQL Server, and open source databases, such as MySQL and PostgreSQL. The applications hosted in these environments are OLTP or mostly transactional in nature.
The data is then replicated to the Business Intelligence environment where it is transformed to a dimensional model. Because the data is historical and parallel access is key to good performance, these systems are generally deployed on MPP architectures such as Teradata and Oracle Exadata.
Operational Data Stores are somewhere in between OLTP and BI. They could be deployed on either of these sets of platforms, depending on the nature of the workload that will be hosted by the ODS. One can characterize the difference between ODS and BI workloads in that ODS workloads involve queries that are specific to a customer, supplier, product, etc. So in that sense they are a lot more contained in their access to the database. Whereas BI workloads mostly cross specific customer, supplier, and product type boundaries, accessing a lot more data on multiple dimensions, with more complex queries than a typical ODS may see.
From BI then if you want to do advanced analytics such as applying statistical analysis, data mining algorithms, machine learning algorithms, time series analysis, sessionization, etc., data is replicated, and perhaps aggregated, to analytic platforms optimized for such workloads. These are MPP deployments and offer columnar storage, pushing parallel analytics into the database, and other capabilities to facilitate advanced analytics. Examples of these are Vertica, Aster Data, Greenplum, and Netezza.
BI and Analytics for Big Data
With the advent of Big Data, a variety of data, such as external social or IoT (Internet of Things) data, has become an integral part of Enterprises, providing further insights to improve business performance. This data can be structured, but is increasingly semi-structured or unstructured in nature. This could include text messages, Twitter messages, emails, audio or video files, or data from various devices. This data is increasingly being hosted on Hadoop platforms, that have an MPP architecture, are elastically scalable, with replication for high availability. Data maps to MVCC (Multi-version Concurrency Control) or append only Hadoop File System (HDFS) structures.
BI type reporting and analytics workloads have been the main focus of Hadoop deployments so far. These have been mostly complementary in nature to what has been running on the BI and Analytics platforms discussed earlier. The external data is generally complimentary to the structured data on those platforms. The structured data is generated within an Enterprise, generally emanating from the operational systems. In this area players such as Hortonworks with Apache Tez, Apache Hive, Cloudera with Impala, MapR with Drill, Presto, and Databricks with Spark are offering various capabilities, leveraging underlying structures such as JSON document stores, Avro, ORC Files, and Parquet (the last two being columnar stores).
But as the Hadoop platform and these tools have matured over time, the need to integrate some of the structured data from the existing BI and Analytics platforms, with the data on Hadoop, to gain better insights has increased. Because of the lower Total Cost of Ownership (TCO) of Hadoop and open source deployments, some of these workloads are being offloaded from the typical BI and Analytics platforms onto Hadoop.
OLTP and ODS now on Hadoop
But what about operational workloads? Many operational workloads, be they transactional workloads at web-scale, or operational data store workloads, can also exploit the Hadoop platform for lower TCO, elastic scale, and replication for high availability. There are Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Manufacturing Resource Planning (MRP), Supply Chain Management (SCM), Financial Resource Management (FRM), and Human Resource Management (HRM) systems, which often need to scale beyond SMP systems at a lower Total Cost of Ownership (TCO).
Hadoop would help extend the capabilities, or offload workloads from these mission critical environments. Moving less mission critical OLTP workloads / applications to Hadoop would contain the growth of these proprietary deployments, and guarantee better SLAs for mission critical applications running in those environment. On Hadoop an operational data store deployment can also provide customers access to more historical data than may currently be available to them. Over time, as customers become more comfortable, they can start deploying mission critical applications on Hadoop as well.
That is the workload Apache TrafodionTM (incubating) is optimized for. Trafodion provides full ACID transactional capabilities across multiple rows, table, and statements. It provides extensive ANSI SQL support, so that you can leverage current SQL skills and tools, and make it easier to transform and modernize current applications, or build new ones. It has a very sophisticated database engine to handle workloads that span transactional updates, operational queries, all the way to reporting workloads leveraging Apace HBaseTM as the storage engine.
Apache TrafodionTM value proposition
You might consider transforming or modernizing your application in order to provide more capabilities, and a richer interface, to your customers or internal business users, by more easily integrating semi-structured and unstructured data with your structured data, since all the data is on the same platform. Also, as soon as the data is captured in Trafodion, it is available with no latency for reporting and analytics by leveraging the entire Hadoop eco-system. For closed-loop analytics you can generate analytical models and then deploy them in applications hosted by Trafodion. There is no replication of data necessary from other platforms. The reference data can be shared across operational, historical, and analytics workloads. BI and Analytics can gain more insight since the operational dimension of the data is available, as is the historical and external context. It also provides the ability to quickly take operational actions based on the intelligence derived from BI and analytics. This is over and above the elastic scale and the low TCO benefits inherent in Hadoop.