How to Design for IoT, Log and Event Data?
A common use case for Apache Trafodion in the IoT (Internet of Things), telecommunications, and internet security spaces is a single very large table that records real-time events. Customers want to ingest new data at a high rate, run queries against it, and age out obsolete data.
What we typically recommend for this situation is a design pattern consisting of three elements: salting, divisioning, and stripe compaction.
The first element, salting, creates an even distribution of data across the cluster. Salting is hash-based: it applies a hash function to compute a region number for each row, typically from one or more columns used in operational queries, such as a customer id or device id. Not only is the overall data evenly distributed, but the hot (most recent) data is also spread evenly across the nodes of your cluster.
Apache Trafodion manages the salt automatically: it computes the hash function and generates the predicates on the salt column, so SQL insert, select, and delete statements don't need to do anything special.
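As a sketch of what this looks like in DDL (the table and column names here are hypothetical, assuming Trafodion's SALT USING ... PARTITIONS ON clause):

```sql
-- Hypothetical event table, hash-partitioned into 16 salt buckets;
-- all rows for a given cust_id land in the same HBase region.
CREATE TABLE events
( cust_id               INT NOT NULL
, transaction_timestamp TIMESTAMP NOT NULL
, event_type            VARCHAR(32)
, PRIMARY KEY (cust_id, transaction_timestamp)
)
SALT USING 16 PARTITIONS ON (cust_id);
```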
Now that the data is evenly distributed, we face the next problem: choosing a good row key. To make querying, aging, and compaction easier, we want new rows to be appended at the end of each region, not inserted at random positions. We also want to be able to query time series data by date range, along with other important columns, such as a client or device id.
To achieve both of these goals, we introduce another key prefix right after the salt: a division id. A division is a time range, such as a day, week, or month. It groups together all the rows in that time range while still allowing us to choose a column like customer id as the leading user-defined primary key column.
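Putting salt and division together, such a table might be sketched as follows (names are hypothetical, and the monthly DATE_TRUNC expression is just one example of a division expression):

```sql
-- Hypothetical table: the salt column comes first in the stored key,
-- then the monthly division id, then the user-defined primary key.
CREATE TABLE events
( cust_id               INT NOT NULL
, transaction_timestamp TIMESTAMP NOT NULL
, event_type            VARCHAR(32)
, PRIMARY KEY (cust_id, transaction_timestamp)
)
SALT USING 16 PARTITIONS ON (cust_id)
DIVISION BY (DATE_TRUNC('MONTH', transaction_timestamp));
```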
A typical operational query in such scenarios is something like this:
select * from t
where cust_id = x and transaction_timestamp between y and z
Salting on cust_id makes sure that all the rows for customer x end up in the same HBase region, and Apache Trafodion ensures that the query accesses only that region server by adding a predicate on the salt column.
Divisioning on the timestamp column ensures that we need to read only the data in the relevant time range or ranges, not years of historical data. It also lets us use cust_id as the leading primary key column, so that we can quickly home in on the data for our customer among the data for millions of customers.
Sometimes we want to query multiple customers and multiple days of data. Trafodion's unique Multi-Dimensional Access Method (MDAM) makes that efficient by decomposing the more complex predicates into scans over only the pertinent key ranges, skipping the unneeded data in between.
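For instance, a query like the following (against a hypothetical events table keyed on cust_id and transaction_timestamp) can be answered with a handful of narrow range scans rather than a full table scan:

```sql
-- MDAM can decompose this into one narrow scan per combination of
-- salt value, division (month), and cust_id in the IN list.
SELECT *
FROM events
WHERE cust_id IN (1001, 1002, 1003)
  AND transaction_timestamp BETWEEN TIMESTAMP '2016-01-01 00:00:00'
                                AND TIMESTAMP '2016-03-31 23:59:59';
```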
So we have shown that salting and divisioning make it easy to insert data into a time-structured table, and that they allow queries to access just the data needed for specific time ranges and key columns.
There are still two problems looming, though: aging out data and HBase major compactions. Here, an HBase feature called stripe compaction can be very useful. It splits the data in an HBase region into multiple stripes, each covering a key range (say, a month of data per stripe). Files are compacted only within these stripes, which means that months of historical data that no longer change don't have to be rewritten by compaction over and over. They stay in place until they are aged out (i.e., deleted). At that point, some stripes become empty and are combined with newer, non-empty stripes.
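Stripe compaction is configured per table or column family in HBase. A sketch in the HBase shell (the table name is hypothetical; the property names come from the HBase stripe compaction feature):

```
alter 'TRAFODION.SEABASE.EVENTS', CONFIGURATION => {
  'hbase.hstore.engine.class' =>
      'org.apache.hadoop.hbase.regionserver.StripeStoreEngine',
  'hbase.hstore.blockingStoreFiles' => '100'
}
```

Raising hbase.hstore.blockingStoreFiles is commonly recommended when enabling stripe compaction, since the store file count is then effectively per stripe rather than per store.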
In summary: Apache Trafodion and HBase offer three powerful design elements to let you store very large amounts of time-based data in a table. Salting, divisioning and stripe compaction are transparent to SQL queries and allow efficient ingestion, querying and aging of data.