Using Machine Learning Libraries in EsgynDB

Machine Learning (ML) libraries have become very popular, and there is a wide variety of them – Wikipedia names 49 [1]. These ML libraries need data, and often that is business data stored in RDBMSs like EsgynDB or kept in some other form in a Hadoop data lake.

Simple Integration – JDBC and HDFS

There are several ways to connect ML libraries and EsgynDB. One is JDBC, a method supported by most software packages: the user runs the ML code on a system of their choice and reads the data via JDBC. If there is a lot of data to transfer, it helps to read through more than one connection, and some systems support this by splitting the SQL query into multiple parallel instances. For example, Spark's JdbcRDD [2] allows the query to be parameterized, which works very efficiently with salted EsgynDB tables.
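A minimal sketch of such a parameterized query (table and column names are hypothetical; salted Trafodion/EsgynDB tables expose the salt bucket in a system column named "_SALT_"):

    -- Sketch only; table and column names are hypothetical.
    -- JdbcRDD substitutes non-overlapping ranges of salt values
    -- for the two parameters.
    SELECT id, feature1, feature2
    FROM trafodion.ml.training_data
    WHERE "_SALT_" BETWEEN ? AND ?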

Apache Spark will start multiple instances of this query, using non-overlapping ranges of values.

The reverse direction, moving result data from the ML library to EsgynDB, is usually handled by exporting the data to a Hive table in HDFS, which EsgynDB can read directly.
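A minimal sketch of pulling such results back into a native EsgynDB table (table names are hypothetical; tables in the default Hive schema appear in EsgynDB under the hive.hive schema):

    -- Sketch only; table names are hypothetical.
    INSERT INTO trafodion.ml.results
      SELECT * FROM hive.hive.ml_results;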

Parallel ML Instances on the EsgynDB Cluster

In this blog post, we want to focus on another method of integration, one where we run ML library programs in parallel on an EsgynDB cluster, using EsgynDB data as input and processing the result in EsgynDB as well. The general approach is the same for different ML libraries and is similar to MapReduce: the ML program is structured into pieces that can act as mappers or reducers.

Three Building Blocks

To create such a setup, we combine three building blocks:

The first building block is an ML program that takes a data frame as input and produces a data frame as output. This program is written like a mapper or a reducer in MapReduce, so that many instances of it can run in parallel without needing to exchange state information. For example, the program could perform a clustering algorithm on logically independent parts of the data.

A data frame [3][4] is a concept present in several ML packages, and it is roughly equivalent to a table in SQL. Data frames consist of named columns with a specified data type. They are also organized into rows, since every column has the same number of elements.

The second building block is a table-mapping user-defined function (TMUDF), provided by Esgyn for Apache Trafodion [5][6], that integrates the ML program with EsgynDB. It feeds the required data in text form to the standard input of the program and receives the result, also in text form, from the program's standard output. Many ML libraries have built-in support for reading and writing data frames in comma-separated (or otherwise delimited) format.

Finally, we add a driving SQL query to complete the ML scenario. This driving query produces the data needed for the ML program, invokes the UDF, and processes the result of the UDF in the way desired by the user. When the ML program acts as a reducer, the driving query also defines how to partition the data. In cases where the ML library has no native support for scale-out clusters (e.g. R), this allows you to parallelize the algorithm and run it on a much larger data set than would otherwise be possible.
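As a sketch of what such a driving query could look like, assume a reducer-style clustering program wrapped by the FilterProg-style UDF of [6] (all table, column, and program names here are hypothetical, and [5] describes the exact TMUDF invocation syntax):

    -- Sketch only; all names are hypothetical.
    -- PARTITION BY sends each group of rows to one instance of the
    -- ML program, which reads them from stdin and writes its result
    -- to stdout.
    SELECT *
    FROM UDF(filterprog(TABLE(SELECT group_col, feature1, feature2
                              FROM trafodion.ml.input_data
                              PARTITION BY group_col),
                        'cluster_prog.R'));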

Example: TensorFlow Neural Network

In this example, we took a neural network example from Google TensorFlow [7] that recognizes hand-written digits. We trained this network on a separate machine and then deployed it to EsgynDB, where we used it to recognize digits in parallel. This is what the driving SQL query looks like for this example:
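What follows is a sketch: the table and column names and the argument list of filterprog are hypothetical, and [5] and [6] describe the exact TMUDF and FilterProg syntax.

    -- Sketch only; names and the filterprog argument list
    -- are hypothetical.
    SELECT image_num, digit
    FROM UDF(filterprog(TABLE(SELECT image_num, image_vector
                              FROM trafodion.ml.images),
                        'recognize_digits.py'));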

The integrating UDF is called "filterprog". Its table-valued input supplies the vectorized images, and the query returns the image numbers together with the recognized digits.

The TensorFlow program reads image numbers and image data from stdin, performs image recognition, and then prints the recognized digits, along with the image numbers, to stdout, where they are picked up by EsgynDB. The EsgynDB optimizer parallelizes this query as needed.

Note that this operation can be integrated into a more complex SQL query, with joins, unions, aggregations, etc. being performed before and after the ML code. Multiple ML steps can also be combined into a single driving query.
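As a sketch of such an embedding (reusing the hypothetical names from the query above), the UDF result can be aggregated like any other table expression:

    -- Sketch only; counts how often each digit was recognized.
    SELECT digit, COUNT(*) AS num_images
    FROM UDF(filterprog(TABLE(SELECT image_num, image_vector
                              FROM trafodion.ml.images),
                        'recognize_digits.py'))
    GROUP BY digit
    ORDER BY digit;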

References:

[1] Wikipedia, Machine Learning: https://en.wikipedia.org/wiki/Machine_learning#Software

[2] Spark JdbcRDD: https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/rdd/JdbcRDD.html

[3] Data frames in R: http://www.r-tutor.com/r-introduction/data-frame

[4] Data frames in Spark: https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes

[5] Trafodion table-valued UDFs: https://cwiki.apache.org/confluence/display/TRAFODION/Tutorial%3A+The+object-oriented+UDF+interface

[6] Integrating UDF (FilterProg): https://cwiki.apache.org/confluence/display/TRAFODION/Trafodion+Code+Examples#TrafodionCodeExamples-FilterProg

[7] Google TensorFlow MNIST example: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros

April 18, 2018

About the Author:

Hans is a senior software engineer at Esgyn.