Artificial Intelligence in EsgynDB


Overview

Machine Learning (ML) libraries are increasingly used in a variety of scenarios to harness the tremendous power of, and advances made in, Artificial Intelligence (AI). In a business context, there is a natural synergy between ML libraries and the business data stored in a database, such as a Hadoop Data Lake managed by EsgynDB. This guide explains how EsgynDB supports AI.

Integration Options

There are multiple ways to connect an ML library with EsgynDB:

  1. ML program consumes EsgynDB-produced data

Data from EsgynDB, either generated by EsgynDB queries or hosted in the database, serves as the input to the ML program. The data may be read through standard interfaces such as JDBC or ODBC, and consumed either serially or in parallel via parameterized queries; a minimal sketch of this approach appears after this list.

  2. ML program and EsgynDB interact via Hadoop

In this scenario, the ML program outputs its data to HDFS or Hive in order for it to be consumed by EsgynDB, or vice-versa.

  3. EsgynDB drives ML program

The ML program is embedded in a parallel EsgynDB query that runs on each node of the EsgynDB instance. This option leverages the EsgynDB platform's ability to parallelize the work.
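To make option 1 concrete, the sketch below shows a Python program pulling its input from EsgynDB over ODBC and handing it to an ML library. The DSN name, credentials, table, and column names are placeholders, and pyodbc and scikit-learn merely stand in for whichever ODBC driver wrapper and ML library you actually use.

    import pandas as pd
    import pyodbc
    from sklearn.linear_model import LogisticRegression

    # Connect through the EsgynDB ODBC driver; the DSN name and credentials
    # are placeholders configured outside this sketch.
    conn = pyodbc.connect("DSN=EsgynDB;UID=ml_user;PWD=ml_password")

    # Pull the result of an EsgynDB query into a pandas data frame.
    # Table and column names are illustrative only.
    df = pd.read_sql("SELECT feature1, feature2, label FROM training_data", conn)
    conn.close()

    # Hand the data frame to any ML library, here scikit-learn.
    model = LogisticRegression().fit(df[["feature1", "feature2"]], df["label"])

For parallel consumption, each instance of the program would issue a parameterized variant of the query that selects only its share of the data.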

The present article focuses on option 3.

Parallelization of ML program

In a large cluster hosting EsgynDB, the ideal integration with an ML program occurs when the program is parallelized, and the power of the cluster is utilized for optimal efficiency. EsgynDB provides the input data to the ML program and the result is processed in EsgynDB as well. This approach works for most common ML libraries, such as TensorFlow, R, and Spark.

First, develop an ML program that takes a data frame[1] as input and produces a data frame as output. This program will be conceptually like a mapper or a reducer in MapReduce, such that many instances of the program can run in parallel, without the need to exchange state information between the instances.

For example, the program could perform a clustering algorithm on logically independent parts of the data.
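As an illustration, the sketch below shows such a mapper-style program in Python: it reads delimited rows from standard input, clusters the rows it receives, and writes delimited rows to standard output. The column names, the use of pandas and scikit-learn, and the cluster count are assumptions made for this sketch rather than part of any EsgynDB interface.

    import sys
    import pandas as pd
    from sklearn.cluster import KMeans

    # Each parallel instance sees only its own share of the data on stdin.
    df = pd.read_csv(sys.stdin, names=["region", "x", "y"])

    if not df.empty:
        # Cluster the points seen by this instance; no state is shared with
        # other instances, so many copies can run side by side.
        df["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(df[["x", "y"]])
        # Emit the result as delimited text for the UDF to turn back into rows.
        df.to_csv(sys.stdout, header=False, index=False)

Because the program is a plain stdin-to-stdout filter, it can be tested on its own with a CSV file before being hooked up to the UDF.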

Next, add a user-defined function (UDF) in EsgynDB that drives the ML program. The UDF feeds the required data in text form to the standard input of the program and receives the result in text form from the program's standard output. Many ML libraries have built-in support for reading and writing data frames in comma-separated (or otherwise delimited) format.

Finally, add a driving query to complete the ML scenario. This driving query produces the data needed for the ML program, invokes the UDF, and processes the result of the UDF in the way desired by the user.
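One possible shape of such a driving query is sketched below, issued here from a Python client over ODBC. The UDF name, table, and partitioning column are placeholders, and the TABLE(... PARTITION BY ...) invocation follows the table-mapping UDF syntax of Apache Trafodion, on which EsgynDB is based; check the exact form against your EsgynDB release.

    import pyodbc

    conn = pyodbc.connect("DSN=EsgynDB")  # DSN is a placeholder

    # The inner SELECT produces the input data, the UDF invocation runs the
    # ML program on each partition, and the outer GROUP BY stands in for
    # whatever post-processing the user wants.
    driving_query = """
        SELECT region, cluster, COUNT(*) AS num_points
        FROM UDF(cluster_points(
                 TABLE(SELECT region, x, y FROM sensor_readings
                       PARTITION BY region)))
        GROUP BY region, cluster
    """

    for row in conn.cursor().execute(driving_query):
        print(row)
    conn.close()

The PARTITION BY clause determines how the input rows are split across the parallel instances of the UDF, and therefore across the parallel copies of the ML program.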

In cases where the ML library has no native support for scale-out clusters (for example, R), this approach allows you to run on a much larger data set than would otherwise be possible.

[1] A data frame is a concept present in several ML packages, and it is roughly equivalent to a table in SQL. Data frames consist of named columns with a specified data type, and can also be viewed as a collection of rows.

Examples

In the following examples, input data is produced in parallel and fed into a parallel EsgynDB table-mapping UDF (TMUDF). The TMUDF sends the data as delimited text to the stdin of the Analytic Function (AF) program, which runs as a separate process. The output is received in delimited text form and turned back into a relational result.

Code for these examples is available from Esgyn on request.

Map Starbucks locations using R

This example uses an R program running in parallel on the nodes of an EsgynDB instance to map the locations of Starbucks stores by country. It demonstrates two aspects:

  1. Access to EsgynDB data from an R program running anywhere.
  2. Running an R program in parallel on an EsgynDB cluster, using a UDF.

Machine vision using Google TensorFlow

This example uses the Google TensorFlow library to train and recognize images of hand-written digits. Images are based on the MNIST data set[1] and stored in EsgynDB tables.
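The sketch below shows only the TensorFlow piece of such a program: a small tf.keras classifier for hand-written digits. In the actual example the images are read from EsgynDB tables; here the MNIST data set bundled with Keras is loaded directly so the sketch stays self-contained, and the network layout is illustrative rather than the one used in the Esgyn example.

    import tensorflow as tf

    # Load the bundled MNIST images (28x28 grayscale digits) and scale to [0, 1].
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # A small fully connected classifier; layer sizes are illustrative.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Train and evaluate; in the full example the training data would instead
    # be streamed to each parallel instance through the UDF.
    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test)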
