Parallelization of an ML program
In a large cluster hosting EsgynDB, the ideal integration with an ML program occurs when the program is parallelized and the power of the cluster is used for optimal efficiency. EsgynDB provides the input data to the ML program, and the result is processed in EsgynDB as well. This approach works for most common ML libraries, such as TensorFlow, R, and Spark.
First, develop an ML program that takes a data frame as input and produces a data frame as output. Conceptually, this program is like a mapper or a reducer in MapReduce: many instances of it can run in parallel without needing to exchange state information between instances.
For example, the program could perform a clustering algorithm on logically independent parts of the data.
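As a sketch of such a mapper-style program, the following Python script reads comma-separated rows from standard input, clusters each group of rows independently, and writes the result to standard output. The file name `cluster_groups.py`, the column layout `group,x,y`, and the hand-rolled k-means are all illustrative assumptions; a real program would more likely call an ML library for the clustering step.

```python
# Hypothetical mapper-style ML program ("cluster_groups.py", an assumed name).
# Reads comma-separated rows "group,x,y" from stdin, runs a tiny k-means
# within each group independently (no state shared across groups or program
# instances), and writes "group,x,y,cluster" to stdout.
import sys
import csv
from collections import defaultdict

def cluster_group(points, k=2, iters=5):
    """Assign each (x, y) point to one of k clusters; return a list of labels."""
    centroids = points[:k]          # naive initialization: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, (x, y) in enumerate(points):
            labels[i] = min(range(len(centroids)),
                            key=lambda c: (x - centroids[c][0]) ** 2 +
                                          (y - centroids[c][1]) ** 2)
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels

def main():
    # Collect the rows of each logically independent group.
    groups = defaultdict(list)
    for group, x, y in csv.reader(sys.stdin):
        groups[group].append((float(x), float(y)))
    # Cluster each group on its own and emit the labeled rows.
    writer = csv.writer(sys.stdout)
    for group, points in groups.items():
        for (x, y), label in zip(points, cluster_group(points)):
            writer.writerow([group, x, y, label])

if __name__ == "__main__":
    main()
```

Because each group is processed with no reference to any other group, any number of instances of this program can run side by side, each on its own slice of the data.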
Next, create a user-defined function (UDF) in EsgynDB that drives the ML program. The UDF feeds the required data in text form to the program's standard input and reads the result in text form from its standard output. Many ML libraries have built-in support for reading and writing data frames in comma-separated (or otherwise delimited) format.
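The following Python sketch illustrates the pattern the UDF implements: pipe delimited input rows to the ML program's standard input and read the delimited result back from its standard output. An actual EsgynDB UDF would do this through the database's UDF API rather than plain Python, and the program name `cluster_groups.py` is an assumption.

```python
# Illustration of the piping pattern a driving UDF would implement.
# Rows go to the child program as CSV text; its CSV output comes back as rows.
import csv
import io
import subprocess

def run_ml_program(rows, command=("python3", "cluster_groups.py")):
    """Send rows (lists of values) as CSV to the program; return its output rows."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    proc = subprocess.run(command, input=buf.getvalue(),
                          capture_output=True, text=True, check=True)
    return list(csv.reader(io.StringIO(proc.stdout)))
```

Because the exchange is plain delimited text on standard input and output, the same driving pattern works unchanged for an R script, a TensorFlow program, or any other executable that reads and writes data frames as delimited text.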
Finally, add a driving query to complete the ML scenario. This query produces the data needed by the ML program, invokes the UDF, and processes the UDF's result in the way the user desires.
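A driving query might look like the sketch below, assuming a table-mapping UDF invocation style like that of Apache Trafodion, on which EsgynDB is based. The UDF name `cluster_udf`, the table `sensor_readings`, and its columns are hypothetical; consult the EsgynDB UDF documentation for the exact syntax.

```sql
-- Hypothetical driving query: feed rows to the UDF, partitioned by group
-- so that independent partitions can be processed in parallel, then
-- aggregate the clustered result.
SELECT group_col, cluster_id, COUNT(*) AS points_in_cluster
FROM UDF(cluster_udf(TABLE(SELECT group_col, x, y
                           FROM sensor_readings
                           PARTITION BY group_col)))
GROUP BY group_col, cluster_id;
```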
In cases where the ML library has no native support for scale-out clusters (e.g., R), this approach lets you run on a much larger data set than would otherwise be possible.
A data frame is a concept present in several ML packages, and it is roughly equivalent to a table in SQL. Data frames consist of named columns with specified data types, and can also be viewed as a collection of rows.