Choosing the Right SQL Engine to Replace MapReduce Jobs


Everybody’s Doing It

Moving away from MapReduce has become a trend, driven by the desire to reduce the complexity of building and maintaining MapReduce jobs and to increase performance, all while leveraging existing IT resources. The move away from MapReduce has many angles and should be treated as a strategic decision: how to move away, what should replace MapReduce jobs, and for which workloads, while considering the strategic role of Hadoop in enabling data monetization for the enterprise.

MapReduce is a critical component in any Big Data initiative, as it holds the key to accessing the data stored in the Hadoop Distributed File System (HDFS). That means access to data is gated by the availability of expensive data scientists who have insight into the data and data engineers who can write MapReduce jobs. And because new data is ingested continuously, MapReduce jobs have to be constantly modified, a major maintenance expense that requires keeping a fleet of engineers on hand.

MapReduce thus stands between the data in Hadoop and the users and apps that could leverage it, thereby becoming a constraint in realizing and accelerating ROI on Hadoop initiatives.
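
To make that maintenance burden concrete, below is a minimal sketch of the classic word-count job in Hadoop's standard Java MapReduce API. The `words` table named in the SQL comment is hypothetical; the point is that a SQL engine expresses the same computation as a single statement, while the MapReduce version requires roughly fifty lines of Java that must be compiled, packaged, and redeployed whenever the logic changes.

```java
// Canonical Hadoop word-count job. The equivalent in SQL would be one line:
//   SELECT word, COUNT(*) FROM words GROUP BY word;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts emitted for each word.
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```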


Choosing Alternatives to MapReduce is not Just a Technology Decision

  • Need to do more with Big Data and Hadoop – As analytics and unstructured data continue to be the norm on Hadoop for now, there is a strong desire within the enterprise to find more uses for Hadoop deployments in order to justify the ROI. This makes enormous sense as enterprise data lakes are built and data governance policies are established; it is natural for different lines of business, as well as enterprise IT, to want access to the data lake.

  • Cost of Big Data initiatives – Even though most Big Data platforms and add-on tools are open source, and commercially supported versions cost a fraction of their proprietary counterparts, it is important to understand the major cost drivers of Big Data initiatives:
    • Cost of data engineers and scientists – Data engineers and data scientists are in short supply and costly. If you are looking for an experienced data team, the costs are even higher.
    • Cost of experimentation – With Big Data technology fragmented and many technologies sounding alike in their capabilities and claims, the cost of experimentation is skyrocketing. Many developers jump into testing new technologies to solve a specific problem without necessarily thinking about the bigger picture; even when a new technology works functionally, it often fails to perform at larger scale.
    • Cost of scale – While Hadoop and its cloud cousins like Amazon Web Services (AWS) and Microsoft Azure HDInsight are built for scale, over time the cost of scaling is proving to be higher overall, as there are more pieces of the Big Data puzzle that need to be accommodated with dedicated computing resources.

  • Migrating existing IT applications to Big Data and the data lake – At the moment, enterprises move data from the data lake to an RDBMS to deliver reporting applications. Ideally, there should be minimal need for data movement out of the data lake.

Finding Alternatives to MapReduce

There is more than one angle to consider when thinking about alternatives, and the decision may depend on the problem you want to solve. Ideally, you would weigh both current and future needs, as well as total cost of ownership, before making a decision.

Issue: Performance
As use cases and workloads demand low-latency responses, MapReduce, which is designed for batch processing and scale, falls short on performance.
Considerations:
  • Leverage state-of-the-art caching architectures
  • Minimize the need for additional hardware
  • Reduce the need for specialized programming languages
  • Use a parallel data architecture
  • Use in-memory execution
  • Reduce data movement, however temporary it might be

Issue: Time to value
The need for Java programs and the related complexity adds to delays in time to value. The longer it takes to create, test, and deploy MapReduce jobs, the longer it takes to build POCs, understand the results, and roll out apps into production, and thus to realize value from Big Data initiatives.
Considerations:
  • Move to a well-known query language like SQL
  • Enable faster experimentation for data scientists
  • Create and maintain queries using the existing tool set

Issue: Lack of skilled resources and cost of resources
Assembling a data team has become both time consuming and expensive.
Considerations:
  • Reduce dependency on data scientists and Java programmers
  • Leverage existing SQL developers to augment data teams

Issue: Going beyond analytics
While the current focus is primarily on complex, high-latency analytics in batch mode, many customers are thinking beyond analytics on Hadoop after the initial success of their Big Data initiatives and a solid understanding of what it takes to build and run a Hadoop deployment.
Considerations:
  • Unlock the data for different applications without additional costs
  • Minimize friction in migrating apps to the data lake
  • Find data infrastructure that supports operational, analytical, and transactional workloads

Why SQL Engines Make Sense as an Alternative to MapReduce

Because choosing an alternative to MapReduce is a strategic decision, it makes sense to focus on a proven and familiar category of technology where enterprises have already invested in tools, apps, and resources. Finding a SQL engine that can work on Hadoop makes a ton of sense for the following reasons:

  • Enterprises have a large number of IT resources trained in SQL, and the industry has more SQL developers than data scientists and Java programmers.
  • A large number of IT apps depend on SQL to connect to a database and access data. SQL engines can facilitate easier migration of these apps, as sketched after this list.
  • Given the right SQL Engine, you may be able to have an optimal combination of SQL + MapReduce to get the best of both worlds
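
As one illustration of that migration point: for a JDBC-based application, switching to a SQL-on-Hadoop engine can be largely a matter of changing the driver and connection URL. The sketch below uses Apache Hive's HiveServer2 endpoint as one example of such an engine; the host, port, credentials, and the sales table are placeholders, and the hive-jdbc driver must be on the classpath.

```java
// A hedged sketch: repointing an existing JDBC app at a SQL-on-Hadoop engine.
// Only the driver and URL change; the SQL and result handling stay the same.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoopMigration {
  public static void main(String[] args) throws Exception {
    // Before: a proprietary RDBMS URL, e.g. jdbc:oracle:thin:@dbhost:1521:orcl
    // After:  a SQL-on-Hadoop endpoint (HiveServer2 shown as one example)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://hadoop-edge-node:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")) {
      while (rs.next()) {
        System.out.println(rs.getString("region") + "\t" + rs.getLong("total"));
      }
    }
  }
}
```

The same property holds for ODBC-based reporting and BI tools, which is why a SQL engine lowers the barrier to moving existing applications onto the data lake.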

Finding the SQL Engine You Need Isn’t Easy

Enterprises have multiple options when choosing a SQL engine. Here are some considerations for finding the right SQL solution.

  • Capabilities – Since the switch to a SQL engine will have a lasting impact, find a SQL engine that can do more than what you are currently doing in terms of the types of workloads you run.
  • Performance – This is a key problem with MapReduce, so find a SQL engine that can support your performance needs. An MPP-capable (Massively Parallel Processing) SQL engine will serve you well.
  • Maturity – Creating a SQL engine isn't a one-year or two-year effort. When you look at proprietary RDBMS vendors like Oracle, Microsoft, HP/Tandem, and IBM, you see the years and hundreds of millions of dollars spent developing, perfecting, and optimizing an RDBMS.
  • Cost – As you try to do more with Big Data, cost becomes an important factor as you scale your deployment, not just in terms of data size but also in applications and usage.
  • Extensibility – As you adopt a SQL engine for the majority of use cases and workloads, you may not want to maintain both SQL and MapReduce jobs. Look for a SQL engine that can support analytics and run MapReduce-like jobs within its database engine.
  • Plays well with others – The SQL engine should support the various components and products within the ecosystem to enable a complete solution.
  • Cloud and on-prem – Though (almost) everyone is thinking about moving to the cloud, quite a few enterprises still deploy solutions on-prem. The SQL engine should support both types of deployment.

Download our complete checklist to compare different options for SQL-on-Hadoop solutions.

 

Our Take

As a team that has been at the forefront of the database industry for the last two decades and has served global enterprise customers, we are quite excited to see the new developments and innovation in Big Data giving new life to databases. The time has come for enterprises to take advantage of this revolution and accelerate the ROI on their Big Data investments. However, as with any strategy, you need the right tools, and we are proud of what we bring to the market: EsgynDB, an enterprise-class SQL database for Big Data powered by Apache Trafodion (Incubating). When you buy a car, it is important to know the engine in addition to the looks. That's why we put together a white paper on 13 questions you should ask your SQL vendors and what to look for under the hood.

Download our White Paper – Does your SQL Vendor Make the Grade?

