When Cloudera and Hortonworks announced their merger, bloggers masquerading as pundits were quick to beat their drums and claim that this spelled the slow demise of Hadoop in the face of the Cloud revolution: Hadoop was too complex, and the Cloud solutions were easy for people to use. But this analysis is too naive, and somewhat self-serving for the Cloud vendors. Just as it was in the best interest of Oracle, Teradata, and other vendors to fight the wave of Hadoop and the swell of open source revolution, since it was impacting their bottom line, now it is the turn of the Cloud vendors to make the same claims to boost their bottom lines. There is no question that the Cloud has gained unprecedented adoption, and its growth has accelerated by demonstrating value for enterprises big and small. But let us understand what is at play here:
- Complexity of Hadoop versus Simplicity of Cloud
- Cloud versus On-prem tradeoffs
- Hybrid Cloud / Multi-cloud trends
- Proprietary versus Open Source
Remember that LinkedIn, Facebook, Yahoo, Uber, Airbnb, and several other very large enterprises have huge investments in Hadoop and open source technologies to address their Big Data and Analytics needs.
Complexity of Hadoop versus Simplicity of Cloud
There is no question that early adopters on the bleeding edge of Hadoop, using open source Apache versions of Hadoop and other Big Data open source solutions, had a tough time integrating these technologies and making them work well operationally. But Cloudera and Hortonworks have had Hadoop distributions for a while now that go a long way toward integrating these important components for Big Data and Analytics. They have made Hadoop operationally much simpler to develop with, deploy, monitor, and manage.
The difference is not so much the complexity of Hadoop versus the simplicity of Cloud solutions, as it is the complexity of developing, deploying, monitoring, and managing on-prem environments, regardless of the technology being deployed, compared to Cloud environments. Cloud environments were not always mature enough to provide the value they do now, compared with having large teams managing on-prem deployments, let alone the infrastructure cost of hardware and data centers. That is, even though on-prem Oracle environments have perhaps had a longer run than Hadoop, it is still a lot simpler for a company to purchase that capability as a service from a Cloud vendor than to deploy Oracle on-prem.
Hadoop offered as a service in the Cloud is a lot easier for a company than trying to configure it, get it running, and manage it on-prem. Isn’t that what AWS EMR and Azure HDInsight are, in effect? So, it is not the complexity of Hadoop and its lack of capability versus the Cloud provider solutions, but the benefits of Cloud versus on-prem. Certainly, no one is arguing against those benefits. In fact, a “Cloud first” strategy is the recommended strategy for any enterprise.
Cloud versus On-prem tradeoffs
While everyone agrees that a “Cloud first” strategy is the right one, no one is promoting a “Cloud only” strategy. And that is because there are tradeoffs to consider. It won’t come as a surprise to many that while the first few months of a Cloud engagement look like a heck of a deal compared to an on-prem deployment, once full production hits peak levels the Cloud bill can cause sticker shock. All the nuances of compute, storage, and data flow, with a charge for each component used, add up to a healthy bill. Nothing wrong with that. But it is certainly something that needs more diligence when assessing the Cloud versus on-prem tradeoff. Gartner provides a tool, “Gartner Cloud Decisions”, for its customers to assess these tradeoffs.
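To make the point concrete, here is a minimal back-of-envelope sketch of how itemized Cloud charges add up as a pilot grows into peak production. All the rates below are illustrative assumptions for this sketch, not any vendor’s actual pricing.

```python
# Back-of-envelope model of a monthly Cloud bill, summing the itemized
# charges discussed above: compute, storage, and data transfer out.
# NOTE: all rates are assumed for illustration, not real vendor pricing.

ASSUMED_RATES = {
    "compute_per_vcpu_hour": 0.05,   # $/vCPU-hour (assumed)
    "storage_per_gb_month": 0.023,   # $/GB-month (assumed)
    "egress_per_gb": 0.09,           # $/GB transferred out (assumed)
}

def monthly_bill(vcpu_hours, storage_gb, egress_gb, rates=ASSUMED_RATES):
    """Return (itemized charges, rounded total) for one month of usage."""
    items = {
        "compute": vcpu_hours * rates["compute_per_vcpu_hour"],
        "storage": storage_gb * rates["storage_per_gb_month"],
        "egress": egress_gb * rates["egress_per_gb"],
    }
    return items, round(sum(items.values()), 2)

# A small pilot workload looks like a heck of a deal...
_, pilot_total = monthly_bill(vcpu_hours=2_000, storage_gb=500, egress_gb=100)
# ...but at production peak the same line items add up to sticker shock.
_, peak_total = monthly_bill(vcpu_hours=200_000, storage_gb=50_000, egress_gb=10_000)
print(pilot_total, peak_total)
```

The point of the sketch is not the specific numbers but the shape of the curve: every line item scales with usage, so the diligence has to happen before peak production, not after the first bill.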
When comparing these cost tradeoffs, one also needs to consider price-to-performance tradeoffs. For example, there is a huge trend for companies to put all their data on AWS S3 or Microsoft Azure Data Lake Storage (ADLS). Compared to block storage, these object storage solutions can provide big savings in storage costs. This has been used in articles to argue that Hadoop storage is now more expensive than Cloud storage. But that is not an apples-to-apples comparison. Netflix uses Hadoop ecosystem technologies like Apache Parquet and stores most of its data on AWS S3. So, it is not that Hadoop storage is inherently expensive. It is the choice of block storage versus object storage that makes the difference here.
These articles do not discuss the nuances behind such storage choices, which can have a large performance impact on your workload. Block storage is more efficient for databases than object storage, and that is even without the potential network hop from the compute infrastructure to object storage. A database engine can access data co-located with where it is processing a heck of a lot faster than it can access another cluster where the data resides, even in block storage, let alone object storage. How large this impact is for your specific workload, and whether it will affect Service Level Objectives, is certainly something an enterprise needs to assess. The latency introduced may or may not be insurmountable. The tradeoffs need to be understood. But comparing the cost of remote object storage to local block storage is too simplistic. Performance is a key consideration.
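One rough way to reason about this tradeoff is a simple scan-time model: transfer time plus cumulative per-request latency. The throughput and latency figures below are illustrative assumptions, not measurements of any particular storage service, but they show how the same scan can look very different on co-located block storage versus a remote object store.

```python
# Simplified model of scanning a dataset: time = transfer + request latency.
# Throughput and latency numbers are assumptions for illustration only.

def scan_seconds(data_gb, throughput_gbps, per_request_latency_s, requests):
    """Estimated wall-clock seconds to scan data_gb of data:
    raw transfer time plus cumulative per-request latency."""
    return data_gb / throughput_gbps + requests * per_request_latency_s

data_gb = 100.0
# Co-located block storage: high throughput, negligible per-request latency (assumed).
local = scan_seconds(data_gb, throughput_gbps=1.0, per_request_latency_s=0.0005, requests=1_000)
# Remote object storage: lower effective throughput, higher per-request latency (assumed).
remote = scan_seconds(data_gb, throughput_gbps=0.25, per_request_latency_s=0.05, requests=1_000)
print(local, remote)
```

Whether a gap of this shape is acceptable depends entirely on the workload’s Service Level Objectives, which is exactly the assessment the text argues an enterprise must make.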
A company needs to consider what it should store in block storage, what it should store in object storage, and how it should aggressively manage the movement of data from block to object storage to strike the right performance-to-cost balance. In fact, with other hardware options such as persistent memory, there might be multiple tiers of hot, warm, and cold data to consider and manage to reduce storage cost.
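As a sketch of the tiering idea, the snippet below assigns datasets to hot, warm, or cold tiers by days since last access and computes a blended monthly storage cost. The thresholds and per-GB rates are assumptions chosen for illustration only, not recommendations.

```python
# Sketch of hot/warm/cold tiering: place each dataset in a tier by how
# recently it was accessed, then compute the blended monthly storage cost.
# Thresholds and $/GB-month rates are assumed values for illustration.

TIER_COST_PER_GB_MONTH = {"hot": 0.10, "warm": 0.023, "cold": 0.004}

def tier_for(days_since_access):
    if days_since_access <= 7:
        return "hot"       # e.g. local block storage / persistent memory
    if days_since_access <= 90:
        return "warm"      # e.g. standard object storage
    return "cold"          # e.g. archival object storage

def blended_cost(datasets):
    """datasets: list of (size_gb, days_since_access) tuples.
    Returns the blended monthly storage cost under the assumed rates."""
    return round(sum(gb * TIER_COST_PER_GB_MONTH[tier_for(days)]
                     for gb, days in datasets), 2)

catalog = [(500, 1), (2_000, 30), (10_000, 365)]
print(blended_cost(catalog))  # mostly cold data pulls the blended rate down
```

A real policy would also have to weigh the performance cost of promoting cold data back to a hot tier, which is the movement-management problem the text describes.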
The other, not as well understood, tradeoff is that a Cloud vendor is running your workload on a Virtual Machine, where you are co-located on the same hardware infrastructure, using the same servers (CPU, memory, disk, etc.) as other customers of the Cloud vendor. Now, there have been several innovations and improvements that minimize the impact of a workload in one VM on a workload in another. But there is still an impact. If your Service Level Objectives for peak workloads are very stringent, then there may again be some concern about how much impact and variability in response times a Cloud deployment would have on your workload.
The argument is not to avoid the Cloud, but to be more diligent in assessing the tradeoffs.
Hybrid Cloud / Multi-cloud trends
Analysts and industry watchers have all indicated that there is a huge trend towards Hybrid Cloud deployments, where a company may have deployments in the Cloud as well as on-prem. And this is not just because they have legacy systems that may be harder to move to the Cloud, or depreciated hardware and data center resources that make it more cost effective to maintain on-prem systems. Some reasons are outlined above, but there may be many other reasons to keep current applications, or deploy new ones, on-prem. Many companies may even consider Multi-Cloud deployments, where they have applications not only in the Cloud and on-prem, but across different Cloud providers. The motivations for these are many and are outlined in various articles.
So, when people say customers are deploying Cloud provider solutions, it means that they are committing to Cloud-specific proprietary solutions. If you are on AWS you might use Redshift. On Azure you might use Azure SQL Data Warehouse. But you cannot use those on-prem. And you certainly cannot use them if you need to deploy on multiple Clouds. Maybe these are all independent applications and deployments, where you could commit to different technologies on AWS, Azure, or on-prem. But then you need to develop the skills to build and support applications running on all of these different databases and environments. Each will have its own monitoring and management solutions that are not integrated with each other. So, from a development and operational perspective, that increases complexity and cost. In other words, it may not be prudent to commit to a Cloud vendor specific solution without assessing the impact of that decision on your overall development and operational complexity.
Proprietary versus Open Source
Most Cloud provider technologies are proprietary. Other than a few technologies, such as TensorFlow from Google, Cloud providers have not open sourced their solutions, and these solutions do not run on other Cloud vendors’ platforms. So, instead of choosing a proprietary technology from one Cloud provider, you can choose a proprietary solution that runs on multiple Cloud platforms as well as on-prem, paying premium proprietary license costs. Or you can leverage open source based technologies, which can also be deployed across all Cloud platforms and on-prem, but at open source license costs. That is where Hadoop and its ecosystem of solutions come in, providing a myriad of open source, cross-platform solutions, developed by large communities, that give you the flexibility to customize solutions quickly for your competitive needs. The value of open source has been discussed at length, and it certainly deserves consideration compared to proprietary single Cloud vendor-based solutions, especially if hybrid or multi-cloud deployments are part of your strategy.
Given all these factors, Esgyn provides a database as a service, called Esgyn Strato, that can be deployed on AWS, Azure, or Google Cloud. It provides the same simplicity as any Cloud deployment versus on-prem: the environment is pre-provisioned and managed for you, and the complexity of configuring, provisioning, monitoring, and managing is reduced substantially. EsgynDB, which Esgyn Strato is based on, can also run on-prem. This allows applications to span multiple Cloud and on-prem environments, with a single interface to monitor and manage those resources across instances.
EsgynDB goes one better. While many Cloud providers require different proprietary solutions for NoSQL, OLTP, operational, BI (warehouse), and Analytics environments, EsgynDB can address all of these workloads equally well, leveraging the storage engines appropriate for each workload. In other words, it supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It also provides full enterprise capabilities such as Disaster Recovery, point-in-time recovery, multi-tenancy, mixed workload management, rock-solid security, and High Availability. It is by far the most comprehensive, mature, and enterprise-class solution for scale-out, elastic Big Data and Analytics.
It leverages the open source investments continually being made by large communities in Hadoop (specifically HDFS and ZooKeeper), Apache HBase, Apache ORC, Apache Parquet, and Apache Trafodion. Esgyn also provides a white glove Managed Service, where database experts with decades of experience help customers with the tradeoff decisions mentioned above and arrive at the best deployment model to achieve the Service Level Objectives that meet their Service Level Agreements.
So, reports of the demise of Hadoop and its ecosystem, which includes Big Data and Analytics SQL solutions, are highly exaggerated. In fact, they provide the architecture for the future of multi-cloud and on-prem deployments by avoiding Cloud provider lock-in.