For many, Hadoop has become much more than just an interesting data management technology -- it's starting to be seen as the potential strategic successor to the familiar relational database world.
Along those lines, EMC's Greenplum division -- now part of the nascent Pivotal Initiative -- has announced a considerable suite of extensions and enhancements to the increasingly popular Hadoop environment.
But please don't get lost in the product detail -- there are some very big and potentially controversial ideas in play here.
The role that Hadoop plays in big data analytics is now unquestionable; the race is on to see who can provide the best enterprise platform and ecosystem. And, from where I sit, the new Pivotal group is doing a stellar job of earning that title.
With this announcement, there's a new high-performance SQL layer to make more data available to more data workers.
There are new capabilities to parallelize analytics processing. There's now a meaningful set of integrations with VMware and virtualized infrastructure. There are new consumption options for infrastructure and especially storage.
And there's a new set of management tools aimed squarely at the administrators of these environments.
All of this is wrapped up in a new name (Pivotal HD) that reflects the new direction of the team.
Hadoop is growing up -- fast.
The Big EMC Bets Around Hadoop ... and HDFS
Few open-source tools have enjoyed the meteoric popularity of Hadoop in building these next-generation big data analytics platforms.
Even in its rawest distro form, it's eminently flexible, scalable and very cost-effective. As a result, Hadoop has quickly become the new de-facto standard for anyone doing anything in big data analytics computing.
But there are even bigger ideas in play here ...
We believe that HDFS (the underlying data abstraction beneath Hadoop) will play a key role as the future "data substrate" for next-generation data infrastructure. The familiar relational database that's powered data-based processing for the last few decades will likely be subsumed by newer capabilities built on top of HDFS.
That's a pretty bold statement ... so let's look at the case for such a transition.
The Case For An Industry Transition?
To really appreciate why one well-understood approach to managing data might be subsumed by a newer approach, it's important to understand the context.
At the highest level, there's a pronounced shift to digital business models, ones where the entire value proposition centers around gathering, storing, analyzing and leveraging massive amounts of information. The words may change, the underlying concepts don't.
Study the people who are now doing things the new way, and their information management patterns are markedly different than before. For one thing, they're strongly incented to keep as much data around as humanly possible, preferably in its native and raw form with as much fidelity as possible.
The scale of these endeavors starts very big, and gets much bigger very quickly. The previous generation of data management technologies envisioned a world of terabytes; we currently measure these environments in petabytes (1000x), and that nomenclature won't last more than a few years before we start routinely talking in exabytes (1000x again!).
Simply attempting to cram 1000x everything into your favorite relational database -- designed for a very different world -- forcefully breaks the model in so many ways: scale, performance, cost, openness, etc. It's like trying to use dial-up modems to stream HD video.
You go looking for a better answer.
Your quest is for a standard data management platform to build your next-generation of big data applications -- a platform that anticipates the future, but can bring along some of the legacy as needed.
That's the role we see HDFS playing over the next few years -- it's the new data management platform for the big data era. There's clear opportunity for innovation both above and below that key abstraction.
If you fit this profile, you certainly will be looking for flexibility in your deployment model: physical hardware, a private cloud, or a more public one -- depending on your needs. Just to be clear, the decision to build on one data management platform shouldn't dictate the deployment model underneath.
Yes, there are a lot of cool product details here to pore through, but the real message is a bold statement about the future of data management in a big data world.
The Hard Part Is The Unlearning
In order to learn the essence of big data analytics, you have to essentially unlearn the familiar disciplines around BI -- business intelligence. Indeed, anyone who meets with customers on big data analytics is familiar with the need to unwind people from what they think they already know.
To be clear: there's nothing wrong with BI and reporting against traditional databases -- it continues to have its role -- it's just not what's being discussed here.
Perhaps the most important difference is in the use of data. BI tends to use small, uniform historical data sets. By comparison, big data analytics thrives on diversity of data -- the more sources, the better.
Indeed, there used to be continuing debate in big data analytics -- what's more important in creating better predictive models: better math or more diverse data?
The jury appears to have decided this one: more diverse data -- fresh, raw and unfiltered -- usually results in better predictive models. That particular mindset is usually anathema to traditional BI thinking: an innate desire for one source of truth, built using cleansed and sanitized data, sourced from a limited number of data sets with limited history.
Indeed, look closely at most established ETL (extract, transform, load) processes and the common goal seems to be to compress and filter as much as possible.
In many regards, big data analytics appears to invert as many BI assumptions as possible: the attractiveness of having the widest possible range of relevant, raw, native-fidelity data sources -- internal and external. Having access to a wide variety of tools that empower data workers to experiment with data, and help them collaborate around the results. An iterative methodology to propose models, and validate their predictive power through experimentation.
It's not your father's BI.
For example, Hadoop does a great job with unstructured data, but its current capabilities are less formidable when it comes to familiar structured data sets.
There are clear opportunities to make Hadoop environments more performant, more efficient and more operationally robust: data management, consolidated operations, enterprise hardening, etc.
And, more importantly, there's a wide world of great tools that already know how to speak SQL, and people who know how to use them.
That useful expertise has now been brought to the HDFS world -- large-scale tables can be created and managed on top of HDFS vs. a more typical database.
This is somewhat of a big deal, if you think about it.
First, there's a huge inventory of programs and tools that speak SQL fluently. These can now be pointed at newer HDFS sources much more flexibly. Second, there's a vast universe of data workers who understand SQL. Their skills and expertise can now more easily be brought to these newer platforms.
And, thanks to parallelism and flat data representations, performance is in an entirely new class.
With the new technology -- dubbed HAWQ (Hadoop With Query) -- performance has been improved to the point where most queries are now interactive vs. batch. The wide world of SQL tools, extensions and expertise can now be applied directly to HDFS data sets.
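To make that concrete, here's a minimal sketch of what this looks like from a data worker's chair -- it assumes HAWQ's PostgreSQL-compatible interface, and the host, database and table names are invented for illustration:

```python
# A minimal sketch, not production code: assumes a PostgreSQL-compatible
# interface; host, database, user and table names are invented for illustration.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="analytics", user="gpadmin")
cur = conn.cursor()

# An ordinary ANSI SQL aggregation -- the point is that existing SQL skills
# and tools apply unchanged, even though the rows live in files on HDFS.
cur.execute("""
    SELECT user_id, COUNT(*) AS page_views
    FROM clickstream
    WHERE event_date >= %s
    GROUP BY user_id
    ORDER BY page_views DESC
    LIMIT 10
""", ("2013-01-01",))

for user_id, page_views in cur.fetchall():
    print(user_id, page_views)

cur.close()
conn.close()
```

Nothing about that snippet is Hadoop-specific -- and that's exactly the point.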
The standard query engine in Hadoop (Hive) isn't known for its performance. I've included a graph that shows a variety of queries, comparing the performance of the new Advanced Database Services vs. Hive.
You'll have to look very closely for the itty-bitty blue bars showing the elapsed time for the new Pivotal HD queries -- it's that good.
... And Parallel Analytics Libraries
While a great deal of the querying and filtering is highly parallelized in Hadoop environments (and thus very efficient), the heavy-math analytics components usually aren't.
As part of this release, Pivotal HD now supports a rich framework for parallelizing the analytics as well -- pushing them closer to the data and nodes, if you will.
For those of you pushing the envelope on your analytical models, this is a nice leap forward.
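The announcement doesn't spell out the programming model here, so treat the following as a rough sketch only -- it assumes a MADlib-style in-database analytics call (table and column names invented), where model training runs in parallel next to the data rather than on a client machine:

```python
# A rough sketch only: assumes a MADlib-style in-database analytics library
# (the announcement doesn't name one); table and column names are invented.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", dbname="analytics",
                        user="gpadmin")
cur = conn.cursor()

# The model is trained where the data lives -- the engine fans the math out
# across the cluster's segments instead of dragging rows back to a client.
cur.execute("""
    SELECT madlib.linregr_train(
        'house_sales',               -- source table on HDFS
        'house_sales_model',         -- output table for coefficients
        'sale_price',                -- dependent variable
        'ARRAY[1, sqft, bedrooms]'   -- independent variables
    )
""")
conn.commit()
cur.close()
conn.close()
```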
Extensibility Through GPXF
In many ways, the new Advanced Database Services found in Pivotal HD can be thought of as a "universal adaptor" for the Hadoop platform.
Part of that is the Greenplum eXtension Framework (GPXF) which provides a generic mechanism for processing different file formats through the native SQL.
Modern data science presumes an inordinate diversity in data sources; predictably, this also results in a requirement for a wide variety of data formats. There's a wide selection of file formats supported today, and a mechanism that enables Pivotal -- or any member of the community -- to easily add new ones. Slick.
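As a purely hypothetical illustration of what that "universal adaptor" idea implies, here's roughly what exposing an HDFS file format through plain SQL might look like. The external-table DDL follows general Greenplum conventions, but the gpxf:// location, formatter name and paths below are my own placeholders, not taken from the announcement:

```python
# A hypothetical sketch: the external-table DDL follows general Greenplum
# conventions, but the gpxf:// location, formatter name and file paths are
# placeholders, not taken from the announcement.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", dbname="analytics",
                        user="gpadmin")
cur = conn.cursor()

# Declare HDFS files as an external SQL table; after this, they can be joined,
# filtered and aggregated like any other relation.
cur.execute("""
    CREATE EXTERNAL TABLE weblogs_json (
        ts       timestamp,
        user_id  text,
        url      text
    )
    LOCATION ('gpxf://namenode.example.com:50070/data/weblogs/*.json')
    FORMAT 'custom' (formatter='gpxf_json_import')
""")

cur.execute("SELECT count(*) FROM weblogs_json")
print(cur.fetchone()[0])

cur.close()
conn.close()
```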
The Greenplum Command Center
In an enterprise, one or more big Hadoop clusters have to be managed like any other set of large-scale enterprise IT services.
The relatively-new role of "Hadoop Cluster Admin" is now starting to look for tools to help them get their job done, and that's part of this announcement as well.
The new Greenplum Command Center is system monitoring for a Pivotal HD environment. It provides a real-time and historical view across three control planes: individual host (or node) level monitoring, application-level HDFS monitoring, as well as monitoring specific MapReduce jobs.
While there's much more that can potentially be done here, it's a good start towards operationalizing production Hadoop clusters.
Pivotal HD is software -- but it obviously needs to run somewhere.
For those customers who prefer to run physical on their choice of industry standard servers, that option continues to be well supported.
The recently-updated Greenplum DCA is a turnkey appliance running Pivotal HD for those who want a complete hardware and software solution, supported by a single vendor.
And the third -- certainly more intriguing -- option is running virtual. More people are coming around to the notion that big data workloads can benefit from running virtual, just like other workloads.
Pivotal HD in A Virtual World
There are two pieces to consider here: HVE and Project Serengeti.
HVE (Hadoop Virtual Extensions) is a VMware project that adapts many of the benefits of server virtualization to Hadoop workloads specifically: it introduces the notion of a "Node Group Layer" (a collection of virtual machines running Hadoop on physical servers) and implements useful policies -- for example, ensuring that replica nodes don't end up on the same physical node, balancing workloads within a host, or scheduling tasks more efficiently through resource optimization.
The big idea here is simple: to enable compute/data node separation, without losing the benefits of locality -- basically, having your cake and eating it too.
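This isn't HVE's actual code -- just a toy Python illustration of the kind of placement policy that node-group awareness makes possible. The topology and names are made up:

```python
# Toy illustration (not HVE itself): replicas must not land on virtual data
# nodes that share the same physical host, even though each replica lives in
# its own VM.
from itertools import combinations

# Hypothetical topology: virtual Hadoop data nodes mapped to physical hosts.
vm_to_host = {
    "datanode-vm1": "physical-host-A",
    "datanode-vm2": "physical-host-A",
    "datanode-vm3": "physical-host-B",
    "datanode-vm4": "physical-host-C",
}

def replica_placement_ok(replica_vms):
    """Return True if no two replicas share a physical host."""
    hosts = [vm_to_host[vm] for vm in replica_vms]
    return all(h1 != h2 for h1, h2 in combinations(hosts, 2))

print(replica_placement_ok(["datanode-vm1", "datanode-vm3", "datanode-vm4"]))  # True
print(replica_placement_ok(["datanode-vm1", "datanode-vm2", "datanode-vm3"]))  # False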
Project Serengeti is an open source project that aims for full containerization of complex Hadoop landscapes, making them easier to instantiate and manage using many of the same template-based approaches already in use for fully virtualized environments.
Although most people tend to think in terms of a single, ginormous Hadoop cluster (and there are plenty of those), there's also a role for an easy-to-set-up, easy-to-tear-down templated version running off of shared resources: testing, smaller projects, multiple Hadoop clusters on a single shared infrastructure pool, etc.
Pivotal HD fully supports and integrates with both HVE and Project Serengeti. Additionally, developers using the Spring framework will find a nice set of capabilities for working with Hadoop-based services.
Two Storage Options
It's big data, so storage must be important, right?
HDFS typically runs on a RAIN-ish architecture -- 100% of the data is replicated across three physical nodes (in the default placement policy, one copy on the local node and two more on nodes in a separate rack) to deliver acceptable availability.
But it's clear you're only using about a third of your physical capacity using this approach. And we know we can do better ...
EMC's Isilon presents a native HDFS interface in addition to NAS, CIFS, etc. -- they're the exact same files, just with a different set of semantics and a hardened name service. Using Isilon, storage utilization jumps to 80% (that's a big deal when we're talking petabytes), and -- perhaps more importantly -- there's no need to copy a mountain of data from NFS to HDFS, run your analytics, and copy the result out of HDFS to NFS.
Real-world workflows are thus dramatically compressed in time -- and that's a big deal as well.
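The back-of-the-envelope math is worth a quick look. Assuming a hypothetical 3 PB of raw disk, triple replication versus the ~80% utilization figure quoted above works out roughly like this:

```python
# Back-of-the-envelope math, using a hypothetical 3 PB of raw disk. The 3x
# figure is standard HDFS replication; the ~80% figure is the Isilon number
# quoted above (actual protection overhead varies by configuration).
raw_capacity_pb = 3.0

usable_hdfs_3x = raw_capacity_pb / 3      # ~1.0 PB usable: every block stored 3 times
usable_isilon = raw_capacity_pb * 0.80    # ~2.4 PB usable at ~80% utilization

print(f"Triple-replicated HDFS: {usable_hdfs_3x:.1f} PB usable")
print(f"Isilon-backed HDFS:     {usable_isilon:.1f} PB usable")
```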
There's more to be intrigued with: high-availability name node services, the ability to scale compute and storage independently, familiar data services like snaps and replications, the ability to support one or more Hadoop clusters as part of a single Isilon cluster and namespace, and so on.
Pivotal HD supports both storage choices, with Isilon's HDFS interface exposed to the Greenplum Command Center described above.
Putting All The Pieces Together
This graphic here does a good job of summarizing all the major functional components of the quickly-evolving Pivotal HD platform -- and where they come from.
Moving up, there are infrastructure choices: physical or virtual, traditional HDFS implementation or Isilon, and so on.
The standard data processing tools (HBase, MapReduce, Pig, Hive, Mahout, etc.) are now augmented by the new Advanced Database Services, bringing the SQL world forward in a meaningful way.
The existing management and workflow tools (YARN, ZooKeeper) are now complemented by the new Greenplum Command Center. More work to be done, but it's a very credible set of offerings in its current form.
Hadoop is growing up. And the expanded Pivotal team is certainly doing a good job of leading the way.
Behind The Scenes -- The EMC Resource Commitment
If you're an enterprise, and you see yourself doing big stuff with big data, you're going to want your chosen vendor to have big company resources behind it -- just like other forms of IT.
And EMC -- through Greenplum and now Pivotal -- has assembled a surprising amount of talent and capabilities in a relatively short time.
Most people are aware that most of the foundational work around Hadoop was originally done at Yahoo. I think we've done a good job rounding up some of that original talent, and complementing it with a serious contingent of real-world practitioners.
Simply put, we're betting big here, as evidenced by the serious horsepower we've assembled so far.
And it's not just people we're investing in; there are some pretty large resource investments as well -- such as the 1,000-node, 24-petabyte, multi-million-dollar Greenplum Analytics Workbench -- a facility dedicated to the testing and validation of Apache releases at scale. Findings and improvements are cycled back into the source distribution as well as Greenplum HD.
We're not just packaging an open source distro and sending it on its way -- we're committed to advancing the state of the art for serious practitioners.
A New Category Is Born
There's the familiar Hadoop -- and now I think there's "enterprise Hadoop".
I believe people will start to look at this nascent category through two distinct lenses. Most will see an improved and enhanced version of Hadoop that's suitable for enterprise-like use cases. And there's nothing wrong with that.
While the open source distro certainly has its charms -- at some point, more than a few people will go looking for enterprise-class features (and support!) that maximize the value of a considerable investment, and minimize the associated costs (capex and opex).
Right now, Pivotal HD is at the head of the class for these people.
But I think there will be more than a few forward-thinking folks who look at all of this with a gleam in their eye, and understand that there's a sea change occurring about how we think about data management layers in a big data world.
The baseline is the familiar Apache distribution with enterprise-class support.
Add in a proven MPP database to complement Hadoop processing. Add in an advanced version of parallelized SQL to bridge big data with existing tools and skill sets. Throw in the ability to parallelize sophisticated analytical processing, as well as adaptors to a variety of data types.
Create a range of deployment options from traditional, to packaged and -- increasingly -- virtualized. Start to deliver management tools for the administrators. Offer an enterprise-grade storage alternative to the traditional roll-your-own. Back it all with an amazing development, support and consulting team.
Hadoop is growing up -- and fast.