It's great to see so many businesses starting to experiment with Hadoop and its unique toolset. But I'm sure all this exciting experimentation is creating more than one headache for the IT team.
Today, VMware announced some very useful extensions to Project Serengeti that allow virtual Hadoop clusters with very different characteristics to be easily delivered as-a-service on top of existing VMware infrastructure.
While I'm sure there are Hadoop purists who might object to anything other than a dedicated, hand-crafted bare metal implementation, I think the VMware team has come up with a great capability that encourages increased experimentation with Hadoop's capabilities across the business without breaking the bank.
Take any interesting business process -- anywhere -- and it's a candidate for radical improvement using predictive analytics and big data.
From revenue generation (marketing analytics, sales productivity) to cost savings (supply chain, logistics, etc.) and even the IT team itself (advanced security, demand forecasting, etc.), it's turning out to be a corporate tool that can be used just about anywhere.
But this experimentation comes at a cost. While many of the software tools (Hadoop, MapR, etc.) are near-free, the infrastructure most certainly is not.
The conventional approach is that every use case wants its own dedicated Hadoop cluster and storage farm.
Each one of these experiments takes time and money to set up. Hard initial choices have to be made around sizing, as well as compute/storage ratios. Not every business experiment is successful -- and that's to be expected. And, ideally, data sets could be shared across all potential users -- rather than being hard-wired to a specific Hadoop cluster.
But in addition to costs, there's a thornier issue at hand -- and that's being responsive.
If IT sets the justification bar too high -- or IT reacts too slowly -- the potential outcomes are less-than-ideal: far less experimentation will occur, or -- alternatively -- motivated business users will set out on their own and go elsewhere.
Enter Project Serengeti
I first wrote about this Hadoop-meets-virtualization combination close to a year ago. Since then, both the market and Serengeti have matured considerably.
The core idea is simple: VMware now provides tailored tools (via Project Serengeti) that make it far easier for a virtual infrastructure administrator to spin up one or more virtual Hadoop clusters using the existing pool of resources.
The VI admin then simply hands off the provisioned environment to the Hadoop administrator, who then goes about their business as if they were using dedicated servers and storage.
Hadoop -- and all that goes with it -- simply becomes yet another enterprise workload that runs on a common, shared and virtualized platform.
That's about it. But simple ideas can be very powerful indeed if executed well ….
How It Works
VMware offers a downloadable vApp with everything you need to get started. The standard Apache Hadoop distribution is included, but there's also support for other popular distros: Cloudera, Pivotal HD, Hortonworks, etc. Your choice.
The virtual infrastructure administrator can use a fully templated, menu-driven process to set up a new Hadoop cluster, or can edit configuration template code directly if desired.
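To make the "edit the template directly" option concrete, here's a minimal sketch of what a Serengeti cluster specification might look like. The field names below (nodeGroups, roles, instanceNum, memCapacityMB, and so on) are assumptions based on the Serengeti 0.8-era JSON format, not a definitive schema -- the documentation bundled with the vApp is the authoritative reference.

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "SHARED", "sizeGB": 50 }
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 4,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "LOCAL", "sizeGB": 100 }
    }
  ]
}
```

The appeal of a spec like this is that the hard initial choices mentioned earlier -- sizing and compute/storage ratios -- become a few editable numbers rather than a hardware purchase: bump instanceNum to grow the worker pool, or put the master on shared storage for HA while workers use cheaper local disk.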
The Hadoop cluster admin now can go about their business with no change in process or model.
Because it's built on VMware, all the familiar VMware goodness comes along for the ride: high availability features, resource partitioning and QoS management, DRS, etc. -- even a nice cadre of VMware-based IT service providers if you'd prefer to rent vs. buy ;)
Why I Think This Is Important
So many IT organizations I meet are desperately trying to standardize and virtualize across their environments. The prospect of having to stand up multiple, dedicated Hadoop infrastructure stacks in support of new business experiments is anathema to them -- as it should be.
Second, while there are most certainly many organizations who have substantial Hadoop farms built out, there are far more who are just getting started. VMware's approach encourages this experimentation across multiple business units, without having to commit the organization to a specific direction and associated investment.
Third, the management of Hadoop infrastructure is now completely consistent with other virtualized workloads -- same tools, same processes, same capabilities, etc. That's a big win right there.
Fourth, all the compute resources associated with any instance of Hadoop are now pooled and dynamic: between Hadoop users, and -- of course -- with other compute tasks in the larger environment.
Finally, let's not forget that any VMware environment is inherently agile: respond fast, size up and size down, re-use assets when no longer needed, etc.
And when the business is trying to master a new set of tools, agility becomes more important.
For More Information
There's no need to wait -- you can get started right now if you like …
The downloadable vApp with Serengeti 0.8.0 is here -- it's completely free. There's a great write-up on comparative performance using a variety of workloads and configurations, as well as a good description of the HA capabilities. Richard McDougall of VMware wrote a nice post, here.