Virtualize something -- anything -- and you make it easier for everyone to consume: IT vendors, enterprise IT organizations -- and, most importantly, business users. The vending machine analogy is a powerful and useful one.
At a macro level, cloud is transforming IT, and virtualization is playing a starring role.
Enterprise-enhanced flavors of Hadoop are starting to earn prized roles in an ever-growing variety of enterprise applications. At a macro level, big data is transforming business, and Hadoop is playing an important role.
The two megatrends intersect nicely in VMware's recently announced Project Serengeti: an encapsulation of popular Hadoop distros that makes big data analytics tools far easier to deploy and consume in enterprise -- or service provider -- settings.
And if you're interested in big data, virtualization, cloud et al., you'll want to take a moment to get more familiar with what's going on here.
The Power Of Hadoop
Hadoop is a popular collection of diverse tools that work together in a loosely-coupled fashion to tackle a wide variety of fascinating big data analytics problems. Within Hadoop, you'll find file system tools, data management tools, job scheduling and workflow tools, analytical tools, and a whole lot more to pick through.
The more tools, the better. Not every tool is important on every project. And you've always got an eye out for a better tool here or there.
Since Hadoop is essentially an open source distribution of this tool collection, a variety of small software companies have sprung up to create enhanced versions of the tools, perhaps emulating what we saw in the Linux world. Indeed, you'll see investors and venture capitalists attempt to build a case around what happened in the Linux market and what they'd like to see happen in the Hadoop market.
I think they're somewhat misguided -- they should be looking at Ingres and similar open source data management projects for a historical precedent, and not an operating system -- but that's not my concern today.
I used to keep a running list of stories I'd heard of people using Hadoop for one thing or another. I've given up. There are just too many use cases: web logs, textual analysis, machine-to-machine interactions, security logs, etc. etc. It's spreading everywhere, like SQL did thirty-odd years ago.
Any time people are interested in wrangling some massive, uncorrelated data sets, they're reaching for the Hadoop toolbox. That's cool.
Despite the power of the tools, there are three rather big areas for improvement that we as vendors can work on.
First, no one ever claimed that Hadoop environments were easy or efficient to deploy. Second, no one ever claimed that Hadoop tools were easy to consume for the casual analytics user. And, finally, no one ever claimed that Hadoop-based applications were easy to integrate alongside other enterprise tasks.
And our friends at VMware have made clear and meaningful investments in all three areas: making the power of Hadoop easier to deploy, consume and integrate.
The Power Of VMware
If you're reading this blog, the impact of VMware on the IT industry needs no explanation: the company and the technology have fundamentally changed the way we think about IT infrastructure, cloud, delivering IT services, etc.
The value proposition seems to be infinitely extensible: apply virtualization to Thing X, and Thing X immediately becomes more attractive.
Project Serengeti is just another example of that principle at work. Take today's collection of Hadoop-related open source tools, pop them into a set of well-managed virtual containers, and all of a sudden they get a whole lot more attractive -- especially to enterprise IT organizations as well as the service providers that want to sell to them.
The sound-bite associated with the project is a good starting point: one-click provisioning of Hadoop landscapes. Uhh, that's a really big deal, based on what I've seen.
Today, we're usually talking about the physical provisioning of dozens (or sometimes hundreds) of servers, boatloads of raw storage and associated network components, all done old-school on bare metal.
Once that fun is done, we're then talking about downloading dozens of open source components, installing them, firing them up, and configuring them to work with each other. Weeks can become months very easily.
So the idea of an on-demand Hadoop environment running on your familiar and existing infrastructure becomes a very interesting proposition indeed.
But There's More
Consider, for a moment, just a small set of typical enterprise requirements. Availability, as just one teensy example.
One area of attention from the aspiring Hadoop distro club is the need to harden the name node in a cluster. If the node goes away for any reason, the environment is unusable, and it can take a long time to recover. Uhh, in a VMware environment, that name node is just another application, so you get the whole VMware extended HA thing.
Not to mention the ability to use something like an Isilon array to make the problem go away once and for all.
How about elasticity? You know, variable amounts of compute, memory and storage? One of the drawbacks of classic Hadoop infrastructure implementations is that they're about as rigid as they come: fixed number of nodes, fixed amount of compute/memory/storage, fixed ratios between the components, etc.
Put the environment in virtual containers, and they become nicely variable and elastic.
I, for one, can easily imagine one or more Hadoop environments being spun up or spun down on something like a Vblock as just another variable workload in the mix. No big deal anymore -- just another thing that the business wants to do on a standard infrastructure platform.
The encapsulation model of VMware provides a useful security hook for policy controls and enforcement -- again, no one was really thinking about security too much when Hadoop was developed.
There's more, but I think you get the idea.
Take something interesting (like Hadoop in the enterprise), wrap it in virtualization, and that thing becomes much more appealing to all involved.
I Could Just Stop Here ...
... but there's more to the story you should consider. Project Serengeti is useful and compelling in its own right, but the picture gets a lot more compelling when you consider the other investments VMware is making along these lines.
For starters, consider the vFabric Data Director. Data science and predictive analytics environments tend to do a *lot* of ad-hoc database creation and destruction. It looks like Data Director was almost explicitly designed for this use case.
You might remember VMware's recent acquisition of Cetas -- essentially an easy-to-consume portal for analytics-as-a-service. Or Spring's announcement of support for the Apache Hadoop distro to enable application developers to work with Hadoop-based analytical processes as part of next-generation predictive applications.
And all that's before you go looking inside EMC and see everything we're doing with Greenplum, Isilon, RSA, IIG and other parts of our portfolio.
Small Announcements Can Have Big Impacts
The team at VMware didn't make an inordinate amount of noise around this particular announcement. I think that's appropriate.
Those who are watching this space closely will likely understand the impact and benefits of Hadoop-based tools running nicely in virtualized environments. They won't need a big event to tell them that it's an important development.
Cloud transforms IT. Big data transforms business.
And VMware is now clearly at the intersection of both.

Hi Chuck
Been a while since I read your blog. Does not look like you have skipped a beat!
The big data thing has amazing potential but with short term pitfalls. I watched a company launch a huge "big data" initiative four years ago, and then...nothing. Some cloud rained on big data. Hadoop became Hadoops!
I love your tool kit analogy. As in mining, the several stages require very different skills and tools. Upstream, downstream and distribution, as the oil companies say.
Perhaps, just as in oil, we will have a few "big data" companies with a critical mass of data reserves from whom anyone can buy a gallon of useful data.
How about the following advice to companies - don't buy a big data rig and all the tools. Go to the pump and get a gallon of data - let some specialist do the prospecting, mining, refining and distribution. If you have oil in your back yard (data in your cloud?), get an oil company to drill the well and give you the royalties by marketing the value more widely.
Perhaps another billion dollar business for Google and the six others who will be the seven sisters who rule the big data world.
Posted by: Sukh | June 13, 2012 at 07:32 PM