The story of the industry's adoption of virtualization has a recurring theme: virtualizing workloads once thought entirely unsuitable makes obvious sense in hindsight.
Virtualize my test and dev environment? It'll never fly. Virtualize a database? Heresy! Virtualize desktops? Ridiculous. Virtualize a large enterprise app like SAP? Foolish.
But all are best practices now, with substantial benefits that, in retrospect, seem entirely predictable.
And here comes Hadoop -- the open-source Giant Swiss Army Knife of big data analytics. Whether it's a first project or a larger set of shared services, the number of IT shops that are running it -- or plan to soon -- is growing nicely.
The first round of early adopters went physical, as there were no reasonable virtualization alternatives at the time. But many of those building out their first environment (or their newest one) are seriously considering going virtual.
And, once again, the benefits are obvious in hindsight.
The Big Shift?
Over the last year or so, the Hadoop toolset has proven its worth for much more than the hard-core data science crowd. Indeed, many of the Hadoop projects I see these days have very little to do with data science, and far more to do with slinging data at scale.
Enterprise IT organizations are dumping their rivers of log files into HDFS, and exploring their internal operational data in all sorts of fascinating ways. The tried-and-true BI crowd is waking up to the fact that SQL need not be tied to a proprietary relational database or warehouse. The marketing group is going in deep: sifting social data and web logs to learn even more about their customers.
In 2003, when you had some data to analyze, you carefully built a relational database: structured fields, indexes, etc. In 2013, you simply dump it into Hadoop and have at it. Use cases are proliferating everywhere. And I don't expect this trend to abate anytime soon.
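To make the contrast concrete, here's a minimal sketch of the 2013-style workflow: count web requests by HTTP status straight off the raw logs with Hadoop Streaming, using two tiny scripts. The log format, paths, and file names are illustrative assumptions on my part, not anything specific to BDE:

```python
#!/usr/bin/env python
# mapper.py -- emit (HTTP status code, 1) for each combined-format web log line.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 8:               # crude guard against malformed lines
        print(f"{fields[8]}\t1")      # field 8 holds the status code


# reducer.py -- sum the counts per key (streaming delivers them sorted).
import sys

current, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if not value:
        continue                      # defensively skip malformed lines
    if key != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = key, 0
    total += int(value)
if current is not None:
    print(f"{current}\t{total}")
```

You'd load the logs with something like `hdfs dfs -put weblogs/ /data/weblogs`, then run the standard streaming invocation (`hadoop jar hadoop-streaming.jar -input /data/weblogs -output /data/status-counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`). No schema design, no indexes -- which is exactly the point.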
Particularly fascinating to me is the new breed of "data lake" projects: vast HDFS-based data landing zones, offered as a service by enterprise IT to multiple business units that want to collaborate.
This ain't your father's data warehouse :)
VMware's position is simple and expected: analytical workloads such as Hadoop are as amenable to virtualization as any other. And the benefits aren't hard to predict: simplified provisioning, unified management, elasticity, resource pooling, increased efficiency, as-a-service delivery, multi-tenant isolation, enhanced security, etc.
A pioneering effort begun several years ago (Project Serengeti) has now matured into an official VMware offering: Big Data Extensions, or BDE. BDE 1.0 was mostly about getting the basics right. Customer interest seemed to explode when it was officially announced at VMworld 2013.
How It Works
BDE encapsulates popular Hadoop distros in virtual machines while preserving the compute/data affinity that matters for performance. Storage targets can be anything Hadoop supports, from simple local disks to more advanced array-based implementations such as EMC's Isilon.
People's first concern is usually "doesn't virtualization hurt performance?" -- and the answer is clearly "no."
Dozens of apples-to-apples comparisons of physical to virtual show the same pattern: a few workloads run a few percentage points slower, a few workloads run a few percentage points faster.
It's basically a non-issue in the bigger scheme of things.
For IT shops already invested in a VMware farm (and that's just about everyone), implementing one or more Hadoop clusters is a simple extension of what they already know and use -- same pool of resources, same tools, same operational workflows, etc.
Hadoop users get an easy-to-provision cluster that works the way they'd expect -- same tools, same behaviors -- with no need to buy dedicated hardware and stand up a hand-crafted, inelastic physical environment.
But there's more as you dig into it. The ability to independently vary compute and data. Controlled data sharing and multi-tenancy built on vSphere's familiar capabilities. A core technology that's familiar to the audit committee -- and so on.
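To make the first of those concrete -- independently varying compute and data -- here's a sketch, in Python purely for readability, of the kind of cluster layout involved: data nodes that hold HDFS and stay put, and stateless compute nodes that can be grown or shrunk on their own. The group names, roles, and sizes are my own illustrative assumptions, not a documented BDE spec:

```python
# A sketch of the compute/data split -- field names are illustrative,
# not a documented BDE cluster spec. Data nodes hold HDFS and stay put;
# compute nodes are stateless and can be resized without touching data.
import json

cluster_spec = {
    "nodeGroups": [
        {
            "name": "data",                    # HDFS lives here; sized once
            "roles": ["hadoop_datanode"],
            "instanceNum": 8,
        },
        {
            "name": "compute",                 # stateless workers
            "roles": ["hadoop_tasktracker"],
            "instanceNum": 16,                 # dial up or down as demand shifts
        },
    ]
}

print(json.dumps(cluster_spec, indent=2))
```

The design point is that the data group is the expensive, stable part; the compute group is cheap to flex because nothing durable lives on it.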
This week there's a big event in New York, tortuously titled "Strata Conference + Hadoop World," and VMware is using the venue to announce new features coming in BDE. As I said, BDE 1.0 (announced in August) was mostly about nailing the basics: doing what you'd expect in a fully virtualized Hadoop cluster.
Now, just a few months later, we've got more goodies, as you'd expect.
So, what's new?
The most visible new feature is vCAC integration -- officially vCloud Automation Center, pronounced either "vee-cack" or "vee-cake" depending on who you talk to.
Prior to working at VMware, I wasn't entirely familiar with what it could really do; now I have a strong appreciation for its unique power as an uber-automator. It's often used to deliver easy-to-consume services that mask powerful automated workflows.
In this context, it's used to create Hadoop-as-a-Service (HaaS, anyone?). Authorized users select from pre-configured templates and then progressively tailor the environment to their liking -- without IT really getting involved beyond setting up the service.
There's a slick demo video that's worth a quick view if you're interested.
There are also other potentially interesting implications of vCAC integration: automating data extraction and workflows, perhaps even managing resources dynamically.
It could be that during the day you'd like to prioritize interactive performance, and at night use the same resources for deep analytics. Reconfiguring your assets now becomes something that's relatively easy to do on a recurring basis.
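Here's a minimal sketch of what that might look like. resize_workers() is a hypothetical placeholder for whatever management hook your environment exposes (a BDE resize operation, a vCAC workflow) -- it is not a documented API:

```python
#!/usr/bin/env python
# Sketch: shift one pool of resources between interactive work (day) and
# deep analytics (night). resize_workers() is a hypothetical placeholder,
# NOT a documented BDE or vCAC call -- wire it to whatever hook you have.
from datetime import datetime

DAYTIME_WORKERS = 4       # illustrative: leave headroom for interactive VMs
NIGHTTIME_WORKERS = 16    # illustrative: throw the pool at batch analytics

def resize_workers(cluster: str, count: int) -> None:
    """Placeholder for the real management call in your environment."""
    print(f"[{datetime.now():%H:%M}] resizing '{cluster}' to {count} workers")

def reprioritize(cluster: str = "analytics") -> None:
    # Simple day/night split; the real schedule would live in cron or vCAC.
    hour = datetime.now().hour
    if 8 <= hour < 20:
        resize_workers(cluster, DAYTIME_WORKERS)
    else:
        resize_workers(cluster, NIGHTTIME_WORKERS)

if __name__ == "__main__":
    reprioritize()
```

Run at the schedule boundaries from cron or a vCAC workflow, the same pool serves two masters without anyone re-racking hardware.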
BDE is all about distro choice, so there's now support for the increasingly popular Intel distribution through the VMware Ready program -- in addition to all the distros already supported. There's also a lot of visible fit-and-finish work on the user interfaces, reflecting feedback from early adopters.
And Passionate Users Emerge
Amongst all the sessions at VMworld 2013, there was a BDE/Hadoop track that turned out quite well.
For me, the highlight was hearing from people who were actually using the stuff to get work done: FedEx, Northrop Grumman, T-Systems and Identified. They had awfully nice things to say :)
What struck me were the use cases: these weren't about supporting math PhDs creating machine-learning algorithms; they were pragmatic, bread-and-butter tasks that previously would have required vastly more resources and delivered poorer results.
For them, virtualizing Hadoop was simply the obvious choice -- just as it has been for all the other workloads.
Once Again, The Case Is Clear
It's a bit awkward for me to share "Benefits of Virtualizing Hadoop" slides, because -- in retrospect -- I feel like Captain Obvious.
Most -- if not all -- of these bullets could be applied to virtualizing test and dev, virtualizing databases, virtualizing enterprise apps, virtualizing desktops, etc. The bigger and more complex the workload, the bigger the benefits. And Hadoop definitely fits that category.
That being said, there are still pockets of pushback around virtualizing Hadoop -- though none of the objections are fact-based. Hopefully the pushback will disappear over time, just as it did for other enterprise workloads.
But there's more.
My advice to IT groups: if you haven't stood up some sort of modest Hadoop cluster yet, maybe you should.
For starters, once you get past the initial learning curve, you'll find it a powerful and efficient alternative to the familiar query-oriented relational databases we all use day in and day out.
Unless you feel morally obligated to feed Oracle's bottom line, that is.
Perhaps more importantly, sooner or later someone is going to get the itch to experiment with Hadoop, and it would be great if IT could say "well, we have this environment already stood up …"
And, if you're already a VMware shop, nothing could be easier :)