If you're like me, you've been promoting the idea of server virtualization for many, many years.
You're also probably familiar with the standard pushback: what about performance?
I can clearly remember going through before-and-after charts again and again for workload after workload: databases, Exchange, web servers, etc.
You had to convince people one painful step at a time.
Here comes a new workload that more IT organizations are stepping up to: Hadoop in all its forms.
While most IT professionals can see the many, many benefits of virtualizing Hadoop environments, they frequently encounter stubborn resistance from people who "just know" doing so will unacceptably impact performance.
Well, they're wrong. And there's hard data to prove it.
Big Data At VMware
The VMware team has already brought you HVE (Hadoop Virtualization Extensions) as well as Project Serengeti.
Today, another accomplishment: VMware and Cloudera announced that they've completed extensive joint qualification and performance characterization.
That's all well and good, but what is *really* interesting is the performance white paper published to go with it. If you're a fan of hard-core technical detail, you'll love this one -- it's a wealth of useful information.
From the introduction:
A cluster of 32 high-performance hosts was used to run three demanding Hadoop applications. The performance of native and several VMware vSphere® configurations was compared. The apples-to-apples case of a single virtual machine per host shows performance close to that of native. Improvements in elapsed time of up to 13% can be achieved by partitioning each host into two or four virtual machines, resulting in competitive or even better than native performance. The origins of the improvements are examined and recommendations for optimal hardware and software configuration are given.
The bottom line?
Pretty much what you might expect: the "performance tax" is minimal with a single VM per server, and partitioning each host into two or four VMs improves elapsed time by up to 13%, which makes virtual competitive with physical and sometimes a bit faster.
There's also a wealth of practical advice for setting up your (virtualized) Hadoop cluster: sizing of VMs, ratio of disks to compute, and so on.
I enjoyed reading through it, but I'm a little strange in my tastes. Maybe you're strange like me.
Will Hard Data Win The Debate?
You'd think that solid data would convince people that their biases are incorrect. Maybe that's true in some cases, but people can be very stubborn indeed -- often beyond reason, it seems.
For those of you who agree that virtualizing Hadoop makes eminent sense -- but aren't looking forward to the inevitable pointless debates -- I'd like to remind you of an old story …
Many years ago when all of this was rather new, I asked an IT leader how he convinced his users to let him virtualize their applications.
"I never told them what I was doing."