An increasing number of IT groups are finding themselves adding one or more Hadoop clusters to the mix. Project Serengeti virtualizes Hadoop clusters, making them easier to deploy and easier to manage.
It's one thing for a vendor to make that claim; it's another thing entirely to find a customer who agrees.
This blog post comes to you courtesy of Sasha Kipervarg, the Director of SaaS Operations at Identified -- a targeted analytics-based SaaS offering for recruiting professionals. I was curious about how people actually use Serengeti with VMware, and Sasha was kind enough to be interviewed.
I found a lot that was interesting here -- maybe you will as well?
The Business Of Talent Recruitment
Identified has a neat business model: use powerful algorithms that mine social feed data to help zero in on hard-to-find talent.
Their first product -- Identified Recruit -- helps recruiters target scarce healthcare professionals such as registered nurses, and has met with good success. From my perspective, there's obvious room to expand in other dimensions: tech professionals as well as other hard-to-find talent.
Their "secret sauce" isn't in the data itself -- it lies in their ability to extract signal from the noise, and create an easy-to-consume experience for people who generally have zero technology background.
Sasha's job is to keep up with all of the IT infrastructure demands of a fast-moving new business -- and do so with very limited resources.
Sasha -- how did you get started with Hadoop?
Like most start-up companies, we got started using Amazon Web Services. But it wasn't long before we realized we needed something better. For one thing, AWS can get really expensive on a per-use basis, and our entire business is built on analyzing data.
The second problem was inefficiency. Anytime we wanted to do anything, we'd have to spin up a cluster, load our data sets, do our work, and then reverse the entire process. That took a lot of time.
We needed a home we could live in comfortably vs. simply renting a hotel room every night.
So, what did you do?
We invested in building our own private cloud: VMware, flash storage, etc. It's a good-sized cluster that handles about 2-3 terabytes of active data. Its primary role is to run the main shared Hadoop cluster, as well as a collection of smaller one-off Hadoop clusters and various other workloads.
Most people are surprised about the modest size of our data sets -- it isn't big data, but it is very active data. For that reason, we invested in a flash-based SAN as well as server-based flash cards to maximize performance.
The important thing was keeping the same experience for our developers and analysts. They were quite used to spinning up resources on demand when we were on AWS, and of course they wanted the same experience with our in-house tech.
Did you consider a physical Hadoop cluster?
We also built a more modest physical Hadoop cluster (a handful of fat nodes, lots of CPU and memory) for a small number of jobs that required dedicated horsepower and needed the ultimate in performance and isolation. We're working towards pooling the compute and memory across the two environments, but we'll keep the storage back-end separate for performance reasons.
What do you like about Serengeti?
It's a great way to get going on Hadoop quickly without knowing a ton about Hadoop. The whole environment came up in a few hours on our existing VMware farm with very little work on our part. It's a great way to get started and get going if you're in a hurry. Everything you need is right there. Our entire main cluster is 100% virtualized.
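For readers curious what "came up in a few hours" looks like in practice, here is a rough sketch of the Serengeti command-line workflow. The hostname and cluster name are made up for illustration, and the exact command syntax may vary by Serengeti release -- treat this as a flavor of the experience, not a reference.

```text
# Connect the Serengeti CLI to the Serengeti management server
# (hostname is hypothetical)
connect --host serengeti.example.com:8080

# Create a Hadoop cluster with default node-group sizing;
# Serengeti provisions the VMs on the existing vSphere farm
cluster create --name mainHadoop

# Check provisioning status
cluster list --name mainHadoop
```

The point Sasha makes holds either way: the heavy lifting (VM templates, node placement, Hadoop configuration) is handled by the tool rather than by hand.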
We're also using VMware's availability features to harden the control plane portion of our physical cluster -- the data and task nodes are physical, everything else is virtualized.
From a management perspective, we have the potential to run our virtual Hadoop cluster the same way we run the rest of our IT investment: using VMware tools.
Have you encountered any challenges?
Well, we're mixing vHadoop workloads with other workloads on our primary VMware farm, and we have run into situations where we end up pegging the bandwidth of the back-end flash-based SAN array. We're now using some of the newer IO control features in vSphere to mitigate the problem, and that's working out OK. At the same time, we'd gotten to the point where we felt we needed a dedicated cluster for a handful of jobs, and that's why we built the physical cluster.
How much data sharing do you do?
Just about everyone in the company is working off of the same data sets, so almost all of our data is being shared. Most of the Hadoop work is done on the shared vHadoop cluster for that reason; but we've got a handful of tasks that now justify a separate storage pool for performance reasons.
You're doing a lot with Hadoop and VMware -- what's coming down the road?
Well, for one thing, we'd like to standardize on a full SQL access method to our data -- there's such a rich ecosystem of tools and people who know how to use them, so we'll be looking at Hive and similar SQL interfaces going forward.
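To make the appeal concrete: Hive exposes familiar SQL over data already sitting in the Hadoop cluster, so analysts can reuse existing skills and tooling. The sketch below assumes a hypothetical `profiles` table -- the table and columns are invented for illustration, not taken from Identified's actual schema.

```text
-- Hypothetical HiveQL: rank skills by candidate count
SELECT skill, COUNT(*) AS candidates
FROM profiles
GROUP BY skill
ORDER BY candidates DESC
LIMIT 10;
```

Under the hood Hive compiles this into MapReduce jobs, so the query runs on the same cluster without any data movement.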
We've got some very good automation with our production VMware cluster -- the one that runs the rest of the business -- and we're looking forward to recreating that operational model on our vHadoop cluster. Right now, it's sort of a semi-manual process to run things there. As I mentioned before, we'd like to create a single pool of virtualized compute and memory resources between our virtual and physical Hadoop clusters.
We're also starting to realize that -- before long -- we'll have to start backing up portions of our Hadoop environment, as we've started to create unique data sets that would be very difficult to recreate.
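One common way to protect such hard-to-recreate data sets is Hadoop's built-in DistCp tool, which copies HDFS data in parallel to another cluster or storage target. The paths and hostnames below are hypothetical; this is a minimal sketch of the idea, not Identified's actual backup process.

```text
# Hypothetical sketch: copy a derived data set to a second HDFS
# namespace using DistCp (paths and hosts are made up)
hadoop distcp \
  hdfs://prod-namenode:8020/data/derived/scores \
  hdfs://backup-namenode:8020/backups/scores/2013-05-01
```

Because DistCp itself runs as a MapReduce job, the copy scales with the cluster rather than bottlenecking on a single node.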
How much Hadoop expertise do you have in house?
Not a lot, but we're learning quickly. We want to get a better understanding of the components, how they work and interact, and get more familiar with some of the newer alternatives to the standard Apache distros that are out there. Right now, we're a Cloudera shop.
Thanks, Sasha, for sharing with all of us. I'm sure it's very much appreciated.