It hasn't taken long for Hadoop to be added to the long list of workloads that IT groups are now potentially responsible for. While Hadoop is not as ubiquitous as, say, Microsoft Exchange, you can start to see the occasional new cluster pop up here and there.
When you go looking for industry events that discuss the required infrastructure -- design and operations -- there's not a lot to choose from quite yet. But I did see that there is a clearly identified "infrastructure" track at the upcoming Hadoop Summit in San Jose, June 26-27.
If you're going, my VMware colleagues and I would very much like to meet with you. We'll be conducting a series of strategy feedback sessions around extending virtualization to meet the needs of tomorrow's big data analytics environments.
And if you have an opinion on these topics, why not share it? You just might help create some really great products down the road ...
How We Got Here
VMware has been busy extending its technology to bring the benefits of virtualization to popular Hadoop distros via Project Serengeti. But now the VMware team faces some interesting directional choices, and is looking for feedback.
We ran a similar series of storage-related strategy feedback sessions at EMC World recently, and they were quite good. We'd bring in interested parties, then run through a series of hypotheses to get feedback and commentary.
We didn't always expect to hear what we heard, but it was all extremely useful.
Now we'd like to do the same thing at the Hadoop Summit on a different set of topics.
What We're Interested In
For starters, we want to get a sense of where people are in all of this: just starting out, one or more clusters in production, or perhaps even further along than that?
One of our assumptions is that there's more to one of these production analytical environments than just the Hadoop toolset -- what does the entire workflow look like?
And then there are users -- what kind of people use the cluster? Code jockeys, Hadoop admins, data scientists, business users -- and what sort of experience do they prefer?
Data Management Questions
Some use cases have the luxury of greenfield data; others face the unenviable challenge of sourcing data from legacy apps that weren't designed for it. What's it like in your shop?
One would think that information governance -- controlling access to information sources and outputs -- might be a big concern in some environments. A problem today, or just a concern down the road?
If you see multiple clusters springing up over time, would they be sharing data sets? Or are the use cases reasonably isolated?
Infrastructure Questions Too ...
If you've got Hadoop clusters up and running, where does it hurt? What would you like to see fixed?
When it comes to sizing your clusters, is there any interest in being able to easily vary compute, capacity and bandwidth? Or does the uniform building-block approach look more appealing to you?
Is there any interest in backing up these environments? Or would it just be simpler to recreate them from existing data feeds? What about business continuity and disaster recovery? How do you see your answers potentially changing over time?
And There's Probably More
It's amazing how fast 90 minutes flies by when you get to talking about this stuff. I know we're not going to be able to get to all these topics.
Does this sound like something you'd be interested in?
If so, please get in touch!
You can either drop me a line at firstname.lastname@example.org, or if you prefer, send a note to email@example.com -- and, if you wouldn't mind, please tell us a bit about yourself and why you're interested.
We'd love to chat!