Industry memes come and go, but industry markets are tangible things: they can be measured, segmented, analyzed and hotly debated by the participants.
Subjectively speaking, widespread usage of the term "big data" started in earnest at the beginning of 2011. One short year later, it appears to have made the important transition from meme to marketplace.
Analysts are weighing in. Interesting subsegments are emerging. Ecosystem vines are actively growing, thriving and intertwining. Various customer audiences are raising their hands and expressing strong interest in one set of topics or another.
All very encouraging from where I sit.
I thought I'd use this post just to bring you up to date on the current state of our industry "big data garden" and how it's growing ...
The Analysts Weigh In
You're not in an official "industry market" unless at least one industry analyst does a detailed sizing. Well, we now have two, both with slightly different perspectives.
The first serious take on the big data market came from Wikibon (free!). Their lens is mostly focused on large-scale analytics: hardware, software and services, and excludes other big data use cases that many of us find interesting as well. That being said, there's a great take on market growth, segmentation, revenue of the "pure play" vendors, etc.
All good grist for the mill, and the price is right :)
Another view comes fresh from the folks at IDC, whose perspective is predictably different -- also, much closer to my own personal world view. Unfortunately, access isn't free for most, but I'll take a moment to share what I found interesting.
First, their definition is much more encompassing and thus more intellectually appealing. A deployment qualifies if it meets any one of the following:
- Deployments where the data collected is over 100 terabytes (TB). IDC is using data collected, not stored, to account for the use of in-memory technology where data may not be stored on a disk, or
- Deployments of ultra-high-speed messaging technology for real-time, streaming data capture and monitoring. This scenario represents big data in motion as opposed to big data at rest, or
- Deployments where the data sets may not be very large today, but are growing very rapidly at a rate of 60% or more annually.
This definition thus encompasses an incredibly wide range of relevant use cases, from Facebook to such ostensibly mundane things as a big set of home directories with a fast growth rate, or a wildly successful VMware farm :)
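For the programmers in the audience, the three-way "or" above can be sketched as a simple predicate. This is only my own illustration of the definition as quoted here, not anything IDC publishes: the `Deployment` class, its field names and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of a deployment; field names are illustrative."""
    collected_tb: float        # data collected (not necessarily stored), in TB
    streaming_capture: bool    # ultra-high-speed messaging for real-time capture
    annual_growth_rate: float  # e.g. 0.60 means 60% growth per year


def is_big_data(d: Deployment) -> bool:
    """Return True if any one of the three quoted criteria is met."""
    return (
        d.collected_tb > 100            # big data at rest
        or d.streaming_capture          # big data in motion
        or d.annual_growth_rate >= 0.60  # small today, growing fast
    )


# Made-up examples: fast-growing home directories vs. a small, static data set.
home_dirs = Deployment(collected_tb=20, streaming_capture=False, annual_growth_rate=0.75)
small_static = Deployment(collected_tb=5, streaming_capture=False, annual_growth_rate=0.10)

print(is_big_data(home_dirs))     # True
print(is_big_data(small_static))  # False
```

Note that the modest home-directory example qualifies purely on growth rate, which is exactly why the definition casts such a wide net.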
Their "stack" is useful to consider as well: the predictable infrastructure components (server, storage, network), data organization and management (things like Hadoop), analytics and discovery (R, SAS, et al.) and finally decision support and automation to operationalize the factory.
With that out of the way, they peg the market at about $6.8 billion this year, with some breathtaking CAGRs including (wait for it) a 61% annual revenue growth rate through 2015 for storage.
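To get a feel for what a 61% CAGR means, here is a rough sketch of the compounding arithmetic. The three-year horizon is my assumption (roughly "this year" through 2015), and the function name is made up for illustration; the result is a growth multiple, not a dollar figure from the report.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Multiple of the starting value after compounding `cagr` for `years` years."""
    return (1 + cagr) ** years


# Assumption: three years of compounding at the quoted 61% rate.
multiple = growth_multiple(0.61, 3)
print(f"{multiple:.2f}x")  # 4.17x
```

In other words, anything growing at that rate for three years ends up a bit more than four times its starting size.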
They, too, point to the same industry skills gap inhibiting market growth as we at EMC have, so at least there's confirmation that we're not just making this stuff up.
The document is worth reading in its entirety, so if you have the $$$ and the interest, I'd recommend it. As newer versions are published, I hope to come back and chart the changes in their forecast.
The Ecosystem Grows
Another insightful development was the recent announcement by our friends at Spring introducing the first version of the Spring Hadoop Project.
Why is this interesting? If we refer back to my "big data analytics proficiency model", it has three distinct phases.
In the first phase, the organization invests in making all internal data sources easy to consume and easy to experiment with. Call this first phase "BI as a service" if you will.
The second phase is "bring in the data scientists to do their magic", but -- of course -- the first thing they're going to want is easy access to all the internal data sources.
And, finally, the third phase is to develop newer applications and workflows that operationalize the predictive models developed by the data scientists.
That third phase means substantial enterprise application development, and that's where Spring fits in.
Imagine, for example, a new mobile app that gave real-time insurance quotes based on real-time predictive analytical models. Not that any of our customers would want anything like that :)
Also in the perhaps-you-didn't-notice category: the Greenplum folks quietly made some important enhancements to their database environment as well. Of particular interest is a high performance version of gNet for Apache Hadoop environments (basically the interconnect optimizer for large numbers of shared-nothing compute nodes), as well as the somewhat predictable Data Domain Boost for Greenplum -- using the power of the compute nodes to dedupe and accelerate backups.
Yes, there is a growing number of people who realize they need to back up these *very big* environments. The current figure of 173TB in eight hours is nothing to sneeze at, but we eventually need to be in the petabyte-per-hour range. More work to do.
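How much more work? A quick back-of-the-envelope sketch, assuming decimal units (1 PB = 1000 TB) and a sustained average rate over the eight hours; the function and constant names are just for illustration.

```python
def throughput_tb_per_hour(total_tb: float, hours: float) -> float:
    """Average backup throughput in TB per hour."""
    return total_tb / hours


# Figures quoted above: 173 TB backed up in eight hours.
current = throughput_tb_per_hour(173, 8)

# Assumption: "petabyte-per-hour" taken as 1000 TB/hour (decimal units).
TARGET_TB_PER_HOUR = 1000
speedup_needed = TARGET_TB_PER_HOUR / current

print(f"current: {current:.1f} TB/hour")            # current: 21.6 TB/hour
print(f"speedup needed: {speedup_needed:.1f}x")     # speedup needed: 46.2x
```

So today's figure works out to a little under 22 TB per hour, and the petabyte-per-hour goal is roughly a 46-fold improvement.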
Also in the watch-this-space category is the first version of the Greenplum Command Center: clearly targeted at the IT professional who has to keep that big data analytics factory humming along.
A Successful Way Of Getting Started Emerges
So often, there emerges an interesting impasse on the journey.
On one side, there is a small group of passionate people who "get it" and are arguing for an investment around big data analytics and data scientists. On the other side, there is a much larger and somewhat more skeptical group who are looking at the big numbers involved, and are predictably hesitant.
Showing the latter group the mind-bending value of what data scientists can do -- and doing so without a substantial investment -- is becoming an increasingly popular customer engagement as part of the journey.
We bring the data scientists, you bring the data and some interesting questions -- and together, we find amazing nuggets of gold that make almost everyone "get it" and turn into enthusiasts.
The team of in-house EMC / Greenplum card-carrying data scientists we've assembled is compelling in its own right.
We've packaged the experience under the moniker of the Greenplum Analytics Lab (really more of a time-bounded workshop). Although our delivery capability is somewhat constrained, these engagements are starting to routinely produce magic in the right context.
I had naively assumed that deep vertical expertise was table stakes to do effective data science in most domains. I was dead wrong.
Part of the inherent value they bring is that, because they're not intractably close to the problem, they step back and let the data do the talking.
We've had more than one of these you've-got-to-be-kidding-me experiences in our own business at EMC.
For example, one of the things we care about in storage is disk drive availability. A well-armed data scientist came up with a far better predictive model than our considerable team of domain experts, and did it in almost nothing flat. Watching the interaction between the industry professionals and the new breed of data scientist -- priceless.
We were consequently hooked, and now there's all sorts of budget available for more data science and more data scientists.
The same thing is starting to happen in our customer environments as well. The Greenplum Analytics Lab typically results in one or more "you've got to be kidding me" insights, usually from people who might have the deepest and narrowest expertise in a particular field.
There are some interesting human cognition factors in play here that are worth exploring down the road as we get more data points, but we'll save that for later.
Data Science Summit 2012
As part of EMC World, we're holding the second instantiation of the Data Science Summit on May 23-24 in Las Vegas. The first gathering exceeded all expectations in terms of attendance, cool sessions on what people were doing with data science, and generally celebrating data scientists as the new intellectual heroes of the information age.
The sad part: I'll have other duties at EMC World during that time, but I am going to make every effort to attend as many sessions as humanly possible.
If you've got the data science bug like I do, you'll make every effort to attend.
The Academic Connection
Our viewpoint is that, with regard to data science, there are two distinct perspectives.
First, we believe that data science represents a legitimate and independent advanced curriculum -- at the graduate and post-doc level. Second, we believe that data science will emerge as one of the primary intellectual tools in so many established academic disciplines: from biotech to social science.
A while back, we introduced our own introductory coursework along these lines, which has turned into an incredibly popular offering.
We've also started to make progress in engaging with academic leaders along these lines.
Representing the East Coast faction, we're working with the MIT Center for Digital Business, led by Andrew McAfee and George Westerman, who are studying how organizations are using big data analytics, and how their business performance differs from their peers. And representing the West Coast, we're working with the Stanford Social Data Lab, led by Andreas Weigend, mostly focused on the new treasure-trove of social data, and how to better exploit it.
Much more to do here, but I can see demonstrable progress.
Sharing Our Internal Stories
Here at EMC, we're now actively infected with the big data analytics bug.
We're working to make the first phase (BI as a service) more broadly available inside of EMC, and we've had a number of internal Greenplum Analytics Lab engagements in various strategic locations across our internal EMC organization.
Yes, we're a technology company, so you might expect that -- but our starting point is perhaps more interesting: we haven't been the most analytically enabled company. Wait a year or so, though, and we show every sign of becoming one.
For that reason, I think we here at EMC will serve as an excellent example over the next year or so, detailing how a large, global company in an incredibly competitive business changes the way it does business using big data analytics. Hopefully, our experiences will help others.
I have good reasons to be optimistic: for example, we've already done that with social. We're now clearly reaping the benefits of our internal ITaaS transformation, remaking EMC IT into a competitive internal service provider. We've started to tackle mobility in a very strategic and thoughtful way.
And, of course, we're quickly learning to wield the new tools of data science and big data analytics.
The best news?
I think these appear to be very exciting times indeed for all of us as career IT professionals.