The term "big data" is starting to be used for newer applications that harness enormous quantities of information to do new things we couldn't consider in the past.
Whether it's the new world of analytics, enormous data and content repositories, or something else -- there's growing interest in this new realm. Massive compute + massive storage = brave new world.
Witness EMC's recent investments in Atmos, Greenplum, and -- most recently -- our proposed acquisition of Isilon.
But, in my mind, this is much too limiting a view.
I'd argue that perhaps we should extend our thinking even further -- and not only use the term "big data" to apply to these new and intriguing use cases, but also to refer to enormous information portfolios being created by more traditional applications: databases, files, etc.
Either way, I believe the conversation changes significantly when we start discussing petabytes vs. terabytes. The sheer scale changes the problem, and your approach to mastering it.
Visualizing The Petabyte(s)
So many zeros, and it starts to get hard to visualize what that amount of information might really mean. This clever infographic might help a bit.
In the physical world, a simple and useful visualization is to imagine about a thousand terabyte-class drives sitting in various racks.
From an infrastructure perspective, that's a lot of cost just to store, protect and manage that information portfolio: hardware, software, facilities, people, etc.
From a business perspective, what's the economic value of those petabytes of information? Does anyone really know?
Introducing The EMC Petabyte Club
Way back in 2000, we coined the term "petabyte club" for our very first customer who had a petabyte of EMC storage in production. It was a really big deal at the time.
We now have well over a thousand customers in the ever-growing EMC Petabyte Club. They each have at least one -- or frequently many more -- petabytes of EMC storage in production.
If you've ever wondered who could possibly want, for example, a VMAX with 2000 drives -- or multiples! -- we work with these people each and every day.
And -- as a class -- they tend to think about the world very differently than someone who has, say, a more modest hundred terabytes.
The Hardware Portfolio
Storage technology is moving fast, and -- if you're in the Petabyte Club -- you pay very close attention to newer technologies like enterprise flash, dedupe/compression, automatic tiering, etc. Since you're operating at considerable scale, these developments can make an enormous financial, operational and business impact.
But there's a problem -- you need a way to introduce the new technology in a non-disruptive fashion. Enter the world of large-scale storage migrations. At smaller scale, data migrations can be thought of as an event. In the Petabyte Club, it's usually an ongoing and continual process.
Members of our Petabyte Club are almost always migrating from one storage technology to another -- and for all the right reasons. The ability to do this safely, predictably and non-disruptively is a huge priority for them -- so it's something we invest in.
Even such mundane things as, say, EMC's 20% Efficiency Guarantee for unified storage become a big deal at this scale -- we're talking 200 TB (or more!) of capacity that now doesn't need to be purchased, powered, cooled, managed, etc.
Little numbers become really big numbers.
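To make the arithmetic concrete, here is a minimal sketch in Python. It assumes decimal storage units (1 PB = 1000 TB) and a one-petabyte footprint, which is what the 200 TB figure above implies. The function name and the sample footprints are illustrative assumptions, not anything taken from an EMC tool.

```python
# Decimal storage units are assumed here: 1 PB = 1000 TB.
TB_PER_PB = 1000


def capacity_avoided_tb(footprint_pb: float, efficiency_gain: float) -> float:
    """Return the capacity (in TB) that no longer has to be purchased.

    footprint_pb    -- size of the storage estate in petabytes
    efficiency_gain -- fraction of capacity saved, e.g. 0.20 for 20%
    """
    return footprint_pb * TB_PER_PB * efficiency_gain


# The same 20% looks very different at different scales (sample footprints).
for footprint_pb in (0.1, 1, 5):
    saved_tb = capacity_avoided_tb(footprint_pb, 0.20)
    print(f"{footprint_pb} PB footprint: about {saved_tb:.0f} TB avoided")
```

The same percentage that saves a handful of drives in a 100 TB shop removes entire racks once the footprint is measured in petabytes.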
The Process Portfolio
The efficiency of operational processes (and supporting technologies) becomes absolutely crucial when operating at this scale.
Informal "spreadsheet and script" approaches don't scale, and storage management tools are no panacea for poor process. At this scale, serious effort is usually put into building a dedicated storage team: separate functions for strategy, design, implementation, integration -- even dedicated asset management functions.
These extremely proficient teams rarely evolve organically -- more often, there's a forcing function that represents a clear departure from the old way of doing things followed by a top-down org+process redesign.
I would argue that it's hard to really "know" storage until you've been responsible for a petabyte of the stuff. And gaining 10%, 35% or 60% in process efficiency, well -- that turns out to be a very big deal at this scale.
The Quintessential Service Catalog
Not all information is created equal, and the value of information tends to change over time. The practice of categorizing information -- even at a macro level -- and mapping those buckets to a distinct set of storage-related services (performance, availability, recoverability, compliance, etc.) is an essential construct for almost anyone who's in the Petabyte Club.
A service catalog is *not* "well, we put this data on Product X, and that data on Product Y".
Ideally, the storage service catalog is constructed in a technology and vendor neutral manner. Once you've got a service catalog, you start looking for vendors and partners to help you deliver it in the best manner. It's about the services you deliver, less about the specific products.
Although we'd make an argument that EMC does it best :-)
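As a rough illustration of what "technology and vendor neutral" can look like, here is a small Python sketch. The tier names, attributes and numbers are all hypothetical; the point is only that each tier is defined by the service it delivers, and that a workload is matched to the cheapest tier that meets its stated requirements.

```python
# Hypothetical catalog: every tier is described by service attributes only.
# No product or vendor names appear anywhere in the definition.
CATALOG = {
    "gold":   {"max_latency_ms": 2,  "availability_pct": 99.999, "rpo_minutes": 0},
    "silver": {"max_latency_ms": 10, "availability_pct": 99.99,  "rpo_minutes": 60},
    "bronze": {"max_latency_ms": 30, "availability_pct": 99.9,   "rpo_minutes": 1440},
}

# Tiers ordered from least to most expensive (an assumption for this sketch).
TIER_ORDER = ["bronze", "silver", "gold"]


def cheapest_tier(max_latency_ms, availability_pct, rpo_minutes):
    """Return the least expensive tier that satisfies all three requirements.

    Returns None when no tier in the catalog is good enough.
    """
    for name in TIER_ORDER:
        tier = CATALOG[name]
        if (
            tier["max_latency_ms"] <= max_latency_ms
            and tier["availability_pct"] >= availability_pct
            and tier["rpo_minutes"] <= rpo_minutes
        ):
            return name
    return None


# A workload that tolerates 20 ms latency and up to 4 hours of data loss.
print(cheapest_tier(max_latency_ms=20, availability_pct=99.9, rpo_minutes=240))
```

Which products actually deliver each tier is a separate mapping that can change over time without touching the catalog itself.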
The Allure Of Metadata
Static assignment of information classes to service levels is somewhat unappealing, especially at this scale. There's usually insufficient granularity, and there's a natural tendency to overprovision service levels "just in case".
One increasingly popular theme is to use a hybrid approach: statically assign a class of information to a service level (e.g. "SQLserver gets bronze"), and then use automatic tiering and archiving technologies to further reduce costs. Automatic tiering tries to figure out the importance of information based on how frequently it's used, moving popular data to faster media, and less popular data to more inexpensive media.
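The general idea behind frequency-based tiering can be sketched in a few lines of Python. This is a deliberately simple greedy placement, not the algorithm used by FAST or any other product. The `Extent` type, its access counter, and the tier names "flash" and "capacity" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """A hypothetical chunk of data with a recent-activity counter."""
    name: str
    size_gb: int
    accesses_last_week: int


def assign_tiers(extents, flash_capacity_gb):
    """Greedy placement: the busiest extents go to flash until it is full.

    Extents are ranked by access density (accesses per GB), so a small,
    very busy extent beats a large, moderately busy one.
    """
    placement = {}
    remaining_gb = flash_capacity_gb
    ranked = sorted(
        extents,
        key=lambda e: e.accesses_last_week / e.size_gb,
        reverse=True,
    )
    for extent in ranked:
        if extent.size_gb <= remaining_gb:
            placement[extent.name] = "flash"
            remaining_gb -= extent.size_gb
        else:
            placement[extent.name] = "capacity"
    return placement


# Sample data, invented for this sketch.
extents = [
    Extent("orders-db-index", 200, 90_000),
    Extent("orders-db-history", 800, 4_000),
    Extent("file-share-2008", 1500, 150),
]
print(assign_tiers(extents, flash_capacity_gb=500))
```

A real implementation would re-evaluate placement continuously and move data in the background, but the ranking step is the heart of it.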
Cost-savings potential is directly correlated with ease of implementation. FAST, for example, is an easy-to-set-up feature in most EMC arrays, so it's wildly popular. Purchasing and implementing specific information management tools (e.g. filesystem and/or email and/or database archiving) needs a bit more cost/benefit analysis -- but since we're talking significant scale, the rationale is usually there.
But there's growing interest in doing even more -- especially in our Petabyte Club.
Since much of this data lives in files and objects, the opportunity exists to generate data (whether external metadata living in a repository, or tightly associated with the object itself) that provides explicit instructions on how it's supposed to be handled: performance, availability, recoverability, location, compliance, security, audit trails, etc.
The more metadata, the easier it becomes to not only automate the process, but ensure that it's being done properly -- hence growing interest in the parts of the EMC portfolio that can generate and use metadata for a variety of use cases.
For widely geographically dispersed models, that's where technologies like Atmos come in -- the physical location(s) of the information become one of the primary optimization vectors.
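Here is a minimal Python sketch of metadata carrying explicit handling instructions, and of using that same metadata to check that the handling is being done properly. The `HandlingPolicy` fields, the region names and the `validate_placement` helper are all hypothetical. They are not the metadata schema of Atmos or of any other EMC product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingPolicy:
    """Hypothetical per-object metadata describing how data must be handled."""
    performance: str        # e.g. "standard" or "high"
    copies: int             # minimum number of replicas
    retention_years: int    # how long the object must be kept
    allowed_regions: tuple  # where replicas are permitted to live


def validate_placement(policy, replica_regions):
    """Return a list of policy violations for a proposed set of replica regions.

    An empty list means the placement satisfies the policy.
    """
    problems = []
    if len(replica_regions) < policy.copies:
        problems.append(
            f"need {policy.copies} copies, found {len(replica_regions)}"
        )
    for region in replica_regions:
        if region not in policy.allowed_regions:
            problems.append(f"region {region!r} not permitted")
    return problems


# An object that must stay inside two (made-up) European regions.
policy = HandlingPolicy(
    performance="standard",
    copies=2,
    retention_years=7,
    allowed_regions=("eu-west", "eu-central"),
)
print(validate_placement(policy, ["eu-west", "us-east"]))
```

Because the policy travels with the object, the same record can drive both the automation and the audit.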
Services Are Popular At Petabyte Scale
From constructing storage service catalogs, to implementations and migrations, to advice on how to design and staff a petabyte-scale storage team, to strategies for generating and managing policy-oriented, metadata-based management schemes, to even delivering storage as a managed service -- there's strong demand for services and consulting engagements with people who've done it before.
Indeed, EMC has developed a broad and robust set of professional and consulting services around these themes, and customer demand is growing every month.
And let's not forget customer support services, either. Someone who operates at this scale needs and deserves customer service models and services tailored to their environments, and not a standard off-the-shelf help desk to call when there's a problem. Every year, we've had to step up our game for our customers and partners who operate at this scale.
Big Problems Come In Smaller Sizes As Well
Part of me realizes that the challenges associated with "big data" emerge in smaller environments as well. If you've got a modest IT staff and budget, you're still coping with many of these issues, but can't really consider multiple storage platforms, specialized staff and associated processes, and all of that.
The opportunity -- I believe -- will be for vendors to provide the same sorts of capabilities needed by our Petabyte Club in smaller, integrated and highly automated packages that assume an IT generalist vs. an IT specialist.
After all, isn't it only a matter of time before even more modest organizations become members of the Petabyte Club as well?
Back To Big Themes?
We appear to be rapidly transitioning to an information economy: the businesses of tomorrow will be powered by the generation, manipulation and consumption of ever-growing amounts of information. And -- above all else -- I see this as the fundamental reason that so many people are interested in storage today -- especially at substantial scale.
Some of these new information sources will simply be linear extensions of what we're already doing today. Others will be entirely new use cases that start with petabytes, and grow quickly from there.
For those of you who are part of the informal EMC Petabyte Club -- we thank you for your trust in us, and we look forward to tackling tomorrow's information management challenges together.
And for those of you who are looking at your storage footprint and wondering "where the heck does all of this data come from?" -- well, we might be talking soon :-)

it would be nice to see more *data* about the petabyte club. don't identify the customers... but it would be great to be able to graph some of this stuff...
why not make it a formal petabyte club?
Posted by: Monkchips | November 24, 2010 at 07:43 AM
Monkchips:
I agree -- it would be very interesting to understand them more fully, how they came to be, what made them different, or the same.
Unfortunately, that would be tantamount to doing large-scale market research on behalf of our competitors. We had to learn about these customers' needs the hard way -- and they should too!
-- Chuck
Posted by: Chuck Hollis | November 24, 2010 at 09:02 AM
I will give you some insight into where some of this data is being created: http://www.hpl.hp.com/news/2009/oct-dec/cense.html HP is developing both the instrumentation and infrastructure to create and harness these large data sets. We know the verticals and applications that are generating these petabytes, and the slides on Isilon give some clues.
Posted by: Andy Sparkes | December 03, 2010 at 06:50 AM