The recent introduction of data deduplication technologies by many vendors (including EMC) has opened up a whole new discussion with customers on how to store information more intelligently -- by minimizing the waste.
I thought it would be useful to step back a bit, look at the entire landscape of data reduction technologies, and try to offer a bit of perspective.
Trust me, this is just the opening salvo in what will likely be an extended debate for quite some time, so let the opinionating begin!
Goal: Cut The Waste
We all intuitively know that much of the information we manage is stored inefficiently -- there are different forms of waste and redundancy.
And there's a growing landscape of approaches to directly attack this problem.
I've fallen into the habit of using the term "data reduction" to refer to all the cool technologies (compression, single-instancing, data deduplication, maybe even thin provisioning) that can help you do more with less.
Not surprisingly, each has its strengths and weaknesses.
And, not surprisingly, none of them is a "silver bullet" that magically reduces storage requirements by an order of magnitude. But I think all of them will find their way into different parts of the landscape.
The trick will be to use the right approach in the right place, and -- since the technologies are evolving rapidly -- it'll certainly be a fun discussion for quite a while.
So let's start our tour ...
Compression
Most folks are familiar at some level with compression, usually done by looking for redundant patterns in the data stream, creating a code book on-the-fly, and squeezing redundant information from a single storage object, like a file or a backup stream.
Most people are also aware that your mileage will vary dramatically depending on what you're compressing. Sure, it's easy to find types of data that compress easily, but that's changing over time.
More and more of our information is rich content (images, video, voice, etc.) and that's usually already stored in a compressed format. And if you've ever tried compressing things twice, you know it's a pointless exercise.
More and more applications (e.g. PowerPoint) have taken to compressing their native file format. And we're starting to see more efficient use of space in databases and the like.
And, yes, there's a performance penalty for compression, but with faster processors, that's becoming less of an issue. As an example, EMC's Disk Library does compression in the controller appliance, and it doesn't seem to impact performance that significantly. And, like the tape compression it replaces, your mileage will vary.
But there's a point here -- compression (when it's useful) is only valuable against the specific file (or data stream) it's compressing. Specifically, the code book that is generated in a compression session only looks at what's immediately in front of it. There's no ability to spot additional redundancy against all files or data objects, or to look at changes over time, as with backup.
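To make that point concrete, here's a small sketch using Python's standard zlib module. The payload is made up for illustration; the only claim is that two identical objects compressed separately cost twice the space, because each compression session sees only its own stream.

```python
import zlib

# Hypothetical file content: repetitive text, so it compresses well by itself.
payload = b"Q3 sales review, slide text and speaker notes. " * 2000

# Two users save the exact same file; each copy is compressed independently.
copy_a = zlib.compress(payload)
copy_b = zlib.compress(payload)

# Each copy shrinks, but the second copy is paid for in full again:
# the compressor has no knowledge of objects outside its own stream.
print("original bytes:       ", len(payload))
print("one compressed copy:  ", len(copy_a))
print("two compressed copies:", len(copy_a) + len(copy_b))
```

Compression helps inside each object, but the redundancy between the two objects goes completely unnoticed.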
Which leads us to ...
Single Instancing
One of the coolest features introduced with Centera was single instancing.
If two or more people saved the exact same object, the hash codes produced would be identical, meaning that the object would only have to be stored once.
It turned out to be very useful for things like email archiving, where someone would send a 20 MB PowerPoint or video to 5,000 of their co-workers. Email or file archiving software would archive the object, and -- of course -- the object would be identical.
It also turned out to be a great way to gain significant data reduction in environments where there were potentially lots of copies of the exact same files floating around -- PowerPoints, video, images, etc.
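Here's a minimal sketch of the idea, not how Centera actually implements it. The class name, the choice of SHA-256, and the small stand-in attachment are all assumptions made for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy content-addressed store: identical objects are kept only once."""

    def __init__(self):
        self._objects = {}       # content hash -> object bytes
        self.logical_bytes = 0   # what the users think they stored

    def put(self, data: bytes) -> str:
        # SHA-256 is an assumption here; any strong content hash would do.
        digest = hashlib.sha256(data).hexdigest()
        self.logical_bytes += len(data)
        self._objects.setdefault(digest, data)  # store only if never seen
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

    @property
    def physical_bytes(self) -> int:
        return sum(len(obj) for obj in self._objects.values())


store = SingleInstanceStore()
attachment = b"slide deck bytes " * 1200   # small stand-in for a 20 MB file

# The same attachment lands in 5,000 archived mailboxes.
for _ in range(5000):
    store.put(attachment)

# One person edits the deck slightly and saves it again.
edited = attachment + b"one more slide"
store.put(edited)

print("logical bytes: ", store.logical_bytes)
print("physical bytes:", store.physical_bytes)
```

The 5,000 identical copies cost the space of one, but the slightly edited copy hashes differently and is stored again in full.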
Yes, people would make copies on their local file system, but when it was archived, there were great cost savings. And, of course, producing a hash code can be computationally intensive in some situations, so you don't see blazing write speed at the high end.
But, that notwithstanding, there are literally thousands and thousands of Centeras out there today, merrily performing data reduction (through single instancing) without any fuss.
And -- conceptually -- this approach covers the two dimensions very well: space (all files and objects in the enterprise that it can see) as well as time (all previous instances of an object). So it has a very large domain to work its magic.
But it's best to think of single instancing as another tier of storage. The performance and cost characteristics are different.
Now, if someone were to open one of those saved PowerPoints, make a few changes, and save it again, well, that'd be a separate object, and would have to be stored separately.
Hence the interest in ...
Data Deduplication
I can see it already -- we'll be talking about this one for some time.
The idea behind data deduplication is to exploit the similarity within data objects when considered across a larger domain.
Without being too precise, data deduplication scans looking for "chunks" of data it's seen before, and then substitutes a small reference to a much larger (redundant) chunk of data.
So, back to our example of the person who modifies and saves a PowerPoint: only the modified bits would be seen as distinct. Nice trick.
From a pure technology perspective, you'd want three major things.
You'd want the "chunk" size to be variable to exploit redundancy in any object at any point. Fixed-length approaches would miss a bunch of opportunity.
You'd want the domain of files and objects to be as large as possible -- the greater the space, the bigger the potential impact. If you're only looking at a single PC, or server, or file system, the impact is reduced.
And finally, you'd want it to function over the time domain -- all versions of this object that were seen at any time.
Put differently, think backup.
Keep in mind that, in general, for every 1 GB of production file system, there's anywhere from 2.5 GB to 8 GB (or more!) of backup images floating around. So there's an order-of-magnitude bigger problem to go tackle in the backup arena than the production arena. Hence the interest in the time domain.
The opportunity for data dedupe is much greater when it can address all three issues: variable chunks, across all files, over time. Limit it to fixed chunks, subsets of files, or production only -- very reduced impact.
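The variable-versus-fixed chunk point can be sketched in a few lines. This is a toy, not any vendor's algorithm: the window size, the boundary mask, and the use of CRC32 over a sliding window are all assumptions (real products typically use a proper rolling fingerprint), and the "file" is just seeded random bytes.

```python
import hashlib
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64):
    """Cut every `size` bytes, regardless of content."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Cut wherever the last `window` bytes hash to a chosen pattern,
    so boundaries follow the content rather than the byte offset."""
    start = 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            yield data[start:end]
            start = end
    if start < len(data):
        yield data[start:]


def new_bytes(chunker, old: bytes, new: bytes) -> int:
    """Bytes that must be stored for `new` if `old` is already stored."""
    seen = {hashlib.sha256(c).digest() for c in chunker(old)}
    added = 0
    for chunk in chunker(new):
        key = hashlib.sha256(chunk).digest()
        if key not in seen:
            seen.add(key)
            added += len(chunk)
    return added


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
# Someone inserts a few bytes in the middle and saves a new version.
edited = original[:10000] + b"new slide" + original[10000:]

print("fixed chunks, new bytes:   ", new_bytes(fixed_chunks, original, edited))
print("variable chunks, new bytes:", new_bytes(variable_chunks, original, edited))
```

With fixed chunks, the small insert shifts everything after it, so roughly half the file looks new. With content-defined boundaries, only the chunks around the edit change.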
Incidentally, this is where dedupe backup vendors (such as EMC with Avamar) get away with claiming seemingly outrageous data reduction rates: they do all three things, unlike incremental backups.
Now, a couple of caveats. First, EMC believes that dedupe backup is dramatically more effective when done at the client side (using Avamar, for example) as opposed to the target (as with Data Domain and a few other newer entrants).
Why? You don't have to transmit the object to realize that you've seen it before -- backups are smaller/faster, networks don't get hosed, etc.
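A rough sketch of why the client side matters for the network. This is not Avamar's actual protocol; the server class, the 4 KB chunk size, and the "ask by hash first" exchange are assumptions made purely to show the byte counts.

```python
import hashlib


class BackupServer:
    """Hypothetical backup target that stores chunks by content hash."""

    def __init__(self):
        self.chunks = {}

    def has(self, digest: bytes) -> bool:
        return digest in self.chunks

    def store(self, digest: bytes, data: bytes) -> None:
        self.chunks[digest] = data


def client_side_backup(server: BackupServer, chunks) -> int:
    """Send each hash first; send the chunk only if the server lacks it.
    Returns the number of bytes that crossed the (imaginary) network."""
    sent = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        sent += len(digest)              # the 32-byte question: "seen this?"
        if not server.has(digest):
            server.store(digest, chunk)
            sent += len(chunk)           # only unseen data is transmitted
    return sent


# Monday: 100 distinct 4 KB chunks. Tuesday: one of them has changed.
monday = [i.to_bytes(4, "big") * 1024 for i in range(100)]
tuesday = list(monday)
tuesday[42] = b"edit" * 1024

server = BackupServer()
monday_sent = client_side_backup(server, monday)
tuesday_sent = client_side_backup(server, tuesday)
target_side_sent = sum(len(c) for c in tuesday)  # target dedupe ships it all

print("Monday, client side: ", monday_sent)
print("Tuesday, client side:", tuesday_sent)
print("Tuesday, target side:", target_side_sent)
```

Both approaches end up storing the same chunks. The difference is how much has to travel over the wire before anyone notices the data is redundant.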
Second, there's no free lunch. Identifying and exploiting partial data redundancies in files is computationally intense, so there are more than a few use cases that this won't work for. And the approach hasn't been proven to work on things like mission-critical transaction databases (yet!)
And as a final caveat, many archives are used for compliance purposes, which means you have to be able to legally prove that the data hasn't been tampered with since it was stored. I don't think that dedupe approaches will find it easy to get the legal approval for compliance use cases, but that's just an opinion.
Recently, NetApp announced their intent to make data dedupe part of the production file environment. Nice idea, but I think it won't be as glamorous (or impactful) as deduping the backup stream. And if you're going to be deduping the backup stream, the best place to do it is at the client, not the target.
And I don't think there's any way of getting around the performance issue -- writing directly to a dedupe file system will be s-l-o-w. No free lunch, unless you've got a ton of processor and memory to throw at the problem.
This means that you have to hide this from the user by doing it at some later time, which introduces some new issues.
Simply put, it'll be another tier of storage. So it ain't production.
More importantly, unlike the backup example, you only get to exploit data reduction in one dimension (space) rather than two (space and time). And if it's only limited to a single, small domain, impact will be reduced yet again.
Put differently, you probably won't see the amazing reduction rates for this type of data dedupe, and it'll be highly dependent on the types of files you're storing.
If you have many instances of the exact same file floating around, single instancing for "production" will be fine. Data dedupe will only be effective if there are many slightly different versions of the same file floating around.
Going a bit further, there won't be a global file space to find redundancy -- you'll only get to spot redundancy on a single filer, not across multiples. The larger the file space, the better the opportunity to exploit redundancy.
Incidentally, as part of the Avamar sales cycle, there's an honest discussion about what we've seen in terms of data reduction in production environments. Some numbers are small (e.g. 20:1 or so), others seem to be outrageous (200:1) until you think about it a bit.
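Here's one way to "think about it a bit", as back-of-the-envelope arithmetic. Every number below is hypothetical (a 1 TB file system, daily full backups, 0.5% daily change, 90 days of retention) and is not a measured Avamar result.

```python
# All inputs are hypothetical, chosen only to show how the ratio arises.
full_backup_gb = 1000        # size of one full backup image
daily_change = 0.005         # fraction of data that is new each day
retained_days = 90           # number of daily fulls kept

# Without dedupe, every retained full backup is stored in its entirety.
logical_gb = retained_days * full_backup_gb

# With dedupe, the first full is stored once, then only each day's new data.
physical_gb = full_backup_gb + (retained_days - 1) * full_backup_gb * daily_change

ratio = logical_gb / physical_gb
print(f"logical {logical_gb} GB, physical {physical_gb:.0f} GB, about {ratio:.0f}:1")
```

Under those assumptions the ratio comes out at roughly 62:1. Longer retention or a lower change rate pushes it higher, which is how the big numbers come about.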
And, not surprisingly, as part of the Centera sales cycle, we share what we've seen from production environments in terms of storage savings. Yes, it's a broad range as well.
As always, your mileage will vary.
But, it'd be nice to see some actual data reduction numbers for production dedupe at some point.
That being said, I would expect all the storage vendors to be working on something like this.
A nice-to-have feature -- if used properly.
Which brings us to ...
Thin Provisioning
I don't know whether this belongs in the discussion or not.
Yes, it technically is a form of data reduction for unused space. But I have an issue with the fact that it's sometimes used as a way to mask poor storage management practices, and -- at the end of the day -- you're lying to users. I have a problem with that.
I've written before about how I think thin provisioning is a two-edged sword, and can create more problems than it solves. We've had it in one of our products for a while (Celerra) and we've had to take exceptional steps to make sure that users don't get bitten by poor performance through improper configuration, or by a dramatic crash when an app tries to do a write and the storage device says "no can do".
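The "no can do" scenario is easy to sketch. This toy model has nothing to do with how Celerra works internally; the class, the block counts, and the use of an out-of-space error are assumptions for illustration.

```python
import errno


class ThinPool:
    """Toy thin-provisioned pool: volumes advertise more than physically exists."""

    def __init__(self, physical_blocks: int):
        self.physical_blocks = physical_blocks
        self.volumes = {}     # volume name -> advertised size in blocks
        self.mapped = set()   # (volume, block) pairs backed by real storage

    def create_volume(self, name: str, advertised_blocks: int) -> None:
        # No physical space is reserved here; that is the "thin" part.
        self.volumes[name] = advertised_blocks

    def write(self, name: str, block: int) -> None:
        if not 0 <= block < self.volumes[name]:
            raise IndexError("write beyond advertised volume size")
        if (name, block) in self.mapped:
            return  # overwriting a block that is already backed
        if len(self.mapped) >= self.physical_blocks:
            raise OSError(errno.ENOSPC, "thin pool exhausted")
        self.mapped.add((name, block))

    @property
    def oversubscription(self) -> float:
        return sum(self.volumes.values()) / self.physical_blocks


pool = ThinPool(physical_blocks=100)
pool.create_volume("app1", 100)   # both apps are told they have 100 blocks
pool.create_volume("app2", 100)

for block in range(100):          # app1 really does use its space
    pool.write("app1", block)

try:
    pool.write("app2", 0)         # app2's very first write has nowhere to go
    outcome = "ok"
except OSError as exc:
    outcome = exc.strerror

print("oversubscription:", pool.oversubscription, "| app2 write:", outcome)
```

Each volume looked fine when it was created. The failure only shows up later, at write time, in an application that did nothing wrong.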
I loved Barry's comment that we screwed up naming in the industry -- maybe we should have called thin provisioning by its more accurate name -- "storage virtualization". After all, that's what server virtualization, and memory virtualization do -- make it look like you have more than you really do.
Maybe all of these data reduction technologies should be renamed "storage virtualization".
Well, that's not going to happen, is it?
Wouldn't It Be Great?
One of the problems here is that these are all different technology stacks that each have their role, and customers will want to use all of them in a simpler fashion.
Wouldn't it be great if there was a file system that understood your information, kept persistent versions of everything, and used the right combination of data reduction technologies at the right time?
Some version logging here, a little compression there, a little single instancing over here, maybe some dedupe thrown in ... all seamless and transparent to the user. It'd be cool, and solve a real problem.
Well, we don't live in that world -- yet!
Bringing It Full Circle
Compression.
Single Instancing.
Data Deduplication.
Maybe even thin provisioning.
All different approaches to use less storage, or at least use it more effectively.
Which is why I call the category "data reduction".
So how does an IT user decide what to do?
Each has its pros and cons. None are a silver bullet.
It's a familiar discussion -- you need to understand your data. What it is, how it's represented, how it's used, how important it is -- which leads us to classification tools.
Landing the wrong kind of information on the wrong kind of technology won't be pretty. But I'm sure it will happen anyway.
I think this will be an extremely popular technology debate in the near future.
Lots and lots will be written about it. And the avalanche of opinionating has just started.
Hmmm, maybe we'll need these data reduction technologies to store it all ;-)
