Some things in IT never change.
When I was a young pup in IT back in the 1980s, backup was a painful problem. Fast forward 20+ years, and it’s still a problem.
But new technologies can fundamentally change a familiar landscape. As an example, the introduction of VMware’s ESX server in 2001 fundamentally changed how we look at servers going forward – it’ll never be the same.
Today, I’m going to argue that 2007 will mark the beginning of a transition period for this whole topic, at the end of which we just won’t look at it the same way.
It’s not a single product or technology in this case; it’s a combination of things that – taken together – can fundamentally shift our whole thinking around this subject.
Context
What started simple enough (make a safe copy of your data so you can get it back if you need it) has turned out to be damnably complex.
Data tends to grow exponentially. Backup windows tend to shrink to zero. Recovery times want to be as short as possible. Costs (equipment, bandwidth, people) need to be contained.
And the granularity of the information to be protected continues to complexify. You want to protect just these files, just this application, just these applications taken together, just these business processes, just these data centers, and so on.
And I don’t think that anyone is predicting that any of these trends is in danger of reversing anytime soon, right?
Technology tries to keep up by being faster, cheaper, etc. but what we really need here is a Disruptive Change or two. And it's happening.
The first disruptive change – from backup to recovery to repurposing
New approaches come from wider lenses that connect the dots in new ways, and this topic is no exception.
The first discussion was around backup itself – how fast, how to minimize application disruption, how to reduce costs, and so on.
Along the way, we came up with incremental backups, hot backups for files and databases, clever use of backup servers to speed data movement, using disk as a target to speed backup, and so on.
Early use of disk replication technologies in the 1990s presented a powerful, though expensive, alternative – make a disk-based copy using the array. Whether locally or remotely, the perceived impact from the application was minimal – just enough to hold I/Os for a brief instant while the array did its work.
This is a pretty mature discussion, with one major new development – data deduplication. Just as full backups gave way to incremental backups (faster, cheaper), incremental backups will give way to data deduplication backups (faster, cheaper), but with some important differences, noted later.
But taken through the traditional “I’m only concerned with backup” lens, data deduplication can be thought of as the next step in addressing traditional backup issues: it runs faster, requires less bandwidth, needs far less storage on the backup device, and so on. I’m selling it short here, but that’s one aspect.
Before too long, the discussion shifted from backup to recovery.
In retrospect, it’s blatantly obvious – after all, the primary goal of backup is getting your data back, but for many people this was a serious shift of perspective, and it continues to this day.
The balance of power shifted. As an example, sure, incrementals were cheap, but it could be a holy terror doing a complete recovery: one full restore followed by umpteen zillion incremental restores. Disks became more attractive because they could speed recovery time whether it was a full or incremental restore.
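To make that restore-chain pain concrete, here is a minimal sketch in Python. The data model is an assumption made purely for illustration (a backup is a mapping of path to contents, and None in an incremental marks a deleted file); it is not how any particular backup product stores data.

```python
from typing import Dict, List, Optional

# Hypothetical model: a backup maps path -> contents.
# In an incremental, None means "deleted since the previous backup".
Backup = Dict[str, Optional[str]]


def restore(full: Dict[str, str], incrementals: List[Backup]) -> Dict[str, str]:
    """Rebuild the latest state: one full restore, then every incremental in order."""
    state = dict(full)
    for inc in incrementals:
        for path, contents in inc.items():
            if contents is None:
                state.pop(path, None)   # file was deleted
            else:
                state[path] = contents  # file was added or changed
    return state


full = {"a.txt": "v1", "b.txt": "v1"}
incrementals = [
    {"a.txt": "v2"},   # Monday: a.txt changed
    {"c.txt": "v1"},   # Tuesday: c.txt created
    {"b.txt": None},   # Wednesday: b.txt deleted
]

latest = restore(full, incrementals)
restore_passes = 1 + len(incrementals)  # every pass is another media read
print(latest, restore_passes)
```

The number of passes grows with every incremental taken since the last full, which is why a cheap backup scheme can turn into an expensive recovery.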
Along the way, things like backup reporting became more important – you needed to know what worked, and what didn’t. Backup applications learned to target disk, whether it was through direct access, or through a virtual tape library.
Checking databases for logical consistency became more important – no sense in backing up a scrambled database or filesystem. The whole topic of consistency groups (congroups in the patois) – interdependent applications that had to be logically consistent – became more important to more people.
But now, the discussion is shifting again – from backup to recovery to repurposing
This shift is being driven by a simple idea: if you have a copy of data lying around, it can be put to useful work.
We first got exposed to this during the late 1990s by customers working with TimeFinder, which makes one or more independent copies of data using the processing power of the Symmetrix.
We found that customers would take that second copy and do all sorts of useful things: run reports, load a data warehouse, do testing, and so on – and do it in such a way that it didn’t impact the production database. Now, the costs involved limited the appeal, but we saw people come up with very clever ways to use an additional copy.
Fast forward to today: there’s strong interest in active archiving, enterprise search, compliance, e-discovery, knowledge management and so on. All of which can be thought of as a repurposing of information beyond the intent of the creator.
Put differently, as long as IT has to spend money to protect data in order to recover it for production, why not get additional value from this data elsewhere in the organization?
The classic example of this was the evolution of email archiving. The first round of deployments was all about saving money: make the production instance smaller, make backups easier, make servers smaller, and so on. The second round was driven mostly by the legal department’s need to treat email as a business record. And the third round is being driven by the realization that historical email can be a valuable source of information that can be repurposed in a variety of ways.
I think this widening of perspective by most IT organizations will set the stage for this next evolutionary period. Call this the “informationist” view …
More importantly, it drives the technology vendors to think about things in new ways.
The second disruptive change – the technology begins to fall into place
But there are some important technology barriers to consider in this move from backup to recovery to repurposing.
First, tape devices aren’t really usable for any information repurposing application – they’re just too damn slow. Disk has to get cost effective enough that it can replace most of the tape in the environment.
Second, the tape backup format itself hampers alternative uses of information – information has to be readily available in native format for most repurposing applications.
Next, you’re going to need a new layer of software functionality – software that can pick up and look at a piece of information outside of the context in which it was created, and figure out what it means and what to do with it. Is this object important, or not?
Let’s take these one at a time …
The first barrier is making disk cheaper than tape.
Face facts – at a cost-per-usable-MB level, tape will probably always be cheaper than disk. Yes, disk prices continue to fall, but so do tape prices.
And, yes, you can construct TCO arguments that show disk is cheaper than tape once you add in all the externalities, but no one likes seeing those presentations. And then there’s the growing energy issue – a tape cartridge sitting on a shelf consumes absolutely no power.
The game changer here is data deduplication, and I’ll use EMC’s Avamar as a case in point. Pick your favorite reduction factor: 50 to 1, 100 to 1, 300 to 1, 500 to 1 – whatever, it doesn’t matter. You don’t have to do too much reduction before it becomes painfully obvious that disk media becomes cheaper than tape media.
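The arithmetic behind that claim can be sketched in a few lines of Python. The two media prices below are made-up placeholders, not real quotes, and the sketch assumes tape holds the data without deduplication while disk holds the deduplicated form.

```python
def effective_cost_per_gb(raw_cost_per_gb: float, reduction_factor: float) -> float:
    """Media cost per GB of protected data after deduplication."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return raw_cost_per_gb / reduction_factor


def break_even_reduction(disk_cost_per_gb: float, tape_cost_per_gb: float) -> float:
    """Reduction factor at which deduplicated disk matches tape on media cost."""
    return disk_cost_per_gb / tape_cost_per_gb


# Hypothetical prices, chosen only to illustrate the shape of the argument.
DISK_COST_PER_GB = 5.00
TAPE_COST_PER_GB = 0.50

break_even = break_even_reduction(DISK_COST_PER_GB, TAPE_COST_PER_GB)
print(f"break-even reduction factor: {break_even:.0f} to 1")

for factor in (50, 100, 300, 500):
    cost = effective_cost_per_gb(DISK_COST_PER_GB, factor)
    print(f"{factor} to 1 -> {cost:.3f} per protected GB (tape: {TAPE_COST_PER_GB:.3f})")
```

With these placeholder prices the break-even point is a reduction of 10 to 1, so every factor quoted above clears it comfortably. Plug in your own prices; only the ratio matters.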
You can’t use tape as a target when considering data dedupe – recovery times would be measured in days or weeks as the incremental bits would be scattered over thousands of tape cartridges – you’re forced to use disk.
Not to wax too poetic here, but backups run faster (less data being transferred), network costs are less (again, less data), and so on.
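For readers who haven't seen the idea up close, here is a toy illustration of deduplication in Python: split data into chunks, key each chunk by its hash, and store each unique chunk only once. The fixed four-byte chunk size and the ChunkStore class are assumptions for the demo; this is not a description of how Avamar or any other product is implemented.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # unrealistically small, so the example fits on screen


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept exactly once."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> List[str]:
        """Store data and return its recipe (the ordered list of chunk hashes)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already stored
            recipe.append(digest)
        return recipe

    def get(self, recipe: List[str]) -> bytes:
        """Reassemble the original data from its recipe."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
monday = b"AAAABBBBCCCCDDDD"
tuesday = b"AAAABBBBXXXXDDDD"  # only one chunk differs from Monday

monday_recipe = store.put(monday)
tuesday_recipe = store.put(tuesday)

logical = len(monday) + len(tuesday)  # bytes protected
physical = store.stored_bytes()       # bytes actually stored
print(logical, physical)
```

Two 16-byte backups end up occupying 20 bytes rather than 32, and either one can be reassembled in full from its recipe. Scale that up to nightly backups of mostly unchanged data and the large reduction factors start to look plausible.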
The downside, of course, is that you’ve got to go rip and replace your traditional, tape-oriented backup application. And, as we’ve seen before, just because it’s better doesn’t mean that people will do it.
The second barrier is ditching tape backup formats to make the information available for repurposing.
Just so we’re absolutely clear, if I want to use information for search, compliance, new apps, etc. I shouldn’t have to locate the relevant tape-formatted data object, bring it back, crack it open, save the bits I want, and throw the rest away. The information should be stored in native format, immediately accessible to any downstream use you can think of.
Two approaches here – intercept before backup, and backup in native format.
Intercept before backup can also be called archive before backup. EMC has done a ton of this in email, file and database environments – use a specialized application to sift through the primary repository, find things that are candidates for moving, yet preserve the “big view” of the environment that users want to see.
Everyone understands why this makes sense in an email environment, but people are also waking up to the fact that file systems are a frequent offender. I’m thinking of creating a new category of storage – write-only disk – because it seems that the vast majority of files are written and never read again.
Backup in native format (always to disk) means keeping a complete image around of our application database, email instance or file system. Full copies are expensive (but are in use), pointer-based copies are a bit more popular (snaps, etc.), but still expensive.
Once again, server-side data deduplication (e.g. EMC Avamar) changes the game. Backup images are presented as mountable file systems that sequence according to time: today, yesterday, last week, etc. All the information is available in native format for immediate use. And, as we pointed out earlier, it’s actually cheaper than tape.
To be fair, there are other data-dedupe technologies out there (e.g. DataDomain and others), but I draw a sharp line between those that present their backup images in native format, and those that don’t.
Again, the same barrier is resistance to change. Anytime you substitute native format for tape format, it’s a rip and replace proposition for your backup application.
With the first approach (intercept before backup) I think there's more application work to do, but your backup environment remains intact (and more productive). With the second approach (backup in native format), there's less application work to do, but the traditional backup approach has to go.
Finally, if you’re going to repurpose, you’re going to need software tools that do this.
The basic motion (also called intelligent information management, or IIM) is to scan an object (file, email, database record, image, etc.) and use a rich taxonomy and rule base to weed out what’s important, what’s not and why. EMC’s offering in this space for email is EMC emailXtender. For files, it’s EMC Infoscape.
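As a rough picture of what "scan an object and apply a rule base" means, here is a minimal Python sketch. The Item and Rule types and both rules are invented for illustration; real products use far richer taxonomies, and nothing here reflects the actual interfaces of emailXtender or Infoscape.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Item:
    kind: str  # "email", "file", ...
    name: str
    text: str


@dataclass
class Rule:
    label: str   # what to do with the item
    reason: str  # why the rule fired
    matches: Callable[[Item], bool]


# Hypothetical rule base; first matching rule wins.
RULES: List[Rule] = [
    Rule("retain", "mentions a contract",
         lambda it: re.search(r"\bcontract\b", it.text, re.IGNORECASE) is not None),
    Rule("delete-candidate", "temporary file",
         lambda it: it.kind == "file" and it.name.endswith(".tmp")),
]


def classify(item: Item, rules: List[Rule] = RULES) -> Tuple[str, str]:
    """Return (label, reason) from the first rule that matches the item."""
    for rule in rules:
        if rule.matches(item):
            return rule.label, rule.reason
    return "unclassified", "no rule matched"


email = Item("email", "Re: renewal", "Please sign the Contract by Friday")
scratch = Item("file", "scratch.tmp", "")
notes = Item("file", "notes.txt", "hello")

for item in (email, scratch, notes):
    print(item.name, classify(item))
```

The point is that the decision gets made from the object itself, outside the application that created it, and that it carries a reason you can audit later.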
Now you’re free to do different things – identify candidates for deletion or retention, load up important bits into a content repository like Documentum, feed a search engine if you like, build new applications like e-discovery, collaborative workflow and so on – the list is endless.
The third barrier – IT assuming responsibility for repurposing information.
It’s pretty clear who’s responsible for backup and recovery – it’s IT. But information repurposing? That’s not so clear.
A lot of it has to do with IT funding models, and how they’re evolving. The classic model is projects funded by business owners. If you’re smart, you try and save a little for infrastructure projects that need to be funded outside this model.
But will IT recognize the need to fund information repurposing capabilities?
No business owner is likely to show up at your door and say “hey, I’ve got this vision, and here’s the funding you’ll need …”.
So how will it get there?
The primary way I think it’ll get there is an add-on. As people get drawn into the magic of server-side data-deduplication for cost and performance reasons, they’ll realize they’ve bought a two-fer: not only do you have better backups, you’ve got a stellar platform for repurposing, if you choose.
Kind of a replay of what we saw with TimeFinder back in the 90s, but on a whole different scale.
The second way it’ll get there is what I’d call application synergy. At some point, an IT guy will be looking at his list of projects and see email archiving, enterprise search, knowledge management, e-discovery, maybe some collaborative BPM, and a light will go on. He’ll say “gee, these are all really different aspects of the same thing”, and connect the dots.
Again, an informationist view. I can only hope.
The third way (and cynically probably the most popular way) is second surgery. All these different technology stacks will go in disparately, and we’ll find a good opportunity to come in and make sense of the situation.
I’ve left a lot on the table in this discussion
I apologize for the length of this post, but there's even more to talk about.
I haven’t even brought in the RPO/RTO discussion. For those of you who aren’t acronymists by trade, that stands for Recovery Point Objective (just how much data would you like to lose?) and Recovery Time Objective (just how long would you like to wait before it comes back?).
The first leads you to discussions around CDP (think Kashya, now EMC Recoverpoint), the second leads you to tiering of technology and recovery media.
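A back-of-the-envelope check of those two objectives might look like the Python below. It assumes the simplest possible model: worst-case data loss equals the interval between backups, and restore time is data size divided by a sustained restore rate. The workload numbers are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Failing just before the next backup loses one full interval of changes."""
    return backup_interval


def estimated_restore_time(data_gb: float, restore_gb_per_hour: float) -> timedelta:
    """Simplistic estimate: size divided by sustained restore throughput."""
    return timedelta(hours=data_gb / restore_gb_per_hour)


def meets_objectives(backup_interval: timedelta, data_gb: float,
                     restore_gb_per_hour: float,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the recovery point and recovery time objectives are met."""
    return (worst_case_data_loss(backup_interval) <= rpo
            and estimated_restore_time(data_gb, restore_gb_per_hour) <= rto)


# Hypothetical workload: 2 TB of data, restored at 500 GB per hour.
RPO = timedelta(hours=1)
RTO = timedelta(hours=8)

nightly = meets_objectives(timedelta(hours=24), 2000, 500, RPO, RTO)
frequent = meets_objectives(timedelta(minutes=15), 2000, 500, RPO, RTO)
print(nightly, frequent)
```

In this example the nightly schedule restores fast enough but can lose up to a day of changes, so it misses a one-hour RPO. Tightening the RPO pushes you toward more frequent or continuous protection, while tightening the RTO pushes you toward faster recovery media.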
I haven’t even brought in the remote aspect – recovering from a distance, or multiple disparate locations. This leads you to all sorts of interesting discussions about distributed parity, hub-and-spoke topologies, and the like. Fun stuff.
And I haven't dug into how I think this technology will be packaged and deployed (integrated offerings? intelligent SAN switches? baked in as part of the array, file system, operating system or application? consumed as a service, rather than as a product stack?) Lots to discuss here, for sure.
Hey, if you’ve made it this far, it’s confusing enough, without me going off into alternate dimensions. But, never fear, I'll come back to this periodically and shine a light on a few of these related discussions.
Stepping back a bit
Backup really sucks, and it’s been that way for a while.
I think we’ve begun to enter a new era where all the pieces are coming together for a fundamental shift.
As people build their information infrastructure during the coming decade, I think this is an integral part of the discussion, and here's why:
- The lens is widening from backup to recovery to repurposing.
- The core technologies (data deduplication, intelligent information management) are falling into place.
- And, ultimately, more and more IT organizations are realizing that they need to take responsibility for the company’s most important asset – information.

Just a quick note on rip and replace with respect to Avamar. The fact that Avamar can expose the backup data as mountable filesystems allows customers to use Avamar to centralise their backups and then spool it all off to tape via their existing tape backup infrastructure.
No dead ends whatsoever.
And of course, as you've pointed out, one of the overlooked wins with Avamar is that it's showing real data, not a clump of data written in a file format readable only by the backup app which wrote it.
The more I see what EMC is doing with it the more I'm convinced EMC got a bargain.
Posted by: Storagezilla | January 05, 2007 at 03:43 PM
You're right -- I had forgotten!
A two-step is possible, and I think more than a few people may start with that route, but -- as we both point out -- it destroys the value of the resultant backup image for repurposing exercises.
I think we got a hell of a bargain.
Posted by: Chuck Hollis | January 05, 2007 at 03:48 PM
Great blog! I want to point out two things:
1. Presenting backup data in the native format is something backup vendors have played with for a long time. I believe it was the characteristics of the backup media that did not allow this feature to be exposed to general applications (surely, it can be used to do recovery itself). These characteristics are:
* Access time
* Sequential nature of the media
Now, if you use disk as the backup media, a lot of possibilities open up, and this is one of them. Only a full copy will preserve the native format of the data, so that you can access it w/o having any additional software in the I/O path to translate the request. If you have tape format, incremental backup pieces, or single instancing, you will need additional software or a protocol to expose the data in the native format.
2. Repurposing is a great idea. Read-only access does let you run some repurposing apps. But many would require write privilege to the copy. You don’t want to do that unless you have some protection for your copy!
Posted by: Kumar | January 10, 2007 at 06:57 PM
Hi Kumar
As far as your first observation, I would disagree a bit. As far as I can tell, traditional backup applications store objects on media (tape or disk) that are not directly usable by the application that created them, unless mediated or reconstructed by the backup application.
Snaps, clones, etc. are different, in that they are directly usable without this interpretation and/or mediation.
As far as your second comment, I don't see it as much of a problem. Many of the repurposing applications we encounter (search, workflow, compliance, knowledge management) are very happy with read-only copies of data, and -- as you point out -- that's the ideal state.
For those repurposing apps that require a (separate) writeable copy of information, existing file system snaps should be sufficient, and readily integratable with a data dedupe file system.
Thanks for the comment, and thanks for reading!
Posted by: Chuck Hollis | January 11, 2007 at 08:33 AM
Sounds like an EMC/Avamar pitch to me. With Avamar/EMC you must REPLACE your backup software. With other solutions such as Datadomain, you do not. The less disruptive solutions are the ones that don't require a full rip and replace of backup software. All deduplication is not equal also. The trick is not really in the deduplication it is in doing it with reasonable speed. I suggest you do some homework on Avamar vs others such as Datadomain.
Posted by: Blogger182 | May 19, 2007 at 11:10 AM
Hello "Blogger182" -- what a clever name for a Data Domain employee ...
So, yes, with all client-side data dedupe, you have to replace your backup client.
That's part of the deal.
However, I believe that client-side dedupe (such as Avamar) vs. target-side dedupe offers a few significant advantages (as shown by our internal testing of products such as Data Domain's).
1. Backup windows are much shorter -- you're only sending the changed bits across the network, rather than everything as with target-side dedupe. This can be pretty dramatic in many environments -- minutes instead of hours. We've put both approaches head-to-head, and it's not a fair contest.
2. Network traffic is far, far less -- for the same reason. Again, not a fair contest.
3. Compression rates are seriously improved for client-side dedupe, as comparisons are done against all data ever seen in the environment, rather than just what's in the latest backup stream.
4. And finally, the backed up information is stored in a native file format, which means it can potentially be used for other purposes (search, compliance, archiving, etc.) rather than locked away in an unusable tape blob on disk.
That being said, I think both approaches will find their homes in the market.
There will always be people who don't want to do heavy lifting (such as replacing their backup client), and are willing to give something up in the process, and they will find target-side dedupe products (like Data Domain's and others) more to their liking.
And, yes, I did my homework. Thanks.
Posted by: Chuck Hollis | May 20, 2007 at 02:06 AM