
October 31, 2008

Listed below are links to weblogs that reference Why Don't We See More Source-Side Dedupe?:

Comments


Scott Waterhouse

Let me add two more reasons:

1) Inertia. Never underestimate the unwillingness of backup admins to change. Backup applications are the stickiest applications I have ever seen. People change mail systems more often. So it takes a disruptive event (like VMware) for it to be a real consideration.

2) We are the only ones really championing source side dedup. Symantec more or less gave up the ghost and went target side with PureDisk, and most other vendors just don't "get it"--they think source is the root of all evil (like DD) or worse, just generally pointless. So we need to do a lot of educating.

Steven Schwartz - The SAN Technologist

I think it is already there; we've just forgotten about it (source-based deduplication, that is). It is already happening at the application layer in many instances. It occurs in databases based on linked records, and Windows has it built in on certain server platforms for file and print services. Exchange has been using it for a few years.

However, Chuck, you hit it spot on. Source-based deduplication typically limits the data set. There is serious talk about the possibilities of global deduplication. However, I still feel that certain data sets should NOT be deduplicated, and that certain applications should start having this functionality built in, in a more intelligent manner than just realizing that there are duplicate blocks (block sets).

I still consider deduplication a feature in its infancy. I also have the best deduplication theory in development...break every data set into binary, then I can store everything as just a single 0 and a single 1. (Reading the data I'm still working on.)
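
To make the "duplicate blocks" idea above concrete, here is a minimal sketch of fixed-size block deduplication in Python. The names (`BLOCK_SIZE`, `dedupe`, `restore`) and the 4 KB block size are assumptions chosen for illustration, not any vendor's implementation, and an in-memory dictionary stands in for a real block index.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen for illustration


def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, keep one copy of each unique
    block in `store`, and return the list of block hashes (the recipe)."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # store the block only if unseen
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by looking up each hash in the recipe."""
    return b"".join(store[digest] for digest in recipe)


store = {}
data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE  # three identical blocks, one different
recipe = dedupe(data, store)
print(len(recipe), len(store))  # 4 blocks referenced, 2 blocks actually stored
```

In a source-side design this hashing runs on the client, so only blocks the store has not seen need to cross the network; in a target-side design the same logic runs on the storage device after the data has arrived.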

Sudhir Brahma

Is this a subject more in the realm of Applications and the way they choose to store data they generate, rather than being something to do with storage directly? There could be such utilities that implement source side dedupe and applications can subscribe to their APIs and use them when required, in a collaborative way. Doing source side dedupe across LAN and across storage boxes may be ok for some dedicated applications where the location of the base data is always known and well controlled and the application is aware of such dedupe implementations.
Dedupe across LAN/WAN inherently could add some degree of uncertainty: data is no longer autonomous and stored in one reliable location. Instead, a series of links to various data blobs (stored across storage boxes across LANs) is also required to complete the storage picture. These links are vital and need to be stored just as reliably as the base data itself. This loss of autonomy, and the consequent increase in dependence on various other storage entities to construct the storage picture, could be a cause of concern for some applications.
Target side dedupe could be more reassuring, since the storage box is “aware” of and actively involved in providing a consistent storage picture to a consumer/application. Also, for now, target side dedupe is restricted to a storage device, and the scope is probably much less expansive than it is in a LAN.
regards
sudhir.brahma@gmail.com
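
The dependence on links described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: `store_a` and `store_b` are hypothetical stand-ins for two storage boxes, and `recipe` is a made-up name for the series of links needed to reassemble the data.

```python
import hashlib


def put(store: dict, block: bytes) -> str:
    """Store a block under its SHA-256 digest and return the digest."""
    digest = hashlib.sha256(block).hexdigest()
    store[digest] = block
    return digest


def restore(recipe: list, stores: dict) -> bytes:
    """Follow each (store name, digest) link to rebuild the data."""
    return b"".join(stores[name][digest] for name, digest in recipe)


# Two hypothetical storage boxes reachable over the LAN.
stores = {"store_a": {}, "store_b": {}}

# The recipe is the series of links; without it the blocks are just blobs.
recipe = [
    ("store_a", put(stores["store_a"], b"base data ")),
    ("store_b", put(stores["store_b"], b"held elsewhere")),
]

restored = restore(recipe, stores)

# Simulate losing one storage box: the surviving blocks alone are not enough.
stores["store_b"].clear()
try:
    restore(recipe, stores)
    recoverable = True
except KeyError:
    recoverable = False
print(restored, recoverable)
```

Once the second store is emptied the restore fails, which is the concern in a nutshell: the data is only as reliable as the least reliable link or block location it depends on.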

David Vellante

As the saying goes, "Backup is one thing, recovery is everything." Users should make sure they understand their RPO and RTO requirements and ensure the dedupe solutions they choose are aligned with those critical factors. Thanks. - Dave from wikibon.org

W. Curtis Preston

I'm sure it's no surprise I've got something to say about this. First, I'll give you MY answer to your and George's question.

I agree with Scott that the primary reason is inertia. It just takes a long time to turn the direction of the backup ship. (The aircraft carrier I served on had a turning radius of over a mile.)

My other reason is that source dedupe is primarily for remote sites, and most people are focused more on the central sites. They tend to ignore the remote sites, so it causes a greater level of inertia.

My third reason is the flip side of the second reason. It's not JUST that source dedupe really helps back up remote sites. Source dedupe also doesn't play well in large data centers. The backup speeds (and, more importantly, the restore speeds) are simply not YET up to what today's large data centers need, leading them to target dedupe for the data center (where bandwidth also happens to be less of an issue).

Finally, I do feel the need to correct something Scott said in his comment: "We are the only ones really championing source side dedup. Symantec more or less gave up the ghost and went target side with PureDisk."

That simply isn't true. First, there are several smaller vendors that are championing it. You are the only major disk vendor to do so, but that's no surprise since you own Avamar. As to Symantec giving up the ghost, that's complete nonsense. Their technology is equivalent in many respects to Avamar. PureDisk is a source dedupe product through and through, with global dedupe, replication, etc. The "target side" bit is that they chose to prioritize expanding its use as a target-side dedupe for non-PureDisk backups over making it more seamlessly integrated with NetBackup. EMC, OTOH, chose to prioritize making the NetWorker/Avamar relationship more seamless before doing other things. Symantec has not given up the ghost, and I've personally witnessed some very large PureDisk deployments.

Chuck Hollis

Thanks, Curtis!

Chuck Hollis


  • Chuck Hollis
    VP -- Global Marketing CTO
    EMC Corporation

    Chuck has been with EMC for 13 years, most of them pretty good.

    He enjoys speaking to customer and industry audiences about a variety of technology topics, and -- of course -- enjoys blogging.

    He lives in Holliston, MA with his wife, three kids and three dogs when he's not travelling. Chuck enjoys piano, mountain biking, boating and skiing -- in that order.

    Warning: do not buy him a drink when there is a piano nearby.
