George Crump asks that question on Byte and Switch today, and I thought it'd be an interesting topic to explore for a moment.
Because he has a point -- when you have the luxury of stepping back from the situation, it makes obvious sense -- almost too much sense.
Source-Side Dedupe Explained In Thirty Seconds Or Less
Dedupe and backup are like peanut butter and jelly -- they just seem to go together.
Not only is there a surplus of duplicated data in each sequential backup job, but there's often a ton of duplicated data in what you're backing up.
Target-side dedupe means that you deduplicate the data after it hits the backup device. In this category, you'll find EMC's DL3D, Data Domain, NetApp's just-improved VTL and a few others.
Source-side dedupe means that you've deduplicated before it gets sent over the wire.
When talking about dedupe, domain size matters -- the larger the domain in which you look for duplicated data, the greater the efficiencies. We use the word "global" to describe what Avamar does, for example: it can spot duplicated data from anything it's seen before -- any client, any time.
Now, that wasn't so bad, was it?
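For readers who like to see the idea in code, here is a minimal sketch of source-side dedupe against a single global chunk store. It is an illustration under stated assumptions, not how Avamar or any other product actually works: the `BackupTarget` class, the `source_side_backup` function, the fixed 4 KB chunk size and the use of SHA-256 fingerprints are all hypothetical choices made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks, chosen only for simplicity


class BackupTarget:
    """Hypothetical backup server with one global chunk store for all clients."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes
        self.recipes = {}  # (client, backup name) -> ordered fingerprints

    def missing(self, fingerprints):
        """Return the fingerprints the target has never seen from any client."""
        return {fp for fp in fingerprints if fp not in self.chunks}

    def store(self, client, name, recipe, new_chunks):
        """Keep the new chunks plus the recipe needed to rebuild the backup."""
        self.chunks.update(new_chunks)
        self.recipes[(client, name)] = recipe

    def restore(self, client, name):
        """Rebuild a backup by following its recipe through the chunk store."""
        return b"".join(self.chunks[fp] for fp in self.recipes[(client, name)])


def source_side_backup(target, client, name, data):
    """Fingerprint on the client, send only unseen chunks, return bytes sent."""
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    recipe = [hashlib.sha256(p).hexdigest() for p in pieces]
    by_fingerprint = dict(zip(recipe, pieces))
    needed = target.missing(by_fingerprint)  # only fingerprints cross the wire here
    new_chunks = {fp: by_fingerprint[fp] for fp in needed}
    target.store(client, name, recipe, new_chunks)
    return sum(len(chunk) for chunk in new_chunks.values())


target = BackupTarget()
office_data = b"A" * 8192 + b"B" * 4096

# First client: two identical "A" chunks collapse into one, plus one "B" chunk.
sent_first = source_side_backup(target, "office-1", "monday", office_data)

# Second client holds the same data plus one new chunk: only the new chunk is sent.
sent_second = source_side_backup(target, "office-2", "monday", office_data + b"C" * 4096)
```

The second client benefits from what the first one already sent, which is the "any client, any time" point: the wider the domain the target remembers, the less each source has to ship.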
Use Cases
George is right -- deduplicating data before it's sent over the wire makes all sorts of obvious sense.
Consider, if you will, backing up remote offices. Or any other scenario where you have a skinny pipe.
The difference between locations getting backed up in minutes vs. many hours (or not at all!) will often drive the source vs. target debate. No contest, though -- the economics of source-side dedupe are so compelling it's hard to make a case to even consider target-side dedupe.
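To put rough numbers on the minutes-versus-hours gap, here is a back-of-the-envelope sketch. Every figure in it is hypothetical: a 100 GB remote office, a 10 Mbit/s link, and an assumed 1% of the data being new to the backup target on a given night. It also ignores protocol overhead and the small cost of exchanging fingerprints.

```python
def transfer_hours(gigabytes, megabits_per_second):
    """Hours needed to push the given data over a link, ignoring overhead."""
    bits = gigabytes * 8 * 1000**3
    seconds = bits / (megabits_per_second * 1000**2)
    return seconds / 3600


# All inputs below are made-up illustration values, not measurements.
full_size_gb = 100        # data held at the remote office
link_mbps = 10            # the "skinny pipe"
new_data_fraction = 0.01  # share of chunks the target has not seen before

hours_without_dedupe = transfer_hours(full_size_gb, link_mbps)
minutes_with_source_dedupe = transfer_hours(full_size_gb * new_data_fraction, link_mbps) * 60

print(f"everything over the wire: {hours_without_dedupe:.1f} hours")
print(f"only new chunks over the wire: {minutes_with_source_dedupe:.1f} minutes")
```

Under those assumptions the full transfer takes roughly 22 hours, while sending only the new chunks takes roughly 13 minutes. Real ratios depend entirely on how much of the data actually changes.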
But, when we get to data centers, there might be some different choices to be made.
Some shops have already invested in heavy-duty backup LANs and/or SANs that are doing the job. That investment is a sunk cost -- unless you can see yourself running out of bandwidth in the near-term future, you can evaluate your options a bit more flexibly.
Target-side dedupe for backup has the advantage that it's just plain easy to implement.
Most of the target-side dedupe products work with existing backup products. They present what looks like a disk device, or tape library, and stuff just works. Yes, there's all sorts of intense debate about inline, post-process, or being able to switch between the two. And, of course, the yowling about who's got the more efficient dedupe algorithm.
All part of the fun in the storage biz ...
The net result is that many IT shops that have enough backup LAN/SAN bandwidth can just drop one of these puppies in, and get a quick hit. And, with busy IT people, easy has a compelling advantage.
Maybe too much so in some cases :-)
But All Is Not Grim
EMC's Avamar business is doing spectacularly. Predictably, it does very well in environments like backup over a WAN or bandwidth-challenged LAN. But, in a VMware environment, it works so hand-in-glove with VMware that many IT implementors prefer it to hacking up their existing backup environment to work with VMware.
But, at the same time, the same could be said for EMC's target-side dedupe products like the DL3D. EMC has been in the disk library (aka VTL) business for quite some time, and we now offer an attractive deduped option to these same customers.
The Big Picture
As has been said before -- here and other places -- data deduplication will be a feature, not a product.
Before long, we'll see it just about everywhere. Source-side. Target-side. In local and remote replication scenarios (e.g. EMC's RecoverPoint). Even certain use cases for primary storage, with a few caveats.
Vendors who have broad capabilities and know when to apply which form of dedupe for specific use cases will probably end up doing a better job for customers than vendors who claim that their way is the "only" and "best" way.
I think we've all seen this movie before ...
And, no surprise, it'll continue to be a useful tool for the storage administrator in keeping a lid on rising storage costs.

Let me add two more reasons:
1) Inertia. Never underestimate the unwillingness of backup admins to change. Backup applications are the stickiest applications I have ever seen. People change mail systems more often. So it takes a disruptive event (like VMware) for it to be a real consideration.
2) We are the only ones really championing source side dedup. Symantec more or less gave up the ghost and went target side with PureDisk, and most other vendors just don't "get it"--they think source is the root of all evil (like DD) or worse, just generally pointless. So we need to do a lot of educating.
Posted by: Scott Waterhouse | October 31, 2008 at 06:40 PM
I think it is already there; we've just forgotten about it (source-based deduplication, that is). It is already happening at the application layer in many instances. It occurs in databases based on linked records, and Windows has it built in on certain server platforms for file and print services. Exchange has been using it for a few years.
However, Chuck, you hit it spot on. Source-based deduplication typically limits the data set. There is serious talk about the possibilities of having global deduplication. However, I still feel that certain data sets should NOT be deduplicated, and that certain applications should start having this functionality built in, in a more intelligent manner than just realizing that there are duplicate blocks (block sets).
I still consider deduplication a feature in its infancy. I also have the best deduplication theory in development... break every data set into binary, then I can store everything as just a single 0 and a single 1. (Reading the data back I'm still working on.)
Posted by: Steven Schwartz - The SAN Technologist | October 31, 2008 at 08:04 PM
Is this a subject more in the realm of Applications and the way they choose to store data they generate, rather than being something to do with storage directly? There could be such utilities that implement source side dedupe and applications can subscribe to their APIs and use them when required, in a collaborative way. Doing source side dedupe across LAN and across storage boxes may be ok for some dedicated applications where the location of the base data is always known and well controlled and the application is aware of such dedupe implementations.
Dedupe across LAN/WAN inherently could add some degree of uncertainty: Data now is no longer autonomous and stored in one reliable location. Instead, a series of links to various such data blobs (stored across storage boxes across LANs) are also required to qualify the storage picture. These links are vital and also need to be stored just as reliably as the base data itself. This loss of autonomy and consequent increase in dependence on various other storage entities to construct the storage picture, could be a cause of concern for some applications.
Target side dedupe could be more assuring, since the storage box is “aware” of and actively involved in providing a consistent storage picture to a consumer/application. Also, for now, target side dedupe is restricted to a storage device and the scope is probably much less expansive than it is in a LAN.
regards
sudhir.brahma@gmail.com
Posted by: Sudhir Brahma | November 02, 2008 at 10:03 AM
As the saying goes, "Backup is one thing, recovery is everything." Users should make sure they understand their RPO and RTO requirements and ensure the dedupe solutions they choose are aligned with those critical factors. Thanks. - Dave from wikibon.org
Posted by: David Vellante | November 03, 2008 at 04:38 PM
I'm sure it's no surprise I've got something to say about this. First, I'll give you MY answer to your and George's question.
I agree with Scott that the primary reason is inertia. It just takes a long time to turn the direction of the backup ship. (The aircraft carrier I served on had a turning radius of over a mile.)
My other reason is that source dedupe is primarily for remote sites, and most people are focused more on the central sites. They tend to ignore the remote sites, so it causes a greater level of inertia.
My third reason is the flip-side of the second reason. It's not JUST that source dedupe really helps back up remote sites. Source dedupe also doesn't play well in large data centers. The backup speeds (and, more importantly, the restore speeds) are simply not YET up to the speeds that today's large data centers need, leading them to target dedupe for the data center (where bandwidth also happens to be less of an issue).
Finally, I do feel the need to correct something Scott said in his comment: "We are the only ones really championing source side dedup. Symantec more or less gave up the ghost and went target side with PureDisk."
That simply isn't true. First, there are several smaller vendors that are championing it. You are the only major disk vendor to do so, but that's no surprise since you own Avamar. As to Symantec giving up the ghost, that's complete nonsense. Their technology is equivalent in many respects to Avamar. They are a source dedupe product through and through with global dedupe, replication, etc. The "target side" bit is that they chose to prioritize expanding its use as a target-side dedupe for non-PureDisk backups over making it more seamlessly integrated with NetBackup. EMC, OTOH, chose to prioritize making the NetWorker/Avamar relationship more seamless before doing other things. They have not given up the ghost, and I've personally witnessed some very large PureDisk deployments.
Posted by: W. Curtis Preston | November 10, 2008 at 04:42 PM
Thanks, Curtis!
Posted by: Chuck Hollis | November 10, 2008 at 05:39 PM