EMC today announced a whole slew of updates to our backup, recovery and archiving (BuRA) portfolio. Thanks to 'Zilla and Scott for covering different aspects -- since there's a LOT to cover here.
Among all the goodies, I noticed that RecoverPoint had a slew of interesting new features. And, thinking about it a bit more, I guessed that maybe a few people weren't aware of just how cool this replication technology could be.
You can get all the product detail from the web site, if you're interested. Instead, I'm going to spend my time highlighting a handful of concepts that I find intriguing.
A Lot To Talk About
RecoverPoint is interesting in two ways: what it does, and how it does it. For me, anyway, it represents some of the most advanced thinking around local and remote replication.
RecoverPoint started as the acquisition of Kashya in May of 2006. By September of that year, we had the first EMC-ized version available for sale.
At the time of the acquisition, people tell me Kashya had something like 50 or so production instances. If we stay on track, EMC will finish 2008 with something like 1,000 or more. As far as such things go, this represents unqualified market success, especially for a product with such a distinct approach to a very traditional topic.
It's also somewhat remarkable that we're doing this while offering RP alongside more traditional array-based replication capabilities, like SRDF and Celerra Replicator, which are also doing well.
The Magic Of Continuous Protection
Snaps, clones, BCVs etc. generally represent information state at a given point in time. The more snaps you take, the less exposure you have to losing data that's changed since the last time you made a copy.
RecoverPoint goes one better and logs every update to a block device. This log can be just about anywhere -- on the same array, on an array in the same data center, on an array in a nearby bunker, or at considerable distance.
Many customers seem to go with two simultaneous recovery logs -- one local for operational recovery, one remote for more serious protection.
Arbitrary point-in-time volume images can be reconstructed on demand. These can be presented and used without disturbing either production data, or the continuous logging process.
I've fallen into the habit of calling this "Tivo for your data center", since you can rewind arbitrarily complex applications to any given point in time to either recover, or perhaps do a bit of forensics.
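To make the "rewind" idea concrete, here's a minimal conceptual sketch: start from a baseline copy and replay logged writes up to the moment you care about. This is not RecoverPoint's actual internals; the names `JournalEntry` and `image_at`, and the in-memory dictionary standing in for a volume, are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    """One logged write (hypothetical structure, for illustration only)."""
    timestamp: float  # when the write happened
    block: int        # block address on the volume
    data: bytes       # contents written


def image_at(baseline, journal, point_in_time):
    """Rebuild a volume image as of point_in_time.

    baseline: dict mapping block address -> bytes (state before the journal)
    journal:  list of JournalEntry, in any order
    Returns a new dict; neither the baseline nor the journal is modified,
    mirroring the idea that production and logging are left undisturbed.
    """
    image = dict(baseline)
    for entry in sorted(journal, key=lambda e: e.timestamp):
        if entry.timestamp > point_in_time:
            break  # everything after this happened "in the future"
        image[entry.block] = entry.data
    return image
```

Calling `image_at` with two different timestamps against the same journal gives two independent images, which is the property that makes forensics ("what did this look like at 2:14am?") possible.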
Data Reduction Everywhere
When doing asynchronous log shipping over the WAN or LAN, RP looks for redundant data in the transmission package (e.g., a block that was written and rewritten), and applies more traditional data compression and dedupe as well. Although the effectiveness of this approach varies with the duration of the async window (e.g., 30 seconds, 30 minutes, etc.), most customers experience a 5x-10x bandwidth reduction as compared to uncompressed replication approaches.
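The "written and rewritten" case is easy to picture in code. Here's a rough sketch, assuming the async window is just a list of `(block, data)` writes in arrival order; `fold_writes` and `reduction_ratio` are hypothetical names, and real compression and dedupe are left out entirely.

```python
def fold_writes(writes):
    """Collapse one async window down to the last write per block.

    writes: iterable of (block, data) tuples in arrival order.
    Returns a dict of block -> data holding only what must be shipped.
    """
    latest = {}
    for block, data in writes:
        latest[block] = data  # a later write to the same block replaces the earlier one
    return latest


def reduction_ratio(writes):
    """Writes received divided by writes shipped, for one window."""
    writes = list(writes)
    shipped = fold_writes(writes)
    return len(writes) / len(shipped) if shipped else 1.0
```

A longer window gives rewrites more chances to land on the same block, which is one way to see why the effectiveness varies with the duration of the async window.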
The log files are compressed and deduped as well (naturally), but -- with this new release -- they also can be consolidated automatically.
An example might be to run continuous logging for data that's 3 hours old, then consolidate into 30 minute snaps for anything that's less than a day old, then consolidate into daily snaps for anything that's less than a week old, weekly snaps for anything less than a month old, and so on.
And, of course, these snap images are both deduped and compressed as well, if you choose.
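The consolidation example above can be sketched as a small tiered policy. Treat this as an illustration under assumptions, not the product's configuration format: the `TIERS` table, the function names, and the choice of 30 days for "a month" are all mine, and anything older than the last tier is simply left alone because the example only says "and so on".

```python
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY
MONTH = 30 * DAY  # assumption: "a month" taken as 30 days

# (maximum age, snap interval) in seconds; interval 0 means keep every logged point
TIERS = [
    (3 * HOUR, 0),    # under 3 hours old: continuous
    (DAY, 30 * 60),   # under a day old: 30 minute snaps
    (WEEK, DAY),      # under a week old: daily snaps
    (MONTH, WEEK),    # under a month old: weekly snaps
]


def interval_for(age):
    """Return the snap interval for data of the given age, or None past the last tier."""
    for max_age, interval in TIERS:
        if age < max_age:
            return interval
    return None


def consolidate(timestamps, now):
    """Keep every point in the continuous tier and the newest point per bucket elsewhere."""
    kept = []
    seen_buckets = set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        interval = interval_for(now - ts)
        if not interval:  # continuous tier (0) or beyond the policy (None): keep as-is
            kept.append(ts)
            continue
        bucket = (interval, ts // interval)
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            kept.append(ts)
    return sorted(kept)
```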
That sort of capability makes me think of snaps a bit differently. You get something that is ostensibly "better than snaps" at both ends of the spectrum: infinitely fine-grained recovery for very recent activity, with automatic consolidation into traditional snaps for older activity. And, of course, space reduction techniques throughout.
Tres cool.
Implementation Choices
A key component of RP is the "splitter", which makes a copy of written data for logging. This splitter can be located on the server itself, in an intelligent switch, or (more recently) on the array itself, such as the CLARiiON. The CLARiiON splitter approach also turns out to be very convenient for iSCSI and mixed iSCSI/FC environments.
The metadata management and processing flow is accomplished by RP software running on an appliance which is not in the data path. A minimum of two at each side is usually recommended for redundancy purposes, although the architecture can scale up to a substantial cluster of 8. A convenient rule of thumb is that a single appliance can ingest and replicate approximately 50-60 MB/sec of written data.
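As back-of-the-envelope arithmetic, the rule of thumb translates into something like the sketch below. It's a rough estimate only: `appliances_per_site` is a hypothetical helper, it uses the conservative 50 MB/sec end of the range by default, and it doesn't add any extra headroom for an appliance failure beyond the two-appliance minimum.

```python
import math


def appliances_per_site(write_mb_per_sec, per_appliance_mb_per_sec=50.0,
                        minimum=2, maximum=8):
    """Estimate appliances needed at one site for a sustained write rate.

    Defaults come from the figures quoted in the text: roughly 50-60 MB/sec
    per appliance, at least two for redundancy, and a cluster of up to 8.
    """
    needed = max(minimum, math.ceil(write_mb_per_sec / per_appliance_mb_per_sec))
    if needed > maximum:
        raise ValueError(
            f"{write_mb_per_sec} MB/sec needs {needed} appliances, "
            f"more than the {maximum}-appliance cluster limit"
        )
    return needed
```

So a workload writing 120 MB/sec works out to three appliances per site at the conservative rate, while anything under 100 MB/sec still lands on the recommended minimum of two.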
Consistency groups are supported within a single appliance, not across multiple appliances.
The RP appliance has been qualified with EMC and non-EMC storage arrays, so customers have considerable freedom as to what's being protected, or what the target storage might be.
Management Control
The list of capabilities to define, monitor and respond to replication sessions is mind-boggling. Different RPOs and service levels can be arbitrarily defined for different applications, and there's a great degree of control in defining how available bandwidth will be used in a variety of real-world scenarios.
Available storage pools for logging, snaps, etc. can be easily defined, as well as desired behaviors when space inevitably runs out. As an added bonus, RP understands virtual (thin) provisioning.
Now, if we could just teach it to automatically use the spin-down feature in the CX for older snap sets :-)
New And Cool
One of the more interesting features of the 3.1 release is the first round of support for "stretched clusters", the ability to locate the target at stretched FC distances (maybe a data center across town?), and have the server clustering software coordinate the failover of both processing and storage.
The stretched capability is also useful for "buddy" scenarios where you'd like two metro-distance data centers protecting each other.
Going a bit further, there's now official support for "cascading", which extends the scenario above to include a second longer-distance asynchronous hop to a more remote location.
Neither of these capabilities is particularly new -- they've been available with products such as SRDF for a very long time -- but it's nice to see them finding their way to RecoverPoint.
VMware SRM (Site Recovery Manager) support has been available for RP for a while -- very useful and very cool, but not exactly new in this release.
My Wish List?
Although RecoverPoint is a very advanced set of capabilities that are well-implemented and enjoying success in the field, I do have my personal wish list of what I'd like to see. There's always room to do more, right?
First, I'd really like to see the ability to run RP in a virtual machine itself. I think that would give implementors additional options for scenarios that don't necessarily require multiple dedicated appliances, in addition to being able to leverage VMware's HA and load-balancing features.
Second, I think it'd be great if RP was smart enough to exploit a bit more of the dynamic capabilities in EMC storage platforms: adjusting QoS of the array, moving logs and snaps to different tiers, or even automatically spinning down the older and less-interesting stuff.
Third, I'm curious as to whether the 50-60MB/sec ingestion rate is a processor limitation, or a storage limitation. If the latter is the case, I wonder what RP could do with a bit of flash?
The Real Question
We now are seeing signs of many customers preferring the more granular continuous recovery paradigm as opposed to traditional point-in-time snaps. Will this trend continue? Will continuous logging of volume changes be the new standard in acceptable disk-based data protection?
Thanks to space efficiency techniques, one can't make the case that disk requirements are all that different than traditional approaches -- maybe even better, when one considers snap consolidation and dedupe/compression of the logs and snaps.
The feature isn't tied to any particular storage array -- customers are free to use it with whatever they want to, and the RP engine doesn't sit in the data path as is the case with a few of the alternative approaches.
I wonder how many RP instances we're going to have at the end of 2009?

G'day Chuck,
My customers love Recoverpoint - in fact I find it impossible to have a DR discussion without mentioning Recoverpoint and using the same "Tivo for your data center" analogy as yourself!
I think it's going to be interesting to see how the current economic climate affects our positioning of the product - hopefully Recoverpoint Virtual Edition is just around the corner :)
Cheers,
Posted by: Dudley Over | November 18, 2008 at 06:37 PM
Hi Chuck,
If I can add to the wish list.
Why can't we use RecoverPoint to replicate our NAS luns with the ability to recover to any point in time?
Wouldn't it be great to have the same tool replicating FC, NAS and iSCSI luns and maybe add consistency group on top of that?
Posted by: Royi Dankner | November 19, 2008 at 01:39 AM
Chuck, I agree with Royi, it would be great if Recoverpoint could integrate with Celerra but I have a feeling that will be tough since integrating it with Celerra disk pools will be fairly complex.
I also agree with Dudley, I love recoverpoint and I am starting to talk to customers about the new consolidation features in RP 3.1. This is really beginning to blur the lines between replication and backup. Now that we can integrate application consistent snaps using Replication Manager and Recoverpoint, and given that we can meet longer retention requirements, do we really need backup anymore? At some point, I guess there is a limit to the ingestion speed like you mention above, but for many, this could be a total backup replacement. Especially if you are running CLR (CDP and CRR simultaneously). The blog post title is End of Snaps. I like 'End of Backups' :) Yeah, I know this isn't perfect but for some it will work quite nicely.
By the way, we recently used Recoverpoint and SRM to do a datacenter move for a customer. We failed over 40 virtual machines in exactly 40 minutes to a new datacenter with a new san and new esx servers. We only had a 10Mb link but we had 6TB of data. We did the initial sync with both SANs in the same building, then we stretched it and let it catch up. Then one night at about 1am, we pressed 1 button to start the SRM failover, and then we sat back and let everything run. It was incredible.
Jeremiah
Posted by: Jeremiah Cook | January 18, 2009 at 02:22 PM