In our little storage blogging world, a spirited (yet ultimately friendly) spat has broken out between Storagezilla and Ruptured Monkey over the issue of RAID 6.
[note: where do they come up with these blog names? I am most desperately uncool by comparison …]
I’m not going to wade in here on the pros and cons, but what I am going to do is step back a bit and talk about how EMC viewed the situation, and what we did about it.
It’s an interesting behind-the-scenes story in the contrast between IT vendors providing features, and IT vendors solving problems.
Context
RAID 6 is another form of parity protection for disks. Sometimes called dual parity, or double parity, it ensures that if two disk drives fail within a single RAID group in a short interval, data can be rebuilt and no data will be lost. By comparison, RAID 5 will help protect you against a single disk failure, but not two.
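For readers who want to see how a second parity block makes a two-drive rebuild possible, here is a minimal sketch of one common textbook construction: a plain XOR parity (P) plus a second parity (Q) computed in the Galois field GF(2^8). This is an illustration under stated assumptions, not a description of any vendor's implementation. The function names, the tiny 4-byte blocks, and the choice of generator 2 with polynomial 0x11D are all assumptions made for the example.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a: int, n: int) -> int:
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse: a^255 == 1 for nonzero a, so a^254 inverts it."""
    return gf_pow(a, 254)


def compute_parity(data):
    """Return the (P, Q) parity blocks for a list of equal-length data blocks."""
    length = len(data[0])
    p = bytearray(length)
    q = bytearray(length)
    for i, block in enumerate(data):
        coeff = gf_pow(2, i)  # each drive gets its own coefficient in Q
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Rebuild lost data blocks x and y from the surviving blocks plus P and Q."""
    zero = bytes(len(p))
    survivors = [zero if i in (x, y) else b for i, b in enumerate(data)]
    p_part, q_part = compute_parity(survivors)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    denom_inv = gf_inv(gx ^ gy)
    dx = bytearray(len(p))
    dy = bytearray(len(p))
    for j in range(len(p)):
        a = p[j] ^ p_part[j]  # equals D_x ^ D_y
        b = q[j] ^ q_part[j]  # equals g^x * D_x ^ g^y * D_y
        dx[j] = gf_mul(b ^ gf_mul(gy, a), denom_inv)
        dy[j] = a ^ dx[j]
    return bytes(dx), bytes(dy)


# Hypothetical four-drive stripe; drives 1 and 3 are then "lost".
blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
p, q = compute_parity(blocks)
damaged = list(blocks)
damaged[1] = None
damaged[3] = None
rebuilt = recover_two(damaged, 1, 3, p, q)
print(rebuilt == (blocks[1], blocks[3]))
```

With only the P block (the RAID 5 case), the two missing blocks would be a single equation with two unknowns; the Q block supplies the second, independent equation.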
Nice trick, but there are tradeoffs, of course. You’ll probably need more capacity for the extra parity information than you'd otherwise need. Maybe a potential performance impact on writes. There’s more complexity to manage, if you’re not careful.
And, of course, it’s just one aspect of a much broader discussion of making sure customers always have access to their data.
Focusing on a single feature and ignoring the broader discussion raises the potential of setting incorrect expectations with your customers, which is not good.
No free lunch here … but a potentially useful feature.
The early discussion
Dual parity schemes are nothing new; they’ve been a topic of discussion in the storage business for at least five years, maybe more.
Single disk failure could be covered by traditional RAID approaches, but we knew that disk drives were going to get bigger, and – as a result – RAID rebuild times would elongate, creating a larger statistical window where a second drive could potentially fail.
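To make the "larger statistical window" concrete, here is a rough back-of-the-envelope sketch. It assumes drives fail independently at a constant rate (an exponential model), which is a simplification, since real-world failures can be correlated. The capacities, the 30 MB/s rebuild rate, the 1,000,000-hour MTBF and the 8-drive group are hypothetical inputs chosen for illustration, not EMC figures.

```python
import math


def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Time to rebuild one drive at a steady rate (1 GB taken as 1000 MB)."""
    return capacity_gb * 1000 / rebuild_mb_per_s / 3600


def second_failure_probability(surviving_drives: int,
                               rebuild_h: float,
                               mtbf_h: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild,
    assuming independent, exponentially distributed failures."""
    return 1 - math.exp(-surviving_drives * rebuild_h / mtbf_h)


# Hypothetical 8-drive group: one drive has failed, seven survive.
for capacity_gb in (73, 300, 1000):
    hours = rebuild_hours(capacity_gb, rebuild_mb_per_s=30)
    prob = second_failure_probability(7, hours, mtbf_h=1_000_000)
    print(f"{capacity_gb:>5} GB: rebuild {hours:5.1f} h, "
          f"P(second failure) {prob:.6f}")
```

Under this model the exposure grows roughly in proportion to drive capacity when the rebuild rate stays flat, which is the trend the paragraph above describes.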
Early on, we recognized that dual parity schemes might be attractive, but it wasn’t entirely obvious to us whether it solved a real problem, or it was just another marketing feature in search of a problem. And we knew there were other ways to solve the dual disk problem than simply doing RAID 6.
The engineering discussion raged back and forth on the alternatives, the pros and cons, the tradeoffs, and so forth. Everyone had an opinion, naturally.
And, keep in mind that if EMC invests in RAID 6, there’s something else we don’t get to do. It’s not whether or not RAID 6 is important; it’s whether or not it’s more important than something else that’s on the list.
Life is all about tradeoffs, isn’t it?
EMC tracks customer outages pretty obsessively – for any reason whatsoever – with statistical precision. We had the data that could tell us exactly what we wanted to know.
But we didn't ask the question you'd expect.
It wasn’t “how many times do we see dual disk failures?”.
It was “what can cause a customer to lose data, and where should we be spending our time?”. There’s a significant difference.
Ask the right question, you’ll have a better chance of getting the right answer.
The results were kind of surprising
Now, the first time you look at overall availability for EMC products, the vast majority of arrays never experience a serious problem, like data loss. Here’s this broad panorama of good stuff, and here’s this itty-bitty, teeny-weeny part of bad stuff.
It’s easy to say “hey, everything’s good – look at all that positive data”.
But you don’t get better that way, do you?
The trick is to ignore the successes, and zoom in on the failures – magnify them so that’s the only thing you’re looking at.
Make the bad stuff fill the PowerPoint slide. Don’t show any of the good stuff.
And then methodically work your way down the list to root-cause the Pareto of problem areas.
By the way, a Pareto list is jargon for a bucketing of symptoms, ranked by how often each one is observed. It’s a simple concept, but a powerful tool to help you focus on what’s important and what’s not.
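As a small illustration of the idea, a Pareto list can be built from a raw incident log in a few lines. This is only a sketch: the cause labels and counts below are placeholders invented for the example, not data from any real outage log.

```python
from collections import Counter


def pareto(incidents):
    """Return (cause, count, cumulative_percent) rows, most frequent first."""
    counts = Counter(incidents)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        rows.append((cause, count, 100 * cumulative / total))
    return rows


# Placeholder incident log: one entry per observed problem.
incident_log = (["cause A"] * 50 + ["cause B"] * 30 + ["cause C"] * 15
                + ["cause D"] * 4 + ["cause E"] * 1)

for cause, count, cumulative_pct in pareto(incident_log):
    print(f"{cause:<8} {count:>3}  {cumulative_pct:5.1f}%")
```

The cumulative column is what makes the list useful: it shows how much of the total problem you have covered once you have fixed everything down to a given row.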
On this particular Pareto list of “what can cause a customer to have a problem?”, many of the problem areas didn’t have anything specifically to do with the array; it was an external issue.
High on the Pareto list of potential problem areas were things like “customer tried to change something and made a mistake” or “dual pathing wasn’t configured properly for an important server”.
Somewhere in the middle of the list were things like “someone did a code upgrade, and something went wrong”, and “fibre loop had a problem, and didn’t recover gracefully”.
And way, way, way down the list – almost statistically insignificant – was dual disk failure in a single LUN group. Even there, there were choices about what to do about the problem, one of which was RAID 6.
I’m not saying it never ever happened, but the things at the top of the list were maybe 100 times more likely to happen, things in the middle of the list were 20 to 50 times more likely, and so on.
What we decided to do
So the EMC team made a rational decision – why don’t we focus our efforts on the things that actually cause customer problems, rather than marketing features that may or may not make an impact?
The counterargument was “hey, everyone else will have this feature, so we’ll be at a competitive disadvantage”.
But cooler heads prevailed, and the investments were made appropriately. We’d focus on things that had hard data behind them. And if that eventually led to us doing RAID 6, so be it.
Make the investment that delivered the best result, right?
What happened
The first cluster of problems centered around customer reconfigurations. It turned out we were letting people hurt themselves in very unusual circumstances, and that wasn’t good.
Everything looked legit when they set it up, but later on, something would happen that in retrospect made the original change look ill-advised.
So we upgraded our management interfaces to have extensive (and updatable) rules checking to make sure that our customers couldn’t inadvertently set up a configuration that could lead to problems down the road.
And the data got better.
One cluster of problems centered around upgrading code to arrays. We found that problems (other than servers being shut down) could be caused during the upgrade window when the array was shut down and brought back.
Our first step was to make code upgrades (as well as hardware upgrades) non-disruptive to running applications. If the customer didn't have to shut down the application, another source of problems would be eliminated.
Today, all EMC storage platforms are NDU (non-disruptively upgradable) for hardware and software. Not only did we make life easier for IT administrators, we eliminated a significant source of data unavailability and potential data loss.
And the data got even better.
Shortly thereafter, we found that, during a code upgrade, we couldn’t always count on having a pristine environment, or on the process being followed correctly.
So lots of investment went into check-ahead, roll-back and bulletproofing the process so that it just couldn’t go wrong, but if it did, it would roll back to a safe state.
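The general shape of check-ahead plus roll-back can be sketched in a few lines. To be clear, this is a generic illustration and not how any EMC upgrade tool actually works. The dictionary-based "system", the two pre-checks and the `apply_step` callback are all hypothetical names made up for the example.

```python
def both_paths_up(system) -> bool:
    """Hypothetical pre-check: refuse to upgrade unless two paths are healthy."""
    return system.get("paths_up", 0) >= 2


def no_rebuild_running(system) -> bool:
    """Hypothetical pre-check: refuse to upgrade in the middle of a rebuild."""
    return not system.get("rebuilding", False)


def upgrade(system, new_version, prechecks, apply_step) -> str:
    """Run every pre-check first; apply the change only if all of them pass,
    and restore the saved state if the change itself fails."""
    failed = [check.__name__ for check in prechecks if not check(system)]
    if failed:
        return "refused: " + ", ".join(failed)
    snapshot = dict(system)  # state to return to if anything goes wrong
    try:
        apply_step(system, new_version)
    except Exception:
        system.clear()
        system.update(snapshot)
        return "rolled back"
    return "upgraded"


def good_step(system, version):
    system["version"] = version


def bad_step(system, version):
    system["version"] = version
    raise RuntimeError("simulated failure partway through the upgrade")


checks = [both_paths_up, no_rebuild_running]
array = {"version": "1.0", "paths_up": 2, "rebuilding": False}
print(upgrade(array, "2.0", checks, bad_step), array["version"])
print(upgrade(array, "2.0", checks, good_step), array["version"])
```

The point of the design is that the risky step never starts in an environment that fails the checks, and a failure partway through leaves the system where it began.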
As an example, on the CLARiiON, we made it so customers could do the code upgrades themselves (safely, predictably, non-disruptively) if they wanted to. Now that’s a bulletproof process.
And the data got even better.
Another cluster of problems arose around servers that should have been dual pathed to the array, but weren’t for whatever reason.
Now, some storage vendors might say “well, that’s not our problem”. But we felt we could help here, so we got busy.
We made a special version of PowerPath available (at no charge) to do simple failover.
We instituted a rigorous health check of our customer environments to identify any single-pathed servers and worked with the IT staff to remedy the problem.
We beefed up our configurators and best practices guides to make it very difficult to single path a server.
And the data got even better.
We started to work our way down the list even more.
We noticed that drive reliability could be an issue. Again, the vast majority of drives are very reliable, but when you threw away the good data and focused on the itty-bitty, teeny-weeny bad stuff, a different picture emerged where we could dig in, and work with the drive vendors to improve the situation.
We also found that better proactive drive diagnostics in microcode could do an excellent job of detecting impending failure (e.g. retries, recals, minor errors, etc.) for some of these more unusual drive failure modes. We found that using intelligent diagnostics combined with a dynamic hot spare proactively shut an important window more cost-effectively than other approaches.
On the CLARiiON the dual fibre loops on the back end would occasionally (and we’re talking very occasionally) have a problem, and the failover protocols wouldn’t recover correctly.
Improved components and software enhancements made the problem better, but didn’t nail it. The answer turned out to be a complete redesign of the back end using a DAE-level point-to-point switch (first announced as UltraPoint in the CLARiiON) that not only absolutely nailed the problem, but gave us a performance boost as well.
And the data started to get really, really, really good.
There’s far more to the story, of course.
I just wanted to illustrate the data-driven approach to solving the ultimate problem, just how much effort it takes, and how it takes you down roads you may have never thought of going down.
How it all ended up
I think we did the right thing.
As an example, right now, we’re starting to promote the fact that current CLARiiONs run at a minimum of 99.999% (five nines) availability.
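For a sense of scale, an availability percentage converts directly into allowed downtime per year. The short sketch below does that arithmetic, assuming a 365-day year; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of unavailability per year implied by an availability figure."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in (("three nines", 99.9),
                   ("four nines", 99.99),
                   ("five nines", 99.999),
                   ("six nines", 99.9999)):
    print(f"{label:<12} {pct}%  ->  "
          f"{downtime_minutes_per_year(pct):8.2f} minutes/year")
```

Five nines works out to a little over five minutes of unavailability per year; each additional nine cuts that by a factor of ten.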
It’s not the result of any specific whiz-bang feature; it’s the result of a comprehensive approach to nailing the dozens or hundreds of issues that can impact availability. Some live within the array; some don’t.
Based on our data, that would make it the most reliable mid-tier array in the industry by a wide margin, and more reliable than certain “high end” enterprise arrays that are being marketed (sometimes even “guaranteed”). EMC’s high-end Symmetrix DMX is even better than that.
But we paid a price. There’s a certain part of the storage market that is obsessed with specific marketing features, rather than results claimed.
As of this writing, we haven’t offered support for RAID 6. I think you may see it before too long.
But, as a result of our decision, I’m sure that every day someone somewhere is being pounded for the fact that EMC doesn’t offer RAID 6 like some of the other guys. And, if and when we do offer it, I'm sure we'll get pounded by some for being late and so on.
Simply put, we made a decision to go after other problem areas first.
As a result, I think we have something more valuable – a storage environment that can be statistically shown to be better at not losing data than other alternatives.
Is there more to do? You bet.
There’s always six nines ….
[update on 9/20/2007 -- for some reason, this article continues to get visited, so I thought I'd update it with what happened since it was written. First the Symmetrix team announced and shipped RAID 6, and recently the CLARiiON team did the same. So I guess it's a moot point -- but the thinking about how we approach these problems hasn't changed -- cph]

Hi Chuck,
That's a really insightful and interesting look into how EMC have approached this. Thanks for taking the time to share it.
I've mentioned it over at rupturedmonkey on my most recent post.
Nigel
PS. I'm a little unclear where the name rupturedmonkey came from but it was previously a security related website before Snig, the current owner, took it over and changed it to a storage website. It's not actually my site, I'm just one of 3 guys who currently blog there.
Posted by: Nigel Poulton | January 10, 2007 at 04:51 PM
Thanks for the overview on how EMC² approaches solving problems. The term "marketing feature" has me puzzled though. Was RAID-S a marketing feature? Or was RAID-5 a marketing feature until you guys adopted it on the DMX?
The name Ruptured Monkey comes from the old saying "getting the monkey off my back". We just decided to rupture the monkey instead of being nice to the monkey. ;) It's all about helping others solve problems.
Posted by: Snig | January 11, 2007 at 07:53 AM
Hi Snig
I'm going way back in history here, but I think the reason they called it "RAID S" rather than "RAID 5" is that the early Symmetrix implementation didn't have distributed parity between the participating logical volumes.
Unlike RAID 5, RAID S had a dedicated parity volume. Since it behaved differently than RAID 5, I think the logic was to call it something slightly different.
If you are trying to extend the statement I made about RAID 6 to RAID 5, I don't think it applies.
Let me take it in the other direction: if protecting against one disk failure (RAID 5) is good, and protecting against two simultaneous disk failures (RAID 6) is good, then -- by deduction -- protecting against three simultaneous disk failures (RAID X?) is good. And four disk failures, and so on.
Now, most people would reject the statement above, right?
Clearly, there are decreasing returns for the costs associated with additional protection. And the hard data we saw between RAID 5 and RAID 6 made us think about the problem differently.
Whether you agree with the thinking or not, I always think it's useful to share the logic that went into the decision.
Thanks for the comment!
Posted by: Chuck Hollis | January 11, 2007 at 08:25 AM
Quick update --
Just recently, EMC announced RAID 6 for the DMX platform. I'm sure at some point we'll see it for CLARiiON as well.
Thanks, all.
Posted by: Chuck Hollis | February 21, 2007 at 07:32 PM
Chuck, I owe you a big vote of thanks. Here's the story:
A week ago, one of my raid1 arrays went south and upon reconnection, updated from the offline drive, wiping a week's worth of data off the good drive. :-{{
So I was busy investigating Raid5 or 6 for myself when I ran across your blog article. Hit the nail on the head, it did! I sat down and thought thru the whole process and here's what I found.
Primary cause: While I had the system open for swapping out a dead DVD, I brushed against one of the signal cables and dislodged it. The SATA connector had worn loose in just a few plugs.
I missed the 'array failure' warning until I rebooted, then went in and replugged the connector, not noticing how loose it had been. During the reboot I picked the default drive to rebuild from, which was wrong.
Net - one hardware failure, one cockpit error yields lost files. The disks were fine. Raid5 or 6 would have made no difference.
So, I am replacing the cheap SATA cables with ones that have snap clips. At $2 each, cheap insurance. Added a note to self - Pay Attention on Rebuilds!
Thanks a bunch,
BillN
Posted by: Bill Nicholls | February 22, 2008 at 12:26 PM
> First the Symmetrix team announced and shipped RAID 6,
> and recently the CLARiiON team did the same.
So what's changed? 18 months ago, RAID-6 was irrelevant and now it's shipping in virtually all EMC products.
Seems like there are three possibilities:
1. The data suddenly changed in favor of RAID-6 (doubtful), or
2. You suddenly realized your analysis, though genuine, was faulty, or
3. Customer and competitive pressures trumped your "data", and you realized your "data-based" approach really wasn't the best way to make product feature decisions.
...or possibly some combination of the above. What's ironic is that behind closed doors, I'll bet you were reading the riot act to EMC engineers for not investing in this technology sooner ;-).
Any of the above true?
Posted by: Josh Savage | August 22, 2008 at 10:58 PM
None of the above.
For both platforms, there was a pareto regarding the most important challenges to go address first. Once the more important ones were solved, the team kept working down the list. And eventually got to RAID 6.
Just like the post says ...
At the time this post was written, certain vendors were making rather reckless statements regarding the superiority of RAID 6, and the inadequacy of RAID 5.
We believed at the time -- and still do -- that there's far more to end-to-end availability than just the choice of RAID.
Except that the particular RAID 6 promoting vendor is now shipping RAID 5 VTLs .... !!
Posted by: Chuck Hollis | August 24, 2008 at 08:52 PM
"So the EMC team made a rational decision – why don’t we focus our efforts on the things that actually cause customer problems, rather than marketing features that may or may not make an impact?
The counterargument was “hey, everyone else will have this feature, so we’ll be at a competitive disadvantage”. "
That is such a lame excuse for a supposed storage leader to leave out a feature that gets used quite a bit.
Posted by: Mark | October 02, 2008 at 08:28 PM
Hi Mark -- the post you're commenting on is quite old, did you realize this?
You're entitled to your opinions regarding "lame excuses" from storage leaders.
Frankly, I think yours is a pretty weak comment, compared to some of the other more insightful ones we've seen here.
Thanks anyway ...
Posted by: Chuck Hollis | October 03, 2008 at 12:09 AM