And then the "blame" commentary started -- whose fault was it, the vendor or the service provider?
Disclaimer: I know absolutely nothing about the details of the situation, other than what Beth shared. And, for the purposes of this post, I'm speaking entirely as myself, and not an Official EMC Representative.
My initial thought: playing the blame game is counterproductive; we're all at fault to some degree. Which brings up the discussion -- what do we collectively need to do to avoid this sort of thing happening in the future?
What Happened -- I Think
If you read Beth's account, the gist of the story is pretty clear.
Customer was delivering e-mail service using a CLARiiON. The CX uses a dual-controller design, as do most mid-range storage products in the market from just about everyone.
One of the controllers failed, and shifted the workload to the survivor, as it was designed to do. But there was too much workload for the surviving controller to keep up: email queues backed up, customer email wasn't getting through in a timely manner, and apologies had to be made all around.
Point 1 -- Products Shouldn't Fail, Right?
Yep, I can't argue with that one. We as vendors should strive to make hardware and software that is as reliable as possible, while being cost-effective, performant, standardized, etc.
But, as anyone in this business knows, things can go wrong, and often do. So you build redundancy and failover into your products and your architectures, in case the inevitable happens. Dual fans, power supplies, RAID, controllers, paths, etc. -- because things do fail.
And for all of you competitive types who enjoyed a spirited round of adolescent poo-flinging, I think you're missing a key point: technology products occasionally do fail.
All of them.
Even yours.
Point 2 -- The Customer Should Have Designed For This, Right?
Easy to say, harder to do. Workloads can grow and grow, such that what might have been OK on Day 1 might have been in the failover danger zone on Day 180.
I mean, let's face it. You can connect a whole bunch of servers and a whole bunch of capacity to a CX array, and drive an enormous amount of I/O between the two. Failover sizing might have been adequate going in, but the application grew, and more servers and more I/O were added, perhaps without revisiting the basic design assumptions.
This sort of thing happens all the time. How about your environment?
And don't get me started about all the environments we've seen that haven't even done simple things like configure redundant pathing, or -- ahem -- naively think that RAID and sync replication eliminate the need for a backup.
By comparison, their error was relatively minor -- although the impact was unfortunately large.
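To make the failover sizing concern concrete, here is a minimal sketch in Python. It assumes two identical controllers whose load simply adds up when one fails, which ignores any failover overhead. The function names, the thresholds, and the utilization figures are all hypothetical and are not taken from the incident or from any EMC tool.

```python
def surviving_load(util_a: float, util_b: float) -> float:
    """Utilization the survivor would see if it absorbed both workloads.

    Assumes two identical controllers and that load adds linearly,
    which is a simplification (failover overhead is ignored).
    """
    return util_a + util_b


def failover_status(util_a: float, util_b: float,
                    yellow: float = 0.70, red: float = 1.00) -> str:
    """Classify failover headroom; thresholds are illustrative, not vendor guidance."""
    combined = surviving_load(util_a, util_b)
    if combined >= red:
        return "red"
    if combined >= yellow:
        return "yellow"
    return "green"


# Hypothetical growth: comfortable on Day 1, in the danger zone by Day 180.
print(failover_status(0.25, 0.20))  # Day 1
print(failover_status(0.55, 0.50))  # Day 180
```

The point of the sketch is that each controller can look healthy on its own (55% and 50% busy) while the pair has already lost the ability to survive a failure.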
Point 3 -- Should The Product Have Alerted The Administrator?
OK, safe failover headroom was exceeded, and that's not good. Perhaps the array should have spotted the problem, and made a fuss.
Well, it does. Performance monitoring and the associated alerting are pretty easy to set up on all EMC storage products, but they imply a process: someone actually has to be watching for the alerts.
I have no way of knowing whether this was configured or, if it was, whether anyone was really watching.
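As an illustration of the kind of rule such alerting might apply, the sketch below flags a threshold only when it is exceeded for several samples in a row, so a brief spike does not page anyone. This is a generic Python example under my own assumptions; `sustained_breach`, the 0.85 threshold, and the sample values are hypothetical and do not describe how any EMC product works.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     min_consecutive: int) -> bool:
    """Return True if `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical combined-utilization samples, one per polling interval.
spiky = [0.60, 0.90, 0.50, 0.90, 0.50]
sustained = [0.60, 0.90, 0.95, 0.92, 0.50]

print(sustained_breach(spiky, threshold=0.85, min_consecutive=3))
print(sustained_breach(sustained, threshold=0.85, min_consecutive=3))
```

Whatever the rule, it only helps if the resulting alert reaches a person who acts on it.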
Point 4 -- Should EMC Have Designed The Environment To Avoid This?
OK, I'll take the blame for this -- sort of. We pride ourselves on putting a lot of thought into our customer designs. I'd argue that we're really, really good at it as well.
But not everyone is 100% sure of how their application will grow over time -- unfortunately, we're not psychics. And, let's be honest, not everyone necessarily wants to pay for the redundancy we like to put into our designs.
We don't always get to engage directly, either -- with products such as the CLARiiON, most of this stuff moves through the channel. Somebody calls up one of our partners, says that they want to buy one of our products, and one gets sold -- and a lot of product gets sold that way.
How Can We Avoid These Situations Going Forward?
Let's face it -- this backlog was a major deal for the service provider, not to mention EMC and perhaps our partner if one was involved. And trying to nail the guilty party misses the point entirely.
We -- as vendors -- must continue on our journey to make our products more reliable, do a better job of automatically alerting various parties when key thresholds are exceeded, etc. At EMC, we have a pretty clear agenda -- backed up by considerable investment -- in this regard. I believe we do it better than anyone, which is why so many critical environments use EMC products. That being said, there's always room to do better.
But I think that the people who deploy and run these infrastructures have a certain responsibility as well: practice, practice, practice.
Practice?
Yep, practice.
I don't know about you, but when I was a young kid in school, periodically the fire alarm would go off, and we'd practice getting out of the building and standing in the rain -- even though there never was a fire. I think the idea was to avoid panic and chaos should there actually ever be a fire.
Same idea applies here.
I heard one very experienced operations manager put it this way -- "it ain't protected unless you can prove it, and the only way you can prove it is to do it".
He went on to say that, occasionally, he'd fire off an email to his team along the lines of "BANG! Production Oracle is down due to data corruption, you've got 30 minutes to get it back online". Or "POW! Go fail server #42 by pulling the power cord, and see what happens".
You get the idea -- he thought it important to continually exercise the organization's ability to recover from a production problem -- everything from small components failing (like network devices) to big stuff like disaster recovery practices.
You *do* practice disaster recovery, don't you? :-)
Now, you may be reading this, and thinking "what a nightmare", but there's a certain logic to this around routinely exercising the technology -- and the processes -- that are in place to protect the IT infrastructure.
And, as our environments get ever-larger, ever-more-complex, and ever-more-inherently-reliable, the incentive to practice, practice and practice the unexpected goes down proportionately.
It just gets deferred until, well, something bad happens.
And that's something we *all* want to avoid.

Chuck, when I saw this hit the fan I thought "oh man, here we go" and in fact I was sympathetic because... well, bad stuff happens to all of us from time to time. It's how you learn from it and react to it that defines your character.
I think your response to this is spot on and classy. Hopefully everyone in the industry will take this as a lesson learned and not an opportunity to trash a competitor.
Posted by: John Dias | April 29, 2010 at 10:53 PM
All valid points, Chuck. I'm one of your value added resellers. All I can say is that I'm not surprised by this. Whether intended or unintended, people take risks, sometimes too much risk.
It takes a village to get and keep storage right.
Service providers have a lot of pressure to drive costs (including storage costs) down but they need to know that they are taking risk and then manage it. It's not just EMC or the reseller that have responsibilities. Practice, indeed!
Posted by: Pat Adams | April 29, 2010 at 11:22 PM
John and Pat -- thanks for both of your comments -- they're appreciated!
-- chuck
Posted by: Chuck Hollis | April 30, 2010 at 12:04 AM
I was working as a contractor at a regional bank on a technology refresh. New production, dev&test and DR-site. When I suggested that we practice DR fail-over, etc for their ops staff, the manager was aghast: it would cost too much! Taking people away from their 'day jobs', the cost of the contractors' time, what if something breaks? Some time after cut-over, they did have 'an issue'. I had to fly in super urgently! Why? Because the staff were 'scared' of activating the fail-over scripts. So very sad, and the vendor got blamed!!
Posted by: Joe Svankanski | April 30, 2010 at 10:31 PM
I agree with Chuck's point, but even more than practice, I suggest designing for failover as a standard procedure.
Rather than a primary and a backup, I suggest an "A" and a "B" system, and switching between the two every 3 to 4 months. Just because you can fail over for a few minutes does not mean you can run for a day or more, and if you do have to fail over, I expect it will be for more than a few minutes.
That being said all vendors need to improve their monitoring AND reporting tools.
Posted by: Rick Parker | May 01, 2010 at 12:08 PM
Good post Chuck. As a service provider this incident hit home, and you can bet I did some digging to get some more information on the backstory.
To your comment about "practice", one thing to remember is that (especially for something like SP load) keeping up to date with your FLARE code versions is a GREAT way to test that you have the capacity needed to be redundant. Since that SP failover is part of the upgrade process, you kill two birds with one stone. It should be part of your normal management process on your array anyway, so it's an easy way to keep an indirect eye on your growth, even if you don't have any other monitoring set up, either through the array or through an additional monitoring tool.
Posted by: Jeramiah Dooley | May 02, 2010 at 08:33 PM
Jeramiah -- excellent advice to all -- thanks for sharing!
-- Chuck
Posted by: Chuck Hollis | May 03, 2010 at 09:36 AM
Great post.
Currently, where I work as an Architect, we design failover and redundancy into every solution, be it using EMC or HP Storage. This can be great but is a mass of overkill for some things, but the policy stands as that's the way they have always done it here.
But you would think for something as important as email that someone would have done this, especially in this age where email is seen as one of the top business critical applications. How times have changed from email just being a form of communication to now being something people can't live without.
Posted by: Dominic Cody | May 04, 2010 at 07:54 AM
Hi Dominic
What struck me about this case was that redundancy and failover were ostensibly designed in -- but not continually tested. I suspect that creeping I/O growth moved their design from "green" to "yellow" to "red" over time -- and that's the lesson I took from this.
-- Chuck
Posted by: Chuck Hollis | May 04, 2010 at 08:58 AM
Hi Chuck,
Yes true a very good lesson.
A lot of organisations (mine included) could take a lot from this, in that just because something has redundancy and failover built in doesn't mean you can ignore it and assume all is ok with the world.
Dominic
Posted by: Dominic Cody | May 05, 2010 at 07:15 AM
Chuck, I understand your point and you are absolutely correct - all technologies can fail and will. However the issue I take here is with EMC's practice of throwing mud on other vendors' technologies' behavior in failure situations. Hopefully that practice will stop.
Posted by: Frank Finley | June 03, 2010 at 10:56 AM
Frank
I agree with you, any sort of mud-throwing is frowned upon. But, generally speaking, that's not our practice here.
Can you offer up any examples?
Thanks!
-- Chuck
Posted by: Chuck Hollis | June 03, 2010 at 11:02 AM
Chuck sorry for my slow response - my work takes me on the road a lot and I can't keep up like I used to.
In your request for examples, I have to say I've seen the same from other sources as well. I'm glad to hear that it's not EMC policy to try to make the "other guy" look bad. Specifically I'm talking about the "Storage Anarchist" blog about IBM's XIV product. It was pretty nasty and my IBM rep tells me that it's not accurate. He does say that like every other technology out there there are instances where problems can occur but I've never heard him talk about EMC failures (though they surely exist) and instead brings the discussion to customer experiences of his products versus technical arguments about the other guy.
I have IBM and EMC in my shop and both are great products. I enjoy your blog when I get the time to follow it!
Posted by: Frank Finley | June 09, 2010 at 10:09 AM
Hi Frank -- thanks for making the time for a response.
Barry can be rather negative towards technology he feels is substandard for purpose. Although his accuracy level is extremely high, what is more subjective is whether or not his observations are relevant for your particular use case.
We both have issues where a product -- any product -- is oversold for a particular use case. Regarding XIV, I'm sure there are many situations where it will do fine, and many situations where it will not.
It's not a "bad" or "good" discussion; it's more of a "what is important to you?" discussion.
But your comments are spot on, so thanks for sharing.
-- Chuck
Posted by: Chuck Hollis | June 09, 2010 at 12:43 PM