Memes come and go in this industry.
One popular meme making the rounds is the Doeswijk Data Model, as described by my esteemed colleague and competitor Hu Yoshida over at HDS.
Now, I don't have problems with simple models to illustrate a concept or two, but every model (and analogy!) has its limitations.
In this case, I've now started to see a few people use this model as a basis for creating their overall storage strategy, and the results -- well -- haven't been pretty.
I do want to thank Hu for sharing the model. And while it has real strengths, I also want to highlight a few key intellectual flaws that can lead you astray if you're not careful.
Rather than reprint graphics here, it's easy enough to describe the mental picture.
Imagine an x,y,z coordinate system. One axis is "production data". The second axis is "replica data". And the third axis is "archive data".
The idea is that -- as production data grows linearly -- overall storage requirements grow as a cubic function.
That means "really, really fast".
And, according to this model, you'll need storage capacity that's greater than all of it.
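If it helps to see the arithmetic, here's a minimal sketch of two ways you could read the model. The replica and archive multipliers below are my own illustrative assumptions, not part of the model.

```python
# A minimal sketch contrasting two readings of the cube model.
# The multipliers are illustrative assumptions, not benchmarks.

def total_capacity(production_tb, replica_factor=2.0, archive_factor=1.5):
    """Additive reading: total capacity is the sum of the three categories."""
    return production_tb * (1 + replica_factor + archive_factor)

def cube_volume(production_tb, replica_factor=2.0, archive_factor=1.5):
    """Literal reading of the cube: volume = x * y * z, hence cubic growth."""
    return production_tb * (replica_factor * production_tb) * (archive_factor * production_tb)

for p in (10, 20, 40):  # production data doubling twice
    print(f"{p} TB production -> {total_capacity(p):.0f} TB (sum), "
          f"{cube_volume(p):,.0f} 'units' (cube volume)")
```

Under the additive reading, capacity grows in step with production data; only the literal volume-of-the-cube reading gives you cubic growth.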
So far, no real harm -- until you take the next step ...
From Model To Strategy?
Nothing wrong with pretty pictures -- most of the time.
Unfortunately, one customer I met had taken this generic cube and attempted to partition it into different storage domains, e.g. "here's the platform we'll use for production," "here's the platform we'll use for replicas," and so on.
I don't think that line of thinking will get you very far.
Depending on how you define these three terms, you may be better off with a single storage platform that does all three, specialized versions for all three, or any other variation.
As an example, mission-critical OLTP and big video files that people only occasionally access can both be described as "production data". The same goes for "replication" and "archiving".
It's not that the distinction is wrong; to my way of thinking, it's just not a very productive way of describing things.
Another customer jumped to the key insight that -- by reducing production data -- you automatically reduce replicas and archive copies as well.
Yeah -- sort of.
But if the goal is overall storage capacity reduction, there's so much more to talk about.
For example, not all production data requires replicas and/or archives, does it? Going through exactly what gets replicated and archived (another form of the tiering and service catalog discussion) is an equally valid way to reduce overall storage volume.
And -- hey -- while we're at it, there's a rich "enabling technology" discussion that doesn't fall out of this model. For example, data deduplication applied to backups and archives can yield stunning capacity reductions for very little effort. Ditto for using enterprise flash drives instead of short-stroking multiple Fibre Channel drives. Or using virtual provisioning to cut down on unused capacity.
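To make that concrete, here's a hypothetical back-of-envelope calculator for two of those levers. The ratios are illustrative assumptions, not measured results.

```python
# Hypothetical capacity-reduction arithmetic for two enabling technologies.
# All ratios are illustrative assumptions, not vendor benchmarks.

def deduped_backup_tb(raw_backup_tb, dedupe_ratio=10.0):
    """Deduplication: backup streams often shrink dramatically; assume 10:1."""
    return raw_backup_tb / dedupe_ratio

def thin_provisioned_tb(allocated_tb, utilization=0.4):
    """Virtual provisioning: physical capacity tracks what's actually written."""
    return allocated_tb * utilization

print(deduped_backup_tb(200))    # 200 TB of raw backups -> 20.0 TB on disk
print(thin_provisioned_tb(100))  # 100 TB allocated -> 40.0 TB consumed
```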
There's more, but I think you get my point. It's not that the model is wrong, or bad -- it just doesn't seem to lead you anywhere useful.
Is There A Better Model?
Yes -- there are two actually. One is top-down, the other is bottom-up. Let's start from the infrastructure and work upwards.
For the last several years, EMC -- as well as a few other vendors -- has been promoting this idea of a service catalog for storage.
They're relatively simple charts -- although not as simple as the model discussed here (sorry).
The key concept is to create multiple service classes for storage (also referred to as tiers and usually named after metals: gold, silver, bronze, tin, copper, etc.).
Specify which attributes matter most in each category: performance, availability, recoverability, retention, etc. Don't make the fatal mistake of naming specific products or technologies; stick to external attributes.
Now map all your different storage requirements to each bucket. One or two buckets is too few, ten is probably too many.
Calculate monthly costs for each service class using reasonable industry benchmarks -- we've got those, if you don't. That's now your "storage price list" for your internal customers. Even if you don't explicitly charge back, you've given your users the information they need to make an informed choice balancing needs against costs.
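As a sketch, a service catalog and its price list can be as simple as a lookup table. The tier names, attributes, and per-TB costs below are purely hypothetical.

```python
# A minimal sketch of a storage service catalog as a lookup table.
# Tiers, attributes, and per-TB monthly costs are hypothetical examples.
# Note: attributes describe service levels, never specific products.

service_catalog = {
    "gold":   {"performance": "high",   "availability": "99.999%",
               "recoverability": "continuous replica", "usd_per_tb_month": 50},
    "silver": {"performance": "medium", "availability": "99.99%",
               "recoverability": "daily replica",      "usd_per_tb_month": 30},
    "bronze": {"performance": "basic",  "availability": "99.9%",
               "recoverability": "weekly backup",      "usd_per_tb_month": 12},
}

def monthly_cost(tier, tb):
    """The internal 'price list': what a storage request costs per month."""
    return service_catalog[tier]["usd_per_tb_month"] * tb

print(monthly_cost("gold", 10))    # 10 TB of gold -> $500/month
print(monthly_cost("bronze", 10))  # the same 10 TB on bronze -> $120/month
```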
Now building the storage environment gets much easier: your job is to build an infrastructure that delivers the required service levels at the best overall cost -- capex and opex.
When new whiz-bang technology comes around, it's a lot easier to evaluate it in the context of your service catalog. A potentially complicated discussion becomes wonderfully straightforward, if you think about it.
The same service catalog approach can be applied to business continuity, backup -- all sorts of storage-related disciplines.
And -- trust me -- once business users are made well aware of the true costs of their choices, it'll drive some interesting and productive discussions regarding how they use storage.
Many customers I work with have now taken this foundation and started to work the problem from the other angle: top-down.
This usually takes the form of IT sponsoring an "information governance board" to help the business hammer out the policy tradeoffs between costs, risks and value generation.
Armed with the true costs of various de-facto information management policies (e.g. email deleted after 30 days, files kept forever, useless backups kept for 7 years, etc.), IT can join forces with finance, legal and business stakeholders to come up with more workable policies (and associated investments) that fundamentally address the root cause of the information storage problem -- poor information management policies.
And that's not IT's fault.
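To see why those costs matter, here's a hypothetical illustration of one such de-facto policy; the numbers are mine, not a customer's.

```python
# Hypothetical cost of keeping every weekly full backup for 7 years
# versus a 90-day retention schedule. Figures are illustrative only.

def retained_backup_tb(weekly_full_tb, retention_weeks):
    """Capacity held by weekly fulls over the retention window (no dedupe)."""
    return weekly_full_tb * retention_weeks

seven_years = retained_backup_tb(weekly_full_tb=5, retention_weeks=7 * 52)
ninety_days = retained_backup_tb(weekly_full_tb=5, retention_weeks=13)
print(seven_years, ninety_days)  # 1820 TB held vs. 65 TB held
```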
I've been collecting some good examples of how these information governance boards function, and some of the newer policies they come up with. For example, you would not be surprised to learn that a draconian email policy (e.g. tiny mailbox quotas) results in very large personal email archive files everywhere. Or that pruning older data out of databases usually results in analysts keeping a copy where IT can't find it.
There's Nothing Wrong With An Illustrative Model
Please, I'm not criticizing anyone here. My only point is that models -- and analogies -- only go so far.
And when it comes to managing information growth and building the supporting storage infrastructure, there's real risk in oversimplifying things.
What do you think?