Good economy or bad, information continues to grow at about 60% every year. Storage media costs are dropping at approximately 30% per year. If you stayed awake in math class, you'll realize that -- unless you take some serious steps -- you'll end up spending more on storage each and every year.
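For anyone who wants to check the math, here is a minimal sketch. It uses only the two rates quoted above and assumes, purely for illustration, that both stay constant from year to year; the function name is made up for this example.

```python
# Relative storage spend when capacity grows 60%/yr and cost per TB falls 30%/yr.
# Constant rates are an assumption made for illustration only.

def relative_spend(years, growth=0.60, price_decline=0.30):
    """Spend after `years`, relative to year 0 (year 0 = 1.0)."""
    return ((1 + growth) * (1 - price_decline)) ** years

for year in range(6):
    print(f"year {year}: {relative_spend(year):.2f}x the year-0 spend")
```

Each year multiplies spend by 1.6 x 0.7 = 1.12, so the bill rises about 12% annually even though the media keeps getting cheaper.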
A lot of people naively assume that vendors such as EMC want customers to buy more storage. Actually, the opposite is true -- for the last 5-7 years, we've been working on technologies and strategies to help customers use *less* storage.
Individually, they're compelling. Collectively, they show the promise of being able to largely flatten the storage expense curve.
And that's a good thing.
Life Would Be Good ...
... if we could teach users to manage their own information -- tell us how important it is, how long it needs to be kept around, when it can be deleted, and so on.
Well, that's not going to happen anytime soon, is it? Sure, as more information ends up in repositories and object stores with richer metadata, we can begin to start tackling this challenge at another level, but for the immediate future, we're just going to be handed big piles of bits, and told to store them as efficiently as possible.
The big idea -- for the immediate horizon -- is to focus on technologies that don't need to be told anything explicitly about the nature of the information being stored. Anything we can do on top of that with metadata and policy is like icing on the cake -- so much the better.
So, let's take our tour, shall we?
Virtual Provisioning
Also known as thin provisioning, the idea is deceptively simple: allocate big virtual chunks of storage to applications and users initially, but only allocate physical storage when it's actually needed. Works really well in a large majority of use cases.
Not only that, most virtual/thin provisioning schemes do wide striping as well, which makes for a decent performance boost in many (but not all!) scenarios.
My take is that virtual (or thin) provisioning is table stakes now in the storage business. Every vendor either has it, or is going to need it really soon.
Not only does it cut way down on allocated-but-not-used storage, it simplifies life for the storage administrator, as well as the people who depend on this person getting the job done.
Some vendors point to the fact that they can reclaim storage if the application shrinks. Nice, but of questionable use, since it's the rare application indeed that gets significantly smaller, rather than significantly larger.
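To make the allocate-on-demand idea concrete, here is a toy sketch. It is not how any particular array implements it: the `ThinPool` class, the extent granularity, and the method names are all hypothetical.

```python
class ThinPool:
    """Toy thin-provisioned pool: volumes advertise a virtual size,
    but a physical extent is consumed only on the first write to it."""

    def __init__(self, physical_extents):
        self.free = physical_extents
        self.volumes = {}  # name -> (virtual size in extents, set of mapped extents)

    def create_volume(self, name, virtual_extents):
        # Creating a volume consumes no physical space at all.
        self.volumes[name] = (virtual_extents, set())

    def write(self, name, extent_index):
        virtual_extents, mapped = self.volumes[name]
        if not 0 <= extent_index < virtual_extents:
            raise IndexError("write beyond the volume's virtual size")
        if extent_index not in mapped:
            if self.free == 0:
                raise RuntimeError("pool exhausted (oversubscribed)")
            self.free -= 1  # physical allocation happens here, on first write
            mapped.add(extent_index)

    def used(self, name):
        return len(self.volumes[name][1])


pool = ThinPool(physical_extents=100)
pool.create_volume("app", virtual_extents=1000)  # 10x oversubscribed
for extent in (0, 1, 1):
    pool.write("app", extent)
print(pool.used("app"), pool.free)
```

The application sees 1,000 extents, while the pool has handed out only the two that were actually written. The flip side is the `RuntimeError` branch: an oversubscribed pool has to be monitored so it does not run dry.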
Active Archiving
The idea here is simple: once information has stopped changing, get it out of the production stream just as quickly as possible. Whether it's files, emails, or anything else, the larger the information repository, the larger the proportion of "cold" information.
Whether that information goes to a special purpose archive platform (such as Centera), or uber-low-cost storage, or even off to the cloud (think Atmos and other cloud storage services), the idea is to aggressively find and move information that's stopped changing.
At the very least, storage vendors should offer file-level archiving as part of their storage stack. Bonus points awarded for having answers for email, SAP, SharePoint et al.
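At the file level, "find what has stopped changing" can be sketched in a few lines. This is only an illustration: the `cold_files` helper is hypothetical, the 180-day threshold is an arbitrary assumption, and modification time is a crude stand-in for whatever a real archiving product tracks.

```python
import os
import time

def cold_files(root, min_age_days=180, now=None):
    """Yield (path, size_bytes) for files under `root` whose contents
    have not been modified in the last `min_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime < cutoff:
                yield path, st.st_size
```

Summing the sizes it yields for a file share, for example `sum(size for _, size in cold_files("/some/share"))`, gives a rough feel for how much capacity is a candidate for the archive tier.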
Fully Automated Storage Tiering -- Flash and SATA
You've probably been hearing about FAST from EMC: fully automated storage tiering. The idea is simple -- just about every production use of information has a small hot spot, and a long, cold tail of information that's infrequently used. The challenge is that the hot spot can move around.
The idea behind FAST is to break an application's storage into chunks, move the popular bits into a small amount of flash, and keep the remainder on ginormous, low-cost SATA drives.
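As a rough illustration of the placement decision only (this is not EMC's algorithm, and the function and parameter names are invented), a chunk-ranking sketch might look like this:

```python
from collections import Counter

def place_chunks(access_log, flash_slots):
    """Rank chunks by access count over one observation window and
    return (flash, sata): the hottest `flash_slots` chunks go to flash."""
    counts = Counter(access_log)
    ranked = [chunk for chunk, _count in counts.most_common()]
    flash = set(ranked[:flash_slots])
    sata = set(ranked[flash_slots:])
    return flash, sata

# Chunk IDs touched during one window (hypothetical data).
log = [1, 1, 1, 2, 2, 3, 4, 1, 2]
flash, sata = place_chunks(log, flash_slots=2)
print("flash:", sorted(flash), "sata:", sorted(sata))
```

Because the hot spot moves, a scheme like this would be re-run on each new window of access statistics and chunks migrated accordingly. Chunks that were never touched in the window do not appear in the log at all and would simply stay on SATA.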
Chad Sakac gave a nice preview of what the technology can do at decent scale at VMworld. If you're a storage geek like me, it's pretty compelling stuff indeed.
EMC will be providing FAST on all of our storage platforms through 2010 via a series of different releases. The bottom line is this: by the end of 2010, people who've deployed FAST will enjoy not only superior performance, but dramatically lower storage costs as well -- the vast majority of their information will be on low-cost, multi-terabyte SATA drives.
Not to mention a dramatically simplified provisioning model.
My prediction is that, within just a few years, the comparative savings using FAST-like technologies will be so dramatic and obvious that it'll be table stakes in the storage business.
Dedupe and Compress
Another big win. By far, the most immediate payback for this technology involves backup, since there's an inordinate amount of replicated data in most backup pools. Data Domain's approach (target dedupe) means that you don't have to change anything in your backup process to gain a big heap of savings. Avamar's source dedupe approach means that you'll never back up the same chunk of information twice. Target or source (or both!) -- the choice is yours.
Production data can be amenable to dedupe and compression savings as well. Note that we use both terms: data deduplication presumes that there's redundant data between objects, while compression looks for savings within an object.
Celerra's dedupe has been in the market for a while, and we're finding that the combined dedupe/compress approach can deliver even more capacity savings than dedupe alone. Again, everything depends on your use case.
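The difference between the two techniques, and why they stack, can be shown with a small sketch. It assumes fixed-size chunks and SHA-256 fingerprints purely for illustration; it does not describe how Celerra, Data Domain, or Avamar work internally.

```python
import hashlib
import zlib

def dedupe(data, chunk_size=4096):
    """Split `data` into fixed-size chunks and store each unique chunk once.
    Returns (store, recipe): store maps digest -> chunk bytes, and recipe is
    the ordered list of digests needed to rebuild the original data."""
    store, recipe = {}, []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)  # redundancy *between* chunks
        recipe.append(digest)
    return store, recipe

def stored_size(store, compress=False):
    """Bytes needed for the unique chunks, optionally compressing each one."""
    return sum(len(zlib.compress(c)) if compress else len(c)
               for c in store.values())  # compression: savings *within* a chunk

def restore(store, recipe):
    return b"".join(store[d] for d in recipe)

data = (b"A" * 4096 + b"B" * 4096) * 10  # highly redundant sample data
store, recipe = dedupe(data)
print(len(data), stored_size(store), stored_size(store, compress=True))
assert restore(store, recipe) == data
```

On this deliberately redundant sample, dedupe alone collapses twenty chunks into two, and compressing those two shrinks them further. Real data will be far less cooperative, so, as noted above, everything depends on the use case.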
A lot has been made recently around capacity savings for the binaries associated with virtual machines (not user data!), especially in VDI environments. VMware's new Linked Clones capability is an excellent way to move the storage problem "up the stack" so to speak in a manner that's simpler and more efficient.
Data warehouse and BI vendors have known for a while that CPU cycles are plentiful and disk bandwidth is scarce. Many of the newer vendors have implemented dedupe/compression schemes to not only save on storage, but dramatically improve performance as well.
And I'm sure we'll see more variations on this theme in the future :-)
Spin-down
For some reason, customers don't focus on this as much as one would think. Maybe it's the natural focus on capex vs. opex, but a disk drive that's completely powered down consumes about as much power as a tape cartridge sitting on a shelf.
If we bump up a level, and assume that the storage array is able to detect the outer reaches of the long tail of cold information (needed to tier it appropriately, dedupe/compress it, etc.), it makes sense to extend this all the way out to disk devices that are spun down and consume almost no power.
I think part of the problem now is that spin-down is relatively hard to use: you have to explicitly assign information to spin-down drives, and manage the process largely manually.
But that won't last for very long. Give it a few quarters, and you'll start to see a greater degree of automation and transparency. Once that happens, I believe that -- yes -- spin-down will become another "table stakes" discussion in the storage business.
Where Does That Leave Us?
A nice set of tools to tame the storage beast:
Virtual provisioning
Active archiving
Fully automated storage tiering
Dedupe and compression
Spin-down
It's hard to argue which one is "best". I see various vendors emphasizing one approach over another.
That's very understandable -- if all you've got is a hammer, go looking for nails.
My view is that we're going to need all of them :-)

"Some vendors point to the fact that they can reclaim storage if the application shrinks. Nice, but of questionable use, since it's the rare application indeed that gets significantly smaller, rather than significantly larger."
Chuck, this is a very, very nice feature indeed. You can really reclaim space with this feature:
-If LUNs have been, e.g., accidentally full-formatted by an admin (saved 15 TB one time)
-When hosts are migrated to the storage array, free space can be reclaimed (e.g., host is using 300 GB / 1000 GB, you'll get back almost 700 GB)
-You can reclaim capacity even on hosts that are moved to a virtualized VMware platform
In real life, it is very common for the usage % of filesystems to be relatively small, and by moving these to a virtualized platform and reclaiming space you can get some substantial capacity savings.
The low usage % of filesystems is very often caused by problematic capacity / application sizing.
Posted by: soikki | September 09, 2009 at 04:45 PM
Soikki
You make some valid points -- any feature that helps recover from human error is a useful one -- that is, until we get to a world where we don't need storage admins formatting storage as you describe :-)
And, you're right, if you didn't start out with virtual provisioning, it'd be nice to reclaim space, but I'd argue that getting to virtual provisioning is far easier with today's technology that lets you simply move LUNs around from old to new.
Your mileage may vary :-)
-- Chuck
Posted by: Chuck Hollis | September 09, 2009 at 05:52 PM
Another important piece that's missing here is search. If I've got a bunch of data with no metadata saying what it is, where it is becomes really important. How do I find the Jones report?
And when the system automatically moves the Jones report somewhere else because it hasn't changed in 6 months, how do I find it?
And when I've deleted it by accident and can't remember where it used to be (did I file it under 'J' for Jones, or 'P' for Prospects?) how do I get it back?
So who already does cloud, and storage, and search?
Posted by: Justin Warren | September 09, 2009 at 06:59 PM
"At the very least, storage vendors should offer at least file-level archiving as part of their storage stack. Bonus points awarded for having answers for email, SAP, SharePoint et. al."
A-freakin-men Chuck. Big time bonus points if they also include answers for legal hold, immutability, retention policy and destruction... all fully automated and policy based, of course! Cradle to grave storage and file systems - that's what I'm talking about (emphasis on grave).
Posted by: John D | September 09, 2009 at 07:51 PM
Hi Justin
Great question, let me clarify a bit.
First, all of the "moves" I'm talking about here are completely transparent to the user view. One example is file virtualization and archiving -- I still see my files where they've always been, it's just that they've been physically moved to cheaper storage.
Same with email archiving -- my personal email box has ~30,000 messages in it (most of them useless!), with the vast majority physically residing on purpose-built archival storage. It's completely transparent to me for the most part.
If you delete something by accident, again that's where either backup or archiving can help. Many environments choose to retain files/messages/etc. after they're deleted, others vigorously purge them to lower certain risks.
My rant on the importance of metadata was covered a few posts back: ("Of Files And Objects" and "The Future Doesn't Have A File System"), you might want to take a look.
Sadly, the only way to find things without metadata is to either (a) know where you put them, (b) inspect objects serially, or (c) use an external tool (like search) to generate external metadata.
Depending on how you frame the question on "who does cloud, and storage, and search", well, there are quite a few partial solutions in the market today, but no real good ones, IMHO.
At least, not yet :-)
-- Chuck
Posted by: Chuck Hollis | September 09, 2009 at 11:59 PM
John D:
"A-freakin-men" -- I guess I'm glad that we at EMC can walk that walk for many information domains, such as eDiscovery.
But, that being said, even with our robust set of current capabilities, sometimes I feel it's only a drop in a very large bucket.
Interesting effect I'd like you to consider, which is what I call "positive storage elasticity". Every time storage media costs drop considerably, people end up using far more than before, with the net effect that they spend more in total on storage than previously.
Put differently, I have yet to see an IT organization use a significant reduction in storage costs to actually spend less on storage. Inevitably, they end up using the savings to store far more than before.
Strange, but true.
Posted by: Chuck Hollis | September 10, 2009 at 12:04 AM
Chuck,
what tends to happen is people get lazier as the cost drops; arguably, they don't store more, often they simply store the same stuff again and again.
And even if they are storing more, how much of that more has any real value? Unfortunately it is hard to quantify Return on Information. I suspect there's some mileage to be had discussing such things and also on basic Information Hygiene.
Posted by: Martin Glassborow | September 10, 2009 at 03:56 AM
Agreed - all of these technologies play a role. There's not a single 'killer app' for more efficient storage.
No one need ask why EMC would want to make storage more efficient. History has shown that the more efficient ($/utilized TB) storage is, the more is needed.
Posted by: Pete Steege | September 10, 2009 at 10:17 AM
Hi Chuck. In my experience price declines are historically steeper than 30% but maybe I'm understating the effects of software.
For my historical planning assumptions with customers I typically use 60%-65%+ TB growth (in normal times) and 37% annual $/TB declines-- driven off of Moore's Law. Since most software is tied to TB's I think the two track together pretty well, but I'm open to suggestions otherwise.
This creates an interesting anomaly in the storage world...that is to say, very high growth in TB but very low growth in revenue. In fact, by my estimates, even your 60% growth rate, which is higher (and more accurate imo) than I've heard many execs in the business use (including JT), will eke out only an ever so slight spending growth.
Academic? Sure...but my point is having watched this space for many years, TB consumption is highly price elastic. If vendors find ways to reduce data consumption and lower costs...users will consume more. Kind of like closet space...there never seems to be too much. Bottom line is I think the industry is short-changing growth expectations.
Here's my math:
http://www.internetevolution.com/author.asp?section_id=654&doc_id=177503
Posted by: Dave | September 10, 2009 at 11:52 AM
You guys over at EMC thought about buying Copan at all?
They seem to have some pretty impressive spin down technology and, from what I read, aren't doing so hot in this economy, so you could probably get them for cheap. My company doesn't have anywhere near the need for those massive quantities of data but I'm sure you could find customers that have those kind of needs. Imagine what you could get integrating a DD box with a Copan array (896TB per rack with 1TB drives). That'd just be insane. I've never used Copan's stuff but thought about them when you mentioned spin down.
As for thin reclaiming I think it's a real useful feature to have. I can't count how many times I rebuilt file systems at my last company to reclaim space because something blew the volume up in size. Eventually I learned to just control things with LVM. And inefficient processes like MySQL's "optimize table", and even "alter table" are especially bad for thin provisioning as it causes MySQL to re-write the entire table out to a new set of file(s).
I have a NAS cluster right now. At the time, the file system wasn't as thin friendly; the result is it's consuming 112TB of raw disk space for roughly 60TB of written data on the file system. Can't wait to reclaim that. If I had thought about it more I would have just allocated less up front since the file system supports dynamic expansion, but schedules were tight and I didn't have time to think about "little" details like that!
I'm still waiting for the software to enable my thin reclamation in my array, should be here sometime soon, or so I'm told...
If you do buy Copan I want a cut for the idea!
Posted by: nate | September 10, 2009 at 05:52 PM
Hi Chuck,
I think the point that we'll spend more on storage strengthens even further when you consider the supporting infrastructure. Ever-rising energy costs hurt infrastructure costs. So the technologies you outline have the additional benefit of greening IT.
Ranjit
Posted by: Ranjit | September 11, 2009 at 02:35 PM