Yesterday was Atmos and COS (cloud optimized storage) -- essentially storage at internet scale.
Today I'd like to delve back into the flash discussion -- storage at near-memory speed.
I think it's interesting because many of the industry people looking at this issue are travelling down the exact same discussion path that EMC did way back when, long before Jan 2008 when we announced EFDs (enterprise flash drives) for the DMX.
Like any journey, we asked ourselves many of the same questions that are just now coming up in the broader discussion. At that time, we arrived at a consensus view, and continued our journey.
It's now been almost a year since we've had these products available for customers. Not to point out the obvious, but (as of today), the only mainstream arrays that support enterprise flash drives are the DMX and CX series from EMC.
I'm guessing we'll be seeing the very first flash-based offerings from other vendors next year -- which will lead to the inevitable shifting of the discussion from "good idea or bad idea" to "which vendor is doing it better?"
That's how markets evolve.
Flash memory can easily play two roles in a server/storage architecture. You can think of it as "cheap DRAM" that augments the ability of a server or storage array to cache data, or "fast disk" where it directly replaces spinning media.
This one's got some healthy discussion going, because -- in all actuality -- flash can easily play both roles. Strangely enough, many people are looking for the "right" answer. I tend to think of it in terms of "pros and cons".
Flash As Cache
Turning to the "flash as cache" discussion for just a moment, the thinking is that modest DRAM amounts used for storage caching can be beefed up with ostensibly cost effective (albeit slower) NAND.
While this is true, it brings up a few observations:
- Flash is not as fast as DRAM, so the implication is that there will be a level of logic that decides which data lives in DRAM, and which lives in NAND. Those algorithms are important to get the most out of any caching scheme. Whether this function lives on the server, or lives on the array -- these algorithms are not off-the-shelf technology.
- Our experience with storage caching (e.g. a decade-plus of Symmetrix with ostensibly a very large cache) shows diminishing returns as you add cache for many workloads. Sure, things run a bit faster, but at some point spinning disks get involved, and you generally see declining benefits from paying for additional caching.
- If "cache" is to be written to, extra design costs can be involved -- remember discussions around nonvolatile cache? Read cache is simple; write cache takes careful thought to make sure you don't lose changed data in any scenario.
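To make the first point concrete, here is a minimal sketch of the kind of placement logic involved. It is a toy, not how any array or server product actually does it: the class name `TwoTierCache`, the slot counts, and the plain LRU demote/promote policy are all assumptions chosen for illustration. It is read-only, so it sidesteps the write-protection problem from the third point entirely.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy read cache: a small fast tier (DRAM) in front of a larger slow tier (NAND).

    Hypothetical policy: LRU in both tiers. Blocks evicted from DRAM are
    demoted to NAND; a NAND hit promotes the block back into DRAM.
    """

    def __init__(self, dram_slots, nand_slots):
        self.dram = OrderedDict()  # least recently used first, most recent last
        self.nand = OrderedDict()
        self.dram_slots = dram_slots
        self.nand_slots = nand_slots

    def get(self, block, read_from_disk):
        """Return (data, tier), where tier is 'dram', 'nand' or 'disk'."""
        if block in self.dram:
            self.dram.move_to_end(block)  # refresh recency
            return self.dram[block], "dram"
        if block in self.nand:
            data = self.nand.pop(block)  # promote out of the slow tier
            tier = "nand"
        else:
            data = read_from_disk(block)  # miss in both tiers
            tier = "disk"
        self._insert_dram(block, data)
        return data, tier

    def _insert_dram(self, block, data):
        self.dram[block] = data
        if len(self.dram) > self.dram_slots:
            # Demote the coldest DRAM block to NAND instead of dropping it.
            old_block, old_data = self.dram.popitem(last=False)
            self.nand[old_block] = old_data
            if len(self.nand) > self.nand_slots:
                self.nand.popitem(last=False)  # coldest NAND block falls out


cache = TwoTierCache(dram_slots=2, nand_slots=2)
read = lambda block: "data-%d" % block  # stand-in for a real disk read
for block in (1, 2, 3, 1, 1):
    print(block, cache.get(block, read)[1])
```

Even this toy has to answer questions that real designs spend years tuning: when to promote, what to demote, and how much of each tier to buy. That is the sense in which the algorithms are not off-the-shelf.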
Will we see NAND (flash) being used to build server-based storage caches, as well as find their way into storage array cache designs? Most certainly -- it's an interesting ingredient.
But -- from a storage geek's point of view -- cache is cache -- we all know what it can and can't do.
Flash As Storage
This one is far more interesting, because we see more than a few discussions where server vendors are trying to say that flash in a server is equivalent to flash as storage.
Yes and no.
Yes, in much the same way that a bare disk drive sitting in a server enclosure is storage in its simplest form.
However, implemented that way, it can't easily be pooled or shared, it needs some form of data protection (e.g. RAID), it's more difficult to make local and remote copies, it's really hard to tier and make part of an ILM scheme, it's difficult to manage as a separate entity, and so on.
That's why most enterprise storage sold today is external to the server, and not internal to the server.
I've written before that there's a bit more involved in doing this than just shoving an existing EFD into an existing array, especially for those vendors who have a "spindle randomizing" approach to their design.
My Prediction?
Flash as cache will eventually become less interesting as part of the overall discussion -- there are no dramatic differences for most use cases in implementing storage cache with NAND, DRAM or some sort of mix.
Interesting, but not so much.
Flash as storage? Well, that's going to be really interesting, simply because the differences in what applications experience are so spectacularly dramatic. As we've seen time and time again, just moving a few hot spots to an EFD device can make an eye-popping difference in what users experience.
And that's *after* caching has done its job ...
Looking a bit farther out, as EFD prices tumble to points that are less breathtaking than today's (a process already well underway), there will be more and more of a case to put more and more data on these devices.
And, like the disk storage that came before it, you'll see most of it going into external storage.
The Limits Of Big Disks?
There have also been some interesting parallel discussions recently about how disk capacities are ballooning, but I/O rates per device are holding relatively steady.
Throwing more SATA disks at the problem boosts I/O rates somewhat, but at considerable expense -- as many smaller vendors who built offerings on this particular premise are finding out.
Use 2, 3 or more SATA drives to handle a demanding workload, and costs mount. Unused storage. Cabinets to put them in. Power to keep them all spinning. Data center space to rack them all.
Not to mention the uncomfortable optics of people asking difficult questions about all that unused storage.
We've heard stories of customers asking particular vendors about this, and the glib answer comes back -- "disks are cheap". Well, they're not -- not if you really look at what you're doing.
And, even after all of that, there's a hard limit to how fast data can be moved to a SATA drive -- no matter how many of them you use -- a limit that EFDs comfortably blow past.
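The spindle-count arithmetic behind the "unused storage" complaint can be sketched in a few lines. The figures are purely illustrative assumptions (a workload of 4,000 IOPS over 5 TB, and a SATA drive good for roughly 80 random IOPS and 1 TB), not measurements from any product, and `drives_needed` is a hypothetical helper name.

```python
import math


def drives_needed(workload_iops, workload_tb, iops_per_drive, tb_per_drive):
    """Return (drive_count, unused_tb) when a workload is sized on spindles alone.

    You have to buy enough drives to satisfy BOTH the I/O rate and the
    capacity, so the larger of the two counts wins.
    """
    for_iops = math.ceil(workload_iops / iops_per_drive)
    for_capacity = math.ceil(workload_tb / tb_per_drive)
    drives = max(for_iops, for_capacity)
    unused_tb = drives * tb_per_drive - workload_tb
    return drives, unused_tb


# Illustrative numbers only: 4,000 IOPS and 5 TB on 80-IOPS, 1 TB drives.
drives, unused = drives_needed(4000, 5, 80, 1)
print(drives, unused)  # 50 45
```

With these assumed numbers the I/O rate, not the capacity, sets the drive count: fifty drives bought, forty-five terabytes sitting idle, and every one of those drives still needs a slot, power and floor space.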
Now, to be fair, there are many information access patterns that are well-suited to big drives -- and there will always be a ready market for the next-largest drive.
I'm just expressing my view that -- before too long -- we'll see far fewer vendors who try to get adequate performance by throwing lots of cheap drives at the problem, and far more solutions built on the intelligent use of all available technologies -- including EFDs, if I'm not mistaken.
This Is Fun
Why? Because I happen to be lucky enough to work for a company that saw this coming years ago, and had the intestinal fortitude to make the big investments required to bring EFDs to market at least a good year ahead of anyone else.
This isn't about bragging rights (ok, so maybe it is, just a little bit). The real story here is that EMC now has a substantial lead in understanding how these devices work in real-world customer environments, and we're using what we've learned to work on the next round of technologies and integration.
This isn't about SPC benchmarketing.
We'll see if EMC called this one right or wrong in a few years.
That's the fun part -- customers decide, not vendors.
Courteous comments welcome as always!

I have to agree for the most part.
We have a saying over here (maybe you do too)... "the proof of the pudding is in the eating..."
The problem I see with raw "flash as disk" is the user has to decide which one or two applications to put on them. Since after all, they ain't cheap.
The problem with "flash as cache" is that - if it really was that simple we'd all be putting TB of cache in our controllers today.
Somewhere in between, now that's interesting ...
Posted by: Barry Whyte | November 11, 2008 at 05:17 PM
Barry -- yes, very interesting indeed!
Posted by: Chuck Hollis | November 11, 2008 at 06:01 PM
"Flash is not as fast as DRAM, so the implication is that there will be a level of logic that decides which data lives in DRAM, and which lives in NAND. Those algorithms are important to get the most out of any caching scheme. Whether this function lives on the server, or lives on the array -- these algorithms are not off-the-shelf technology."
Barry,
Great discussion, I wanted to point out a couple of related things, specifically to the quote above.
You are absolutely correct, NAND is not as fast as DRAM. It has much higher latency, and the bandwidth per-chip is much lower. However, the second of these two issues - bandwidth - can be compensated for through parallelism.
For example, an array of NAND chips can have bandwidth that's not far off from that of a DRAM memory module - roughly a gigabyte per second when it's done right. Next generation can double that at roughly 2 Gbytes/s. And, if it's on PCIe, which chipsets have tons of, it can scale across multiple modules the same as DRAM does in its multiple slots. For example we get 6GB/s from a handful of our PCIe modules.
Latency, on the other hand, cannot be "fixed" by parallelism. However, in a caching scheme, the latency differential between two tiers is compensated for by choice of the correct access size. While DRAM is accessed in cache lines (32 bytes if I remember correctly), something that runs at 100 times higher latency would need to be accessed in chunks 100 times larger (say around 4KB).
Curiously enough, the demand page loading virtual memory systems that were designed into OS's decades ago do indeed use 4KB pages. That's because they were designed in a day when memory and disk were only about a factor of 100 off in access latency - right where NAND and DRAM are today.
The VM paging subsystems in OS's today do use very sophisticated schemes (MRU/LRU look-ahead, etc.) for determining what should be paged in / out. A lot more focus has been put on this lately as HDD's have steadily become "further away" from DRAM over time. Take for a case in point Vista's ready boot work, which was all about making those heuristics "smarter", so users aren't waiting around for the application to page back in.
There's another important trend that affects this discussion - applications are increasingly being designed to take advantage of multiple CPU cores through multi-threading. What that means is that, while one thread is waiting for a 4K page to come in, other threads can proceed - thus keeping the CPU's busy at all times by the naturally occurring staggering of page loads. This can totally mask the higher access time to NAND. Then it's only a question of bandwidth and miss ratio...
In other words, if I have enough bandwidth to handle say a 25% miss ratio in my NAND tier, and if my application is multi-threaded enough, NAND will appear just as "fast" as the DRAM, because the mechanisms do already exist in OS's today - and are actually not that far from being well tuned.
Don't know if this changes your thinking at all about the problem, but I appreciate the chance to participate in the discussion - feedback welcome.
David Flynn
CTO, Fusion-io
Posted by: David Flynn | November 14, 2008 at 10:42 PM
Are drive manufacturers looking at EFD too? The hot-spots can be moved to these (like they do with existing onboard cache) but with reduced risks arising from power outages. Ditto for RAID controller HBAs and reduced headaches with cache battery.
regards
sudhir.brahma@gmail.com
Posted by: Sudhir Brahma | November 16, 2008 at 01:30 AM
Yes, some of the drive manufacturers are looking at hybrid devices.
I think the interesting part will be the relative effectiveness regarding progressively larger "domains of optimization" -- are we optimizing over a single drive, a group of LUNs, an entire array, a group of arrays, multiple data centers, and so forth.
And, not to push it, but one useful way of thinking about Atmos is "optimizing over a really, really big domain".
Posted by: Chuck Hollis | November 17, 2008 at 11:54 AM
I came to that blog thinking that you were speaking about flash design. As far as I can see, it is mostly devoted to flash as a way to store information.
http://astra-design.com/flash_design.html
Posted by: Flash Designer | December 01, 2008 at 07:56 AM