This week, rumor has it that NetApp is going to make a big announcement around their new release: ONTAP 8.1.1
I work for EMC, so I'm expected to have an opinion, and not necessarily a positive one. Despite my obvious affiliation, I'm also pretty deep into the storage industry overall (20 years!) and spend a lot of time with customers who use a lot of storage.
And -- in a nutshell -- this announcement is not especially good news. Particularly for their customers who've been waiting patiently for NetApp to close the gap against viable competitive alternatives.
Before long, I can see many of these NetApp customers facing a hard choice: stick it out with their current vendor and hope the situation improves, or start looking at bringing in competitive alternatives. And while we at EMC could sit here and perhaps gloat a bit, that's really not what this is all about in my mind.
Customer storage requirements and technologies are moving so fast that it's becoming more difficult for even a focused storage specialist like NetApp to keep up.
The Background And Disclaimer
I've worked for EMC for 17 years. For several of those years, NetApp has competed vigorously with part of EMC's portfolio, namely the VNX -- previously the CLARiiON and Celerra.
The back-and-forth between the two vendors has provided both endless entertainment as well as occasional annoyance for many.
The two companies' strategies couldn't be more different: EMC with its very broad range of storage and non-storage technologies, and NetApp with its mostly singular focus on a single storage operating system.
While the merits of each approach are debated widely in the investment community, for customers it's a more pragmatic decision: who will I invest in to help solve my challenges?
And, competitive rancor aside, I believe 8.1.1 in its current form is not particularly good news for their customers.
The roots of this go all the way back to November 2003, when NetApp acquired Spinnaker in an effort to bring modern file system DNA into their ONTAP environment. After a few experiments in the 7.x world, the result finally hit general release late last year as 8.1, with some severe restrictions; we're now being presented with 8.1.1 as a release candidate.
And, after spending a few days getting to know what's in this release, if I were a NetApp customer I'd start thinking about Plan B.
Big disclaimer: all of the discussion here is based on a mix of official and unofficial sources. Some of the specifics here may eventually turn out to be incorrect, in which case I'll come back to the comments section and clarify. And, while certain EMC product divisions compete vigorously with NetApp, we all sort of like having a tough competitor to go wrestle with in the marketplace -- it makes everyone stronger in the long term.
The Big Trends
If I were to boil down what ONTAP 8.1.1 is all about, it's an attempt to take ONTAP forward into the modern storage world in three key areas: flash, scale-out and continuous operations.
Flash storage is a performance discussion; scale-out is more nuanced, spanning capacity, performance and operational simplicity; and continuous operations is -- well -- really darn important to certain audiences. None of these are new discussions for EMC or anyone else who's deep into the storage business.
So, let's take a quick look at what's in ONTAP 8.1.1 and how it compares to other alternatives.
The Flash Discussion
Enterprise flash is all about paying less for performance. Old school: use lots of spindles. New school: put your hot data in a flash storage device and watch how things fly.
Flash implementations break into two broad categories: cache and persistent storage. They’re not the same thing.
Flash as cache has definite value: EMC continues to invest in various flavors of flash caching and tiering (FAST Cache and FAST VP on the VNX, VFCache, Project Thunder, etc.). Historically, NetApp invested only in array-based read cache in the form of Flash Cache, formerly known as PAM.
With this announcement, NetApp introduces a new array-side caching construct -- the Flash Pool -- which is different from what they've done in the past.
A NetApp array is composed of one or more aggregates, on top of which volumes and file systems are carved. A Flash Pool is a read-write cache that's bound to a specific disk aggregate in a NetApp array.
This is big for them, because a random write workload presented problems for the classic NetApp filer. Flash Cache (or PAM) wasn't any help -- it was read-only -- and when the modest onboard NVRAM got filled with writes, classic ONTAP would write to each aggregate sequentially with the expected unpleasant outcomes. Of course, we as competitors would point this out as much as possible!
Now NetApp customers can upgrade to 8.1.1 and purchase SSDs to act as a read-write cache in front of their traditional disk drives, avoiding this particular problem. NVRAM destages can now go to SSD, so there's less of a bottleneck.
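To make the destage point concrete, here's a quick back-of-the-envelope sketch in Python. The latency figures are purely illustrative assumptions on my part -- not NetApp specifications -- but they show why flushing a full NVRAM buffer of random writes to SSD hurts a lot less than flushing it straight to spinning disk.

```python
# Conceptual sketch only: rough model of why destaging random writes to an
# SSD-backed cache helps vs. going straight to a spinning-disk aggregate.
# Latency numbers are illustrative assumptions, not NetApp specs.

HDD_WRITE_MS = 8.0   # assumed average random-write latency on a disk aggregate
SSD_WRITE_MS = 0.2   # assumed average write latency on a caching SSD

def destage_time_ms(num_blocks: int, per_block_ms: float) -> float:
    """Time to flush an NVRAM buffer of num_blocks random writes."""
    return num_blocks * per_block_ms

nvram_blocks = 10_000  # writes accumulated before a forced destage
print("destage to HDD aggregate:", destage_time_ms(nvram_blocks, HDD_WRITE_MS), "ms")
print("destage to SSD cache:    ", destage_time_ms(nvram_blocks, SSD_WRITE_MS), "ms")
```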
Case closed? Not really.
Generally speaking, any SSDs you buy under 8.1.1 can only be used for caching -- they can't be used to actually *store* any data in a persistent fashion. As a result, there's no storage tiering function as you'd find in a VNX or any number of other arrays on the market.
Now, to be totally accurate, you *can* use SSDs for storing data under ONTAP 8.1.1 if the entire aggregate is comprised of SSDs, but that ain't tiering, is it?
I'm sure there will be many statements to the effect that you don't really need storage tiering, or perhaps they hope the name of the function (e.g. Virtual Storage Tiering) might throw you off the scent -- but there are more than a few common workloads that clearly benefit from a combination of caching *and* tiering.
If you’ve got diverse workloads, you’ll want access to both approaches.
It gets more problematic when you realize that the expensive SSDs that are used to implement Flash Pool caching are bound to specific aggregates; they're not a shared or pooled resource.
Not a big issue if you've only got one or maybe two aggregates, but anything beyond that and you'll have to make some pretty smart decisions about how you set things up, because it'll be difficult to change things -- reconfigurations, moving data sets around, etc.
In larger environments, that's a double hit if you think about it: one hit on efficiency (SSDs aren't pooled across aggregates), and a second hit on complexity, both at configuration time and later on, when workloads have to be manually rebalanced (e.g. copying data sets around) as access patterns evolve.
Neither issue exists on a VNX, in case you're interested.
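To put a rough number on the efficiency hit, here's some illustrative arithmetic. The aggregate count and hot-data footprints are hypothetical, but the shape of the problem is real: SSD bound to individual aggregates has to be sized for each aggregate's own peak, while a shared pool -- assuming the peaks don't overlap in time -- could be sized much closer to the largest single peak.

```python
# Illustrative arithmetic only: why binding SSD cache to individual aggregates
# can strand capacity compared with a shared pool. Sizes are hypothetical,
# not taken from any NetApp sizing guide.

# Peak hot-data footprint (TB) of four aggregates, peaking at different times of day
peak_hot_tb = {"aggr1": 1.0, "aggr2": 0.8, "aggr3": 1.2, "aggr4": 0.6}

# Per-aggregate binding: each aggregate needs SSD sized for its own peak
bound_ssd_tb = sum(peak_hot_tb.values())

# Shared pool: if the peaks don't overlap in time, the pool only needs to
# cover the largest single peak
pooled_ssd_tb = max(peak_hot_tb.values())

print(f"SSD purchased when bound per aggregate: {bound_ssd_tb:.1f} TB")
print(f"SSD needed if it could be pooled:       {pooled_ssd_tb:.1f} TB")
```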
The Scale Out Discussion
Scale-out environments are fundamentally different than the traditional mid-tier storage environments so many of us are familiar with. Yes, there is a lot more data.
But little things can become big things simply because of that enormous scale. Much discussion has been bandied about in the investment community about how EMC's Isilon represents a serious competitor to NetApp in very large file-sharing environments, and there's definite merit to that line of thought.
What the investment community tends to miss is an important point: the Isilon product is built from the ground-up as a scale-out architecture -- it's not an adaptation. That's why we acquired them.
And, viewed through this lens, you can see where the scale-out features of 8.1.1 "cluster mode" are targeted.
It took me a while to wrap my head around what NetApp is currently doing here, so I'll share with you a quick synopsis of what I understand. The best source I've found is TR-4078 which recounts the best practices as they stand today.
At a low level, a NetApp cluster is an aggregation of nodes arranged in 1+1 failover pairs, as opposed to the N+M protection you'd find in natively architected scale-out designs, such as an Isilon cluster.
Everything looks mostly reasonable for modest-sized file system access, until you start looking at their approach for large data sets. Enter the world of “vServer” and the quintessentially named "infinite volumes". If you want to do seriously large file systems (like an Isilon normally does), there are some interesting restrictions in play.
In ONTAP 8.1.1, to get meaningfully large file systems, you start with a dedicated hardware/software partition within your 8.1.1 cluster. This partition will support one (and apparently only one) vServer, or visible file system. Between the two constructs sits a new entity: the "infinite volume" -- an aggregator of, well, aggregates running on separate nodes.
This "partitioned world" of dedicated hardware, a single vServer and the new infinite volume is the only place where you can start talking about seriously large file systems.
While there's some indication that popular ONTAP features (e.g. snaps, compression) are intended to be supported, there's a meaningful difference between snapping 5TB and 20PB. We'll leave that squishy area alone for now -- as well as compatibility with the extended software portfolio -- until we get more information.
In addition to the well-understood ONTAP overheads around RAID 6 (er, RAID-DP), right-sizing, volume overheads, etc., there's a nice note that you shouldn't fill usable capacity beyond 60% if you expect reasonable performance.
I would view that as a tax on top of a tax.
And, since we're targeting really big file systems, we're talking about a really big tax.
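Here's the back-of-the-envelope math. The RAID-DP and reserve percentages below are rough assumptions on my part -- your mileage will vary -- but the 60% ceiling comes straight from the best-practices note, and it compounds on top of everything else.

```python
# Back-of-the-envelope math on the "tax on top of a tax."
# The overhead percentages are rough assumptions for illustration,
# not official NetApp figures; only the 60% ceiling is from the note.

raw_tb = 1000.0                  # raw capacity purchased

raid_dp_overhead    = 0.125      # assumed: ~2 parity drives per 16-drive RAID-DP group
rightsizing_misc    = 0.10       # assumed: right-sizing, reserves, spares
performance_ceiling = 0.60       # per the best-practices note: stay under ~60% full

after_raid         = raw_tb * (1 - raid_dp_overhead)
after_reserves     = after_raid * (1 - rightsizing_misc)
comfortably_usable = after_reserves * performance_ceiling

print(f"raw purchased:            {raw_tb:7.1f} TB")
print(f"after RAID-DP:            {after_raid:7.1f} TB")
print(f"after reserves/rightsize: {after_reserves:7.1f} TB")
print(f"usable at <60% full:      {comfortably_usable:7.1f} TB")
```

Run with a petabyte of raw capacity and those assumed overheads, you end up comfortably using roughly half of what you paid for.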
While the capacity numbers and associated utilization might somehow be tenable, it looks like admins are going to have to start to pay serious attention if they're concerned about performance.
There's no automatic load balancing against available node and storage resources. Yikes!
The best practices note goes into some detail about how best to set things up when you do an initial data ingest, but the concerns become more real when you start to realize what happens when you add new capacity -- which, of course, is what scale-out is all about.
Typically, when you add new capacity to an ONTAP aggregate, it redirects all new write traffic to that new capacity until it starts to fill, and only at some point reverts to a more randomized approach across all available capacity rather than just the new stuff you added. That means all that traffic gets funneled to a specific node (and its disks) until it hits some watermark -- which may or may not be a concern, depending on your write ingest rate.
And, of course, when you turn around and read it, it all funnels through that same node (and disks) where it was written en masse. NetApp will point to a round-robin access feature across the various nodes, but that doesn't really help if all the interesting data is bound to one or two nodes, or to a limited amount of new storage. Youch!
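If the words don't convey the skew, here's a toy model in Python of the placement behavior as I understand it. The watermark, block sizes and capacities are made up for illustration; the funneling effect is the point.

```python
# Toy model of the placement behavior described above: when a new aggregate is
# added, new writes are steered at it until a fill watermark, so both the writes
# and the subsequent reads of that fresh (usually hottest) data land on one node.
# Watermark and sizes are assumptions for illustration only.
import random

WATERMARK = 0.80   # assumed fill level at which placement goes back to "random"

def place_writes(aggregates, new_aggr, num_blocks, block_tb=0.001):
    """Return a per-aggregate count of where each written block landed."""
    counts = {a: 0 for a in aggregates}
    used = {a: 0.0 for a in aggregates}
    capacity = {a: 100.0 for a in aggregates}    # TB, same size for simplicity
    for _ in range(num_blocks):
        if used[new_aggr] / capacity[new_aggr] < WATERMARK:
            target = new_aggr                    # funnel everything at the new capacity
        else:
            target = random.choice(aggregates)   # only then spread it around
        counts[target] += 1
        used[target] += block_tb
    return counts

aggrs = ["aggr1", "aggr2", "aggr3", "aggr4_new"]
print(place_writes(aggrs, "aggr4_new", num_blocks=100_000))
```

Under those assumptions, something like 85% of the writes land on the newly added node before anything gets spread around -- and that's exactly where the reads will go, too.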
Compare that with the autobalancing capabilities found on Isilon, and you'll quickly appreciate the architectural difference.
Only if performance matters to you, that is.
There's an associated challenge with metadata handling: all metadata in an "infinite volume" partition appears to funnel to one physical location, albeit a redundant one. Again, not a problem if metadata usage (e.g. file directory reads and writes) is within the performance envelope of that particular node. And there isn't any mention of Flash Pools here.
Again, none of this is an issue on an Isilon approach -- metadata is fully distributed just like everything else in the cluster. And Isilon has supported flash drives used as metadata cache for quite a while, thank you.
Finally, we've seen NetApp start to position 8.1.1 with prospects as a really big block device.
Within NetApp, there appears to be some controversy about the merits of this approach in a low-latency, block-oriented environment. As one example, block access clearly isn't supported in the vServer/infinite volume world described above.
But in a more normal cluster, there's some serious node-hopping going on as I/Os are routed from source to target. Not really a big issue for patient file-system protocols, or perhaps even moderate block usage. But as you start talking serious levels of block I/Os, you start looking at potential routes, and you can easily come away with more questions than answers.
Again, not apparently purpose-built for the task at hand.
I feel bad about having to drag you through all this, but it took me some time to understand what the heck was going on here, so I thought it was worth sharing :)
The Continuous Operations Discussion
There are a handful of new features in ONTAP 8.1.1 (e.g. transparent volume migration across controllers) that help lessen the impact of some of the normal disruptions NetApp customers tend to experience, especially in larger settings. That's a good thing. Always room for improvement, yes?
But to brand their current capabilities as real-deal continuous operations (in the full meaning of the term) seems to be stretching things a bit, especially when you compare them against purpose-built environments from EMC and other vendors.
So far, the pitch to customers has been "good enough for less money".
That may or may not be the case, but I think it's better to go with a vendor whose capabilities generally exceed what you're asking for -- you can always dial things back if needed -- rather than one that positions every customer for a compromise.
Is Compromise A Good Thing?
While compromise might be a good thing in the political arena, it's usually not a desirable state of affairs in demanding IT settings.
When I look at NetApp's latest release, I see "compromise" written all over it.
Yeah, NetApp does flash -- sort of.
Yeah, NetApp does scale-out -- sort of.
Yeah, NetApp does continuous operations -- sort of.
Yeah, NetApp supports large block environments -- sort of.
With this release, it's pretty clear that -- from NetApp's perspective -- compromise is OK.
Maybe compromise is a good thing for NetApp, since they're up against a strong competitor who looks at the world very differently.
The real question is -- how many of their customers will agree?