Enterprise IT buyers are a justifiably demanding lot, which creates plenty of opportunity for vendors such as EMC to build products that meet their evolving needs.
As you'll see, this new capability makes exploiting Hadoop and large unstructured data sets dramatically more efficient and powerful.
And before too long, I suspect other storage vendors will scramble to offer something very similar :)
What's This All About?
Big data processing and data science are quickly finding their way into more mainstream IT settings. Like the prospectors of yesteryear searching for minerals buried in the ground, this new crowd is searching for "digital wealth" -- a continual stream of fresh insights to power their businesses.
Loosely speaking, large data sets come in two primary flavors: structured and unstructured. And Hadoop has quickly emerged as the toolset of choice for exploiting unstructured and semi-structured data.
Hadoop is actually a rather large collection of tools, but one key storage-related component -- the Hadoop Distributed File System -- is responsible for managing very large datasets and delivering them at very high performance.
"Classic" HDFS has many familiar limitations, providing ample opportunity for EMC and others to innovate. Such is the case with the new native HDFS feature just announced by Isilon.
In a traditional Hadoop implementation, HDFS assumes quintessentially "dumb" storage. Data is typically stored in three copies (two within a rack, one in a separate rack). Many enterprises also keep a "safety copy" of their datasets on more traditional storage, so you're usually talking a 4x multiplier on basic capacity.
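The capacity math above is easy to sketch. A minimal illustration, using the HDFS default replication factor of 3 plus the optional enterprise "safety copy":

```python
def raw_capacity_needed(dataset_tb, replication=3, safety_copy=True):
    """Raw storage consumed by one logical dataset under classic HDFS.

    HDFS stores `replication` full copies of every block (default 3);
    many shops keep one more copy on traditional storage as a safety net.
    """
    copies = replication + (1 if safety_copy else 0)
    return dataset_tb * copies

# A 100 TB dataset really costs 400 TB of raw disk:
print(raw_capacity_needed(100))  # 400
```

Drop the safety copy and you're still at 3x -- the multiplier is baked into the architecture, not a tunable you can cheaply turn down.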
HDFS metadata is managed by a "namenode". It's a well-known single point of failure. Although a secondary namenode can be configured, it's really just a metadata checkpointer: it assists with recovery rather than providing the failover we'd all prefer to see. When the namenode is down (or recovering), work comes to a screeching halt. In larger environments, a namenode failure can take down the entire environment for quite a while.
Data has to be moved into -- and out of -- HDFS environments.
Typically, data is captured using traditional NFS, moved to HDFS where it is analyzed, and -- frequently -- then made available to analysts using Windows/CIFS. Moving a few gigabytes around is no big deal; moving hundreds of terabytes or petabytes can be a real sore spot, especially if you're doing it continually.
And, of course, you're not getting any useful work done while you're waiting for a 36 hour copy job to complete :)
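That 36-hour figure is easy to sanity-check with back-of-the-envelope math. A sketch -- the link speed and efficiency numbers here are illustrative assumptions, not benchmarks:

```python
def copy_hours(dataset_tb, link_gbps=10, efficiency=0.8):
    """Hours to move a dataset over a network link.

    Assumes you sustain only a fraction (`efficiency`) of the nominal
    line rate, which is typical once protocol overhead and contention
    are factored in.
    """
    bytes_total = dataset_tb * 1e12
    bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return bytes_total / bytes_per_sec / 3600

# Moving 130 TB over a 10 GbE link at 80% efficiency:
print(round(copy_hours(130)))  # 36 (hours)
```

And remember: that's one direction, once. Do it continually -- in and out -- and the copy window starts eating your analysis window.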
There's nothing approaching modern data protection in most Hadoop environments: no snaps, no remote replication, etc. That might be tolerable if you can afford a very lengthy data outage; less tolerable if you're counting on the system to deliver useful results day-in and day-out.
Going further, there are no real tiering or archiving capabilities in HDFS, so users have to roll their own.
And -- in the subtle-but-important category -- there's no simple way to independently vary compute and storage performance or capacity. Everything is typically configured in identical storage/compute nodes.
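Here's why that coupling hurts: when every node ships with a fixed ratio of disk to cores, the bigger of your two requirements dictates the node count, and the other dimension is over-provisioned. A minimal sketch -- the per-node specs are made-up illustrative numbers, not any vendor's actual configuration:

```python
import math

def nodes_needed(storage_tb, compute_cores, tb_per_node=24, cores_per_node=16):
    """Nodes required when storage and compute ship in one fixed box.

    You must satisfy BOTH dimensions, so the larger requirement wins
    and the other dimension is simply over-bought.
    """
    for_storage = math.ceil(storage_tb / tb_per_node)
    for_compute = math.ceil(compute_cores / cores_per_node)
    return max(for_storage, for_compute)

# A storage-heavy workload: 960 TB of data but only 160 cores of compute.
n = nodes_needed(960, 160)
print(n, "nodes ->", n * 16, "cores bought for a 160-core need")
```

In this example you buy 40 nodes (640 cores) to satisfy a 160-core workload -- 4x the compute you need, just to get the disk. Decouple the two and the waste goes away.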
Choose wisely :)
There's more, but you get the idea. When you consider Hadoop -- and HDFS -- in enterprise settings, there's plenty of opportunity to do things in a vastly better way.
The Isilon Solution
If you're not familiar with Isilon, maybe you should be. It's the leading and fastest-growing scale-out NAS product in the market, finding plenty of homes in both industry-specific and (more recently) enterprise settings. It's architecturally different from traditional NAS products from EMC (VNX) as well as NetApp and others.
With native HDFS support in OneFS, Hadoop works directly against the same Isilon cluster that holds the data. No copying. That's big.
All the different Isilon protection flavors are available. No more 4x copies -- unless that's what you *really* want. Spending less on redundant copies means you can field far larger usable capacity with the same budget.
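The economics are straightforward. A sketch comparing full replication against N+M erasure-coded-style protection (the 16+2 scheme below is an illustrative assumption, not a specific OneFS configuration):

```python
def overhead(n_data, m_protection):
    """Raw-to-usable capacity multiplier for N+M style protection.

    N data units carry M additional protection units, so each usable
    unit costs (N + M) / N units of raw disk.
    """
    return (n_data + m_protection) / n_data

# Triple replication is just 1+2 -- three full copies:
print(overhead(1, 2))   # 3.0
# An illustrative wide-stripe 16+2 protection scheme:
print(overhead(16, 2))  # 1.125
```

Same ability to survive failures, but roughly 1.1x raw capacity per usable terabyte instead of 3x (or 4x with a safety copy). That difference is the "far larger storage farms with the same budget".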
Any Isilon node can function as a namenode on behalf of the cluster. That means that failover semantics are pretty much as you'd expect -- better availability with far less hassle.
And, of course, all the Isilon local and remote replication capabilities are inherited. Real, enterprise-class data protection if you need it. The popular Isilon performance and capacity management tools "just work". And so on.
Because storage is separate from compute, administrators can "tune" storage capacity and performance separately from processing performance. Less waste of one or the other, depending on the workload.
All as a simple software feature. Very cool indeed.
What This Means
Consider the traditional Hadoop build-out: lots of independent storage bricks, lots of copies of data.
You're the one in charge of design, integration, support, maintenance, capacity, performance, etc.
The classic "one man band" :)
You now have a new, attractive approach you can consider -- creating a large, scale-out, self-managing and self-optimizing pool of "file capacity" that's transparently shared between intake (NFS), processing (HDFS) and analyzing (CIFS).
Even the hard-core Hadoop shops have looked at what we're doing, and often cast a wistful eye -- if only we'd made this available before they invested in all that gear.
We do take trade-ins, folks :)
The Greenplum Connection
As part of the new Greenplum UAP (Unified Analytics Platform), Greenplum HD offers an enhanced enterprise distribution of Apache Hadoop.
That means that the new Isilon capability was developed with real-world knowledge of how proficient users actually use Hadoop. It means that EMC can offer enterprise-class support for both storage and software.
And for customers who prefer a one-throat-to-choke approach, the Greenplum data computing appliance offers a complete, turnkey solution based on the very latest technologies.
I've found that when executives get the data science bug, they want to move fast. And we're prepared to help them do just that.
Do You Want This On Your Next Storage Array?
Given that most storage purchases sit on the floor for three years or more, I'd encourage you to look out a bit further.
One thing I like about this new capability is that it essentially "future proofs" your investment in file system capacity -- should you wake up one morning and find yourself in a meeting to discuss a new Hadoop project :)
From a pure storage administration perspective, Hadoop (and HDFS) is no big deal if you're using an Isilon array. It's just another access mechanism to the exact same data.
Nothing really changes in the environment. Nothing new to buy. Nothing new to do. Current Isilon customers get native HDFS support at no charge as part of OneFS 6.5.
It just works.
If you follow the storage industry like I do, you know there's been plenty of back-and-forth over the years about what "unified storage" might really mean.
Regardless of the past, the future is now becoming clearer. It's quickly becoming a big data world that inherently favors purpose-built storage architectures that scale both performance and capacity with ease.
And, perhaps, it's also quickly becoming about the storage protocols and certified software stacks needed in this new big data world.
Like Greenplum UAP, Hadoop and HDFS :)