While both events are interesting, I've been more drawn to the discussions and announcements coming out of Hadoop Summit.
Why? Let's face it -- the new world of big data and predictive analytics is incredibly seductive stuff. And it's hard to have any big data discussion these days without bringing in Hadoop, HDFS and everything that goes with it.
Are We Deep Into The Hype Cycle?
The vendors can't contain their adrenalin, and we're reaching a point where some of the claims are getting a bit outrageous. And a few people are obviously going along for the ride. Not that it hurts to have Hadoop skills on your resume or CV these days.
Credit to @beaker and @stu for this fun graphic.
That being said, there's definitely very cool stuff going on, especially as people find new use cases for the platform in all sorts of interesting ways. I think we've just started to scratch the surface of what the platform (and the associated philosophies) can do.
Cool Data Viz Presentations
If you like brain food and eye candy, you'll like the keynotes at any event discussing big data.
I've now seen enough killer data viz that the effect is greatly diminished for me personally, but it obviously still has its magical effect on many.
Indeed, I think one of the most important skills in this whole domain is going to end up being creating effective data visualizations -- telling a story that engages people through creatively displaying information and relationships.
And that's a skill that doesn't require a PhD in math.
From An Infrastructure Perspective ...
One set of Hadoop trends I'm watching closely is around "Hadoop Becomes Enterprise Ready" -- extending the ecosystem and capabilities so it meets the needs of an average enterprise vs. a specialized research team. And there's a long list of topics that fall under that heading.
A related one is "Hadoop Becomes A Production Application" -- rather than being used mostly as an analysis tool, integrating the Hadoop toolset into the general set of components we use for production applications.
There was much vendor-generated excitement about what's coming in Hadoop 2.0 (Hadoop Summit is sponsored by Hortonworks, which develops, distributes and supports an Apache Hadoop distribution, dubbed HDP).
One new component is YARN -- short for Yet Another Resource Negotiator. At a high level, it's a resource manager / scheduler / pseudo-OS that enables alternate uses of HDFS-stored data in addition to the familiar MapReduce.
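To make the resource-manager idea concrete, here's a minimal, purely illustrative sketch in Python -- this is NOT the actual YARN API, just the core concept: one scheduler hands out "containers" of cluster capacity, so MapReduce becomes just one tenant among several engines sharing the same data.

```python
# Toy sketch of the resource-manager concept behind YARN (not the real API):
# a single scheduler grants "containers" of capacity to heterogeneous
# applications, so MapReduce is one tenant among many.

class ToyResourceManager:
    def __init__(self, total_memory_gb):
        self.free_gb = total_memory_gb
        self.allocations = {}  # app name -> GB granted

    def request(self, app, memory_gb):
        """Grant a container if capacity remains, else refuse."""
        if memory_gb <= self.free_gb:
            self.free_gb -= memory_gb
            self.allocations[app] = self.allocations.get(app, 0) + memory_gb
            return True
        return False

    def release(self, app, memory_gb):
        """Return a container's capacity to the shared pool."""
        self.free_gb += memory_gb
        self.allocations[app] -= memory_gb

rm = ToyResourceManager(total_memory_gb=64)
rm.request("mapreduce-job", 32)      # classic batch workload
rm.request("sql-engine", 24)         # a different engine on the same data
print(rm.request("stream-app", 16))  # refused: only 8 GB left
```

The point of the sketch: once scheduling is pulled out into its own layer, the cluster stops being "a MapReduce machine" and becomes a general pool that any engine can draw from.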
While there's no arguing the need for bringing more sophistication to resource management associated with multi-tenant Hadoop clusters, that's not the only game in town. VMware was quite visible at the show, announcing that Project Serengeti was out of beta, and the associated "Big Data Extensions" would be included with future versions of vSphere at no additional charge.
While I'm sure there is overlap between the two approaches, each brings something unique to the table. When it comes to infrastructure resource management -- compute, memory, storage, network, etc. -- my money's on VMware.
My guess is that we'll see both being used widely before long.
SQL Performance -- The New Battlefield
The growing importance of SQL in the Hadoop environment shows up in two ways. First (and most obviously) there's an enormous ecosystem, application set and trained workforce that knows how to work with SQL. You just can't ignore it -- nor should you.
Second -- and more subtly -- some are starting to see HDFS as a potential replacement for familiar data warehouses, and (gasp!) perhaps even light transactional duty. In this world, HDFS becomes the data substrate where *everything* is landed; the proverbial data lake.
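The pull of that trained SQL workforce is easy to demonstrate. Here's a hypothetical illustration using an in-memory SQLite table standing in for HDFS-resident data -- not HAWQ or Hive, just the idea that the same question is far more natural as one declarative statement than as hand-rolled code:

```python
# Why SQL-on-Hadoop matters: the installed base of SQL skills.
# SQLite stands in for HDFS-resident data here -- illustrative only.
import sqlite3

events = [("us", 3), ("eu", 5), ("us", 7), ("apac", 2)]

# Imperative "roll your own" aggregation:
totals = {}
for region, clicks in events:
    totals[region] = totals.get(region, 0) + clicks

# The same question, asked the way a SQL-trained workforce would ask it:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_totals = dict(conn.execute(
    "SELECT region, SUM(clicks) FROM events GROUP BY region"))

assert totals == sql_totals  # same answer, far more familiar phrasing
```

Multiply that one-liner by decades of tooling, BI applications and trained analysts, and it's clear why SQL performance on HDFS has become a battlefield.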
My colleagues at Pivotal threw the performance gauntlet down when they announced Pivotal HD, which included HAWQ -- and blew everyone away in the SQL-on-HDFS performance realm. Much vendor noise inevitably resulted.
The response is a new open-source project (Stinger) which tries to do some of what HAWQ does today with regards to SQL performance against HDFS data. I'm a bit skeptical about its chances of success -- there's an awful lot of domain-specific intellectual property wrapped up in HAWQ. Besides, if people are willing to spend big bucks for flash, they're also willing to spend big bucks to get superior SQL performance.
A Storage Discussion?
More interesting to me personally were the discussions of filesystem-like capabilities that were being worked on as part of Hadoop 2.0 and beyond: alternate protection mechanisms, snaps, etc.
From a storage geek's perspective, there's a lot of room for improvement in how HDFS goes about its business. Do a quick comparison between bog-standard HDFS and, for example, Isilon's HDFS-over-NAS implementation, and it's not even a fair fight.
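To see why "alternate protection mechanisms" matter, a bit of back-of-envelope arithmetic helps. Stock HDFS protects data by keeping three full copies; erasure coding is one commonly discussed alternative. The parameters below are illustrative, not any vendor's spec:

```python
# Back-of-envelope: raw-capacity overhead of triple replication vs. an
# erasure-coded stripe (illustrative parameters, not a vendor spec).

def replication_overhead(copies):
    """Extra raw bytes consumed per byte of user data."""
    return copies - 1.0  # 3 copies -> 2 extra bytes per byte stored

def erasure_overhead(data_blocks, parity_blocks):
    """Overhead for a (data + parity) erasure-coded stripe."""
    return parity_blocks / data_blocks

print(replication_overhead(3))   # 2.0 -> 200% overhead
print(erasure_overhead(10, 4))   # 0.4 -> 40% overhead
```

A 200% capacity tax vs. something closer to 40% is exactly the kind of gap that makes storage people pay attention -- and exactly where mature storage stacks have a head start.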
While I think it's great that the community is talking up the need for a better HDFS implementation, we have yet to see an open source storage stack make a serious impact in the marketplace -- the preference still seems to be vendor-supplied code with an associated support model. Yes, ZFS came close, but it's still mostly a curiosity vs. a mainstream choice.
Lots of discussion around memory vs. disk: RAM in the server, flash accelerators in the storage stack, etc. Prices are going down, data volumes are going up, and no one is ever satisfied with the performance of their environment.
And, yes, today you can find all-flash Hadoop environments (backed by an archival data store) in the real world if you go looking. And I'm betting we'll see more.
Much Work Left To Do
If Hadoop is going to make a serious impact in the enterprise, it's going to need the same characteristics demanded by other enterprise platforms. Predictable performance. Consistent management. High availability. Data protection, backup and business continuity. Security and compliance. Non-disruptive upgrades. Cost-effective infrastructure that's consistent with everything else. And probably more.
Can all of that eventually find its way into Hadoop distros? Perhaps, in time.
But my guess is that even if that goal is eventually achieved, it won't be what enterprises want: they'll want their Hadoop environment to work consistently with everything else they're already responsible for.
The same performance management. The same operational processes. The same resource pool that IT uses for everything else. And so on.
The idea of standing up YAIS -- Yet Another Infrastructure Stack -- is not pleasant.
That's just one of the reasons I'm such a big fan of the work VMware is doing with Project Serengeti -- it's a logical extension of proven, adopted tools to a new class of workload: Hadoop and everything else that comes with it. And while it's early days, I think the approach will be very popular indeed.
I also think that Hadoop is but one tool in the belt when considering the bigger goal: helping to enable analytically-agile predictive enterprises, and all the extended platform and development capabilities that go with it -- and, yes, this is a rather unsubtle plug for what Pivotal is doing.
A New Universe, Or Just A Bubble?
It all depends on your perspective, doesn't it?
Following all the action at Hadoop Summit, you could easily come away thinking that Hadoop is a vibrant, thriving self-contained universe, and -- for some people -- you'd be right.
Or you could take the perspective that it's an interesting sub-plot to a much broader story altogether -- a fundamental shift in how we gather and use information to achieve our goals.
I, for one, see the latter.