By now, the storyline ought to be familiar.
Around the globe, leaders in both the private and public sectors are clearly smitten with the amazing power of big data and predictive analytics.
Much has been written so far, with more coming -- two of my favorite pieces are here and here.
In this new hegemony, the ascendant rock star is the data science professional, or -- if you prefer -- the data scientist. Much has also been written about this relatively new and essential caste of knowledge worker -- for example, this piece I did a while ago.
EMC, through our Greenplum division, is "all in" on this fundamental shift in how data is being used to create new insights and new value. That's a rather bold position, as I still routinely encounter people who are firmly entrenched in the comfortable and familiar world of data warehousing and business reporting, and can't understand what all the fuss might be about.
But I'm not despairing. I meet more than enough people around the globe who are extremely motivated by the unfolding new world of big data, predictive analytics and data science. And these people are likely to be paying close attention to what Greenplum is doing.
From enabling technology to investing in the data science community, Greenplum is among a small handful of vendors actively bringing the future into the here-and-now for so many of our customers.
And with today's announcements, you'll see clear evidence of that strategy in action: more investments towards building the industry's most powerful ecosystem for exploiting big data analytics.
Understanding The Value Drivers In Big Data Analytics
The key value drivers involved with big data analytics appear to be straightforward.
At the core, there's a small team of very valuable (and somewhat expensive) people who work their digital alchemy: extracting deep insight from petabytes of often unstructured and uncorrelated data.
Thus, it makes sense from our perspective to create enabling technology that maximizes the productivity of these critical teams, and that's what Greenplum Chorus is all about.
Using a social collaboration model, Chorus helps data science teams, projects and communities find and share what's important to them: data sets, content, findings, insights, other data scientists and so on.
Put that collaborative environment on top of a self-service environment that makes resources and workflows easy to provision and manage, and the story gets better.
Support that powerful platform with enterprise Hadoop, the Greenplum database, and any number of analytical tools, and you're starting to get the picture. Add in specialized services (e.g. the Greenplum Analytics Lab) as well as purpose-built infrastructure if needed, and the picture gets even more complete.
Data science teams have told us they really like how Greenplum Chorus makes it easier and faster to do their unique work: either individually, or -- more importantly -- as part of a team.
With that context, it'll be easier to understand why I think these announcements are so important.
Greenplum Chorus Goes Open Source
That's right: the world's leading collaboration and productivity environment for data science professionals is now open sourced. Before too long, you'll be able to see it for yourself at www.openchorus.org.
Not only will this give you easy access to great technology, you'll also get access to a wonderful community of great people who are using and extending that same technology base.
There are many potential motivators for a vendor to move from a traditional licensed model to an open sourced model. In this case, the motivation was clear: to create an open ecosystem around Chorus.
For example, the list of wonderful proposed enhancements coming from our customers was vastly outstripping the engineering team's finite ability to incorporate and productize them effectively. So, in some sense, the open model creates a fast-track for people who want to push the boundaries of what Chorus can do by building on what we’ve already done internally.
I think it's also fair to point out that open source is quickly becoming the de facto expectation in the data science community: witness the success of Hadoop, R et al. These people do appreciate enterprise class support when and where it's important, but they also appreciate open access to the source code if needed.
Kaggle Partners With EMC
If you're not familiar, Kaggle has one of the most fascinating propositions in the data science world. They've figured out a scalable and productive model for crowdsourcing global data science talent against specific sponsored challenges.
Personally, I find what they're doing simply amazing.
Here's the picture: on one hand, you've got an ocean of interesting data science puzzles to be solved. On the other hand, good data scientists are incredibly scarce these days.
Supply and demand appear to be sharply out of balance, and that's expected to be the case for the foreseeable future.
Moreover, there's a well-understood effect where domain experts may not always come up with the best predictive models as compared to people who might not be as acquainted with the topic. The supposed non-experts can often come up with valuable relationships and predictors that might be discounted by a domain expert.
In a nutshell, Kaggle runs open contests for data scientists. E.g. here's a set of sponsored contests, here are some relevant data sets, here's the prize -- may the best predictive model win!
Sponsors only pay for the best results. Participants get to tackle interesting problems potentially outside their familiar domains, as well as earn a bit of recognition in the data science community, not to mention that there’s some pretty good money on the table as well.
Kaggle has succeeded in creating a global marketplace of advanced data science talent, easily accessible to anyone and everyone. Stunningly brilliant, from where I sit.
As part of today's announcements, we're unveiling an important integration between Kaggle and Greenplum Chorus. Chorus users can now easily extend their "virtual talent pool" to Kaggle's global and growing community of ~55,000 data scientists.
In a world of scarce data science talent, that's a very powerful capability.
Conversely, independent data science professionals can advertise their capabilities and availability for contract work to the growing community of Greenplum Chorus users.
It's an ecosystem play where everyone wins.
By the way, the existence of a large, global Kaggle community enables a rather intriguing opportunity to do analytics on the participants -- who's entering competitions, who's winning them, etc. As an example, check out this cool infographic on global Kaggle submissions for a quick taste of what's now possible.
Chorus Integrates With Gnip
A significant majority of data science work now involves correlation against social feeds: Twitter, Facebook, LinkedIn, Tumblr, Reddit, etc.
Gnip has quickly emerged as the leading enterprise provider of value-added social data feeds, which in turn are an important ingredient in so many data science projects.
As part of today's announcements, both companies are announcing an API integration. Chorus users can easily discover Gnip's wide range of enhanced social feeds from within the Chorus environment, and quickly import selected data sets as part of a Chorus sandbox if needed.
Gnip takes care of discovering relevant social streams, adding in some value-added filtering and transformations (e.g. Klout score weighting for Twitter sentiment), and supplying ready-to-consume data feeds for any number of data science projects. Chorus now exposes those capabilities, and makes them incredibly easy for the data science team to consume.
Again, an ecosystem play where everyone wins.
Chorus Integrates With Tableau
As I understand it, one of the most useful activities in any data science project is the first-level exploration of data sets -- a quick view of top level correlations that provide a "heat map" for subsequent exploration and drill down.
A useful analogy might be oil exploration: the first team in does a "big picture" survey that identifies a few key areas worth (ahem) drilling into.
Enter Tableau Software, the emerging go-to software vendor for this important part of the workflow. While there are many analytical products on the market, Tableau appears to have carved out a valuable niche around speed and agility: getting to first-level insights quickly and painlessly.
Recently, Tableau was a sponsor and participant (along with EMC) at the Human Face of Big Data event. As part of our "mission control" activities, we used Tableau to get near real-time analytical insight into what we were seeing from all those mobile apps that had been downloaded -- getting to that all-important "first glimpse" of the data patterns.
As part of this announcement, Greenplum and Tableau are announcing a set of integrations that enable the strengths of Chorus and Tableau to be combined -- moving into and out of either environment as projects progress.
Once again, an ecosystem play where everyone wins – much like Greenplum has done with SAS and Alpine.
Data Science -- A New Way Of Working
One of the things I inevitably encounter in most customer discussions is having to draw a clear distinction between the familiar world of data warehouses and business intelligence and the new world of big data and predictive analytics.
Or, put differently, "data science isn’t your father's BI".
The people who understand this key difference are now looking at the world differently.
They understand data science is an entirely new way to glean insights from vast quantities of seemingly uncorrelated information.
They understand the need for new roles, new expertise and new ways of working.
And they understand that tools and methodologies designed for the old ways of doing things won't cut it in this brave, new world.
If you fit this mindset, you'll appreciate what the Greenplum team is doing to drive the adoption of data science across every imaginable industry and public policy pursuit.
Quite clearly, it's an ecosystem play.
Even though today's announcements represent exceptional progress towards that goal, I think we'd all agree -- there's still a lot more to be done!
