It must be Data Science Week here at EMC.
I've already written about our latest findings from the EMC Data Science Survey, and shared how EMC Education is offering their first Data Science Associate coursework to support the increased demand for knowledgeable individuals.
It's almost inevitable that all of this would be leading up to an EMC product announcement, and -- yes -- that's the case.
The new Greenplum UAP (unified analytics platform) is especially exciting to me because it appears to be the first integrated software suite that is precisely targeted at the newer style of data science teams.
Here's the pitch: in the new world of big data science analytics, all the value comes from the data science team. Since it's pretty obvious that (a) knowledgeable data science professionals will be very scarce resources for the foreseeable future, and (b) their time is extremely valuable, it makes sense to build tailored software environments that maximize their productivity.
That, in a nutshell, is what makes Greenplum UAP so unique and compelling.
The Back Story
I had to go back and check -- it's been about 18 months since EMC acquired Greenplum. That might not sound like a long time, but it feels like big data analytics has been top-of-mind here at EMC for much longer.
Much of our activity has been to understand the needs of the data science community: who are these people, why are they important, how do they work, what kinds of new solutions do they need?
That sort of legwork leads to a deeper understanding of what's really going on in this space, vs. merely glomming onto the latest tech trend (e.g. Hadoop).
One major milestone occurred when we held what was most likely the industry's first-ever data science conference at EMC World last May. This week's data science survey is another highly visible milestone.
The amazing impact of these data science teams is now extremely visible, and becoming very well understood.
Once the shift is made from analyzing history using a limited number of internal data sources -- to modeling and predicting future behavior using a vast number of external data sources -- well, there's no turning back. It's a seismic shift in the competitive ante.
Understanding The New Competitive Ante
The trick is to aim all this analytical firepower at core business processes that really make a difference.
Let's say you're a mortgage underwriter. At a very high level, your core business process is pricing risk. If you think about it, pricing risk is basically an exercise in attempting to predict the future. Learn how to do this better than your competitors, and you'll (a) make more money, and (b) do a good job of taking business away from your competitors.
Take a look at this chart, and appreciate for a moment what it might mean to add relevant factor after relevant factor to the risk pricing model.
Start with employment history, move to home price trending, add in historical loan data, bake in some census trending, leaven with a bit of geographical hazard risk, saute with local job market trends, and sprinkle with professional and social history of the applicant.
Big Brother concerns aside, you'll have to admit that anyone who uses these additional factors will do a better job of predicting risk than someone who doesn't. And, please keep in mind, the list of relevant factors here is at best a partial list -- there always will be new candidate data sources available to improve the effectiveness of the core business process.
Indeed, the competitive advantage model quickly shifts in favor of those organizations who become adept at seeking out new relevant data sets and integrating them into their model.
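To make the "more relevant factors, better predictions" point a bit more concrete, here's a toy sketch in Python. Everything in it is an assumption made purely for illustration: the three factor names, the made-up rule that an applicant defaults when at least two adverse factors are present, and the use of a Brier score (mean squared error of the predicted default probability) as the yardstick. It is not a real underwriting model, and it has nothing to do with how any actual lender prices risk.

```python
from collections import defaultdict
from itertools import product

# Hypothetical adverse-factor flags (1 = adverse). Names are illustrative only.
FACTORS = ("weak_employment", "falling_home_prices", "weak_job_market")


def make_applicants():
    """Enumerate every combination of flags; assume default if >= 2 are adverse."""
    rows = []
    for flags in product((0, 1), repeat=len(FACTORS)):
        row = dict(zip(FACTORS, flags))
        row["default"] = 1 if sum(flags) >= 2 else 0
        rows.append(row)
    return rows


def brier_score(rows, used_factors):
    """Price risk using only `used_factors`; lower score = better prediction.

    Applicants that look identical on the factors we use are grouped into a
    bucket, and each bucket's predicted default probability is simply its
    observed default rate (an in-sample estimate, fine for a toy example).
    """
    buckets = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in used_factors)
        buckets[key].append(row["default"])

    squared_error = 0.0
    for outcomes in buckets.values():
        rate = sum(outcomes) / len(outcomes)
        squared_error += sum((rate - y) ** 2 for y in outcomes)
    return squared_error / len(rows)


applicants = make_applicants()
# Score the model with 0, 1, 2, then all 3 factors available.
scores = [brier_score(applicants, FACTORS[:k]) for k in range(len(FACTORS) + 1)]
for k, score in enumerate(scores):
    print(f"{k} factor(s) used -> Brier score {score:.4f}")
```

In this contrived setup the error shrinks every time a relevant factor is added. Real data is noisy and real factors overlap, so the gains in practice would be far less tidy, but the direction of the effect is the point.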
Or, let's say you're in the business of providing health care, or insuring health care. You'd like to move to an evidence-based model for recommending optimal courses of treatment.
The same "big data" effect can be clearly seen here as well -- the more divergent data sources you can add to your model, the better your ability becomes for predicting healthcare outcomes.
And, if you think about it, there's no shortage of potential data sets that might be eventual candidates.
Is it any wonder that smart business leaders are now trying to figure out how to build and empower these new data science teams?
Understanding The New Roles
We've now encountered enough proficient data science teams that we believe we have a good understanding of the key roles, and -- more importantly -- how they're markedly different from more traditional BI models.
I think that most people by now are starting to get a deeper appreciation of the lead role (i.e. the data scientist), but there's also a lot that can be learned from the important supporting roles.
This chart is a decent representation of a modern data science team, and -- more importantly -- how the new Greenplum UAP provides an integrated platform that works the way these people work.
From the bottom, we've got the interesting role of the data platform administrator. I've come to understand this as a "data logistics" role: staging large data sets in and out of the environment to make them usable by the data science team. Maybe that doesn't sound like much of a role when we're talking gigabytes, but it certainly becomes interesting and compelling when we're talking many, many petabytes :)
The somewhat new role of "data engineer" understands the nuances of the source data: where it came from, how it was captured, unique contextual aspects, what the metadata might actually mean, and so on. It appears that the role of the data scientist and the role of the data engineer are becoming intertwined in interesting ways.
Most data science teams have multiple analysts: data analysts who can help communicate the key insights gained from the data scientists, and business analysts who can help communicate what those insights might mean for the business in terms of changed processes and approaches.
More and more LOB users are starting to realize they need to have an increased appreciation for what the data science team does, and are investing in their skill sets to facilitate communication and action from key insights.
The Greenplum UAP
Greenplum's mission at EMC is simple: create the best platform available for the proficient data science team. Yes, the individual components matter, but what really matters is how the pieces come together to dramatically improve the productivity of the team.
If you've been following the story so far, you're probably somewhat familiar with the Greenplum Database, and -- more recently -- the Greenplum Hadoop offering. Both are ground-up designed for the world of scale-out big data analytics support. Both have extended toolsets that directly address the emerging roles of the data platform admin and the data engineer.
Both can easily run on the customer's choice of sourced hardware, or in a fully-virtualized environment such as a VCE Vblock, or perhaps using the new integrated Greenplum Data Computing Appliance.
Just to be clear, this is mostly about software, not hardware :)
One of the distinguishing characteristics of the Greenplum UAP is its inherent openness to the team's choices of analytical tools. Proficient data science teams usually employ a wide range of tools to do what they do, making the traditional vendor-specific approach less than attractive. Other than the open source tools that are associated with the Hadoop distribution ("R" and MADlib, for example), EMC is not directly in this business.
What Greenplum *does* do that is turning out to be a big deal is to enable the data science team to create and push analytical functions into the database itself -- extremely close to the data -- and to gain huge performance advantages as compared to the traditional extract/analyze process we're all so familiar with.
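For readers who want to see the difference in shape between the two approaches, here's a minimal sketch in Python. To be clear about the assumptions: it uses SQLite from the Python standard library purely as a stand-in, not Greenplum; the "loans" table, its columns, and the "mean_loss_rate" function are all hypothetical; and since SQLite runs inside the client process, this shows the pattern only, not the performance gain you'd expect from a scale-out database.

```python
import sqlite3


class MeanLossRate:
    """Hypothetical user-defined aggregate: average of loss / balance."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def step(self, balance, loss):
        # Called by the database engine once per row in the group.
        self.total += loss / balance
        self.count += 1

    def finalize(self):
        # Called once per group to produce the aggregate's result.
        return self.total / self.count if self.count else None


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loans (region TEXT, balance REAL, loss REAL)")
con.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [("east", 100.0, 25.0), ("east", 200.0, 100.0), ("west", 100.0, 50.0)],
)

# Traditional extract/analyze: pull every row out, then compute on the client.
rows = con.execute("SELECT region, balance, loss FROM loans").fetchall()
by_region = {}
for region, balance, loss in rows:
    by_region.setdefault(region, []).append(loss / balance)
extracted = {region: sum(v) / len(v) for region, v in by_region.items()}

# In-database: register the function with the engine, ship only the query,
# and get back one small row per region instead of the whole table.
con.create_aggregate("mean_loss_rate", 2, MeanLossRate)
pushed = dict(
    con.execute(
        "SELECT region, mean_loss_rate(balance, loss) FROM loans GROUP BY region"
    )
)

print("extract/analyze:", extracted)
print("in-database:    ", pushed)
```

Both paths give the same answer. The difference is what moves: in the first, the whole table travels to the analysis code; in the second, the analysis code travels to the data and only the summary comes back.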
Greenplum Chorus
Once you get past the enabling technologies -- platforms, analytical tools, etc. -- you'll realize that there's a clear opportunity to create an environment where the data science team can work effectively together.
One easy-to-understand example is the ability to search, browse and quickly visualize data sets. One data scientist told us they spent significant time simply rooting around in their collection of data sets to understand what was out there, and then writing code to visualize what was in each data set. With Greenplum Chorus, that's now an extended search function with built-in visualization -- a huge time-saver.
Another core function in Chorus is the ability to quickly "sandbox" data sets to run quick experiments. To the extent that data experimenters can do what they need to do without an administrative workflow process -- well, that's a huge boost in productivity. Indeed, I'm guessing it won't be too long before we see advanced products like VMware's vFabric Data Director being incorporated as part of these newer environments.
Then there's the collaboration aspect of Chorus: the ability to attach documents, note key insights, and freely share the version-controlled results with other members of the data science team -- including the LOB executives who are writing the checks!
Are there any key or revolutionary technologies in Chorus? That depends on your perspective: to the best of our knowledge, what makes Chorus unique is that it's built entirely on detailed observations of proficient data science teams and how they go about their work. There's not a lot of legacy thinking you'll see in the product.
Does A New Approach Really Matter?
It's not hard to make a compelling case that big data analytics and data science teams are a clear and sharp departure from the more historical and inward-focused BI teams that preceded them.
If that's the case, it makes perfect sense that the platforms and tools they'll need are also a clear and sharp departure.
Not to be overly competitive here, but we believe that the Greenplum UAP approach is very differentiated from -- and ultimately more appealing than -- some of the more familiar vendor approaches.
Take Oracle, for example. Most of their core data management technology was conceived back in the 1980s, meaning that to get decent performance, you've got to throw a lot of custom hardware at it, and still end up achieving sub-standard results. The other thing we encourage people to remember about Oracle's approach is that the concepts of "open" and "community" are, well, rather limited.
When it comes to data science professionals with advanced degrees pushing the frontier of big data analytics, it's best to have an environment where many flowers can bloom, so to speak.
IBM is no slouch when it comes to the world of big data analytics -- at least, in aggregate. All the pieces are there, but you have to ask how they come together to "move the needle" for their customers. More than one industry analyst has used the term "kitchen sink" to describe perceptions around IBM's offering. Frankly, I have to agree. Going through their portfolio is like touring a DIY store: lots of interesting pieces, but some assembly required :)
Getting Started
Obviously, there are lots of things EMC is doing to help our customers move ahead on their big data analytics proficiency agenda.
Perhaps the most noteworthy one recently is the Greenplum Analytics Labs -- pre-packaged workshops that help our customers see what's possible with their data -- and our data scientists.
Once you've shown a business leader what's possible in the new world of predictive analytics -- especially around core business concerns that deliver meaningful competitive advantage -- well, the case for investment is largely made :)
The Game Ahead
2012 looks to be the year when a significant number of business leaders either start to invest in creating a data science function, or decide to accelerate their existing investments.
The big lever is unquestionably talent: attracting new talent, raising the proficiency of existing talent, and making your team more productive by giving them an environment that works the way they do.
Personally, I'm rather pleased that EMC and Greenplum have decided to ditch the more mundane approaches, and go directly for the big pot of gold: creating advanced platforms for tomorrow's data scientists.

Hi Chuck, thanks for this post. It has inspired me to play with Greenplum DB and predictive analytics, and my experiment is now grown into a web-service. I wrote about it in my blog, I thought it could be of interest to you: http://rate-loans.blogspot.com/2012/01/on-big-data-analytics-peer-to-peer.html
Posted by: Alex L | February 24, 2012 at 02:22 PM