Gardens are wonderful things.
You plant your seeds, and hopefully they sprout. You obsessively watch for weeds, pests and the occasional wandering child.
And then it's payoff time -- a cornucopia of wonderful fruits and vegetables.
Yesterday I spent most of the day sitting in on EMC IT's first-ever big data user summit -- 60 people from across the business, all learning from and sharing with each other. I couldn't help but think -- what a wonderful garden we're growing these days: there's plenty at hand to harvest, with an even bigger bounty on the horizon.
There are certainly organizations whose internal big data analytics efforts are far more proficient than EMC's. But most that I encounter look longingly at what we and others have done, and desperately want to get their organizations moving in the same direction.
To the extent that we can share our learning, methodologies and experiences -- perhaps we can help others who haven't started their journey yet.
To Begin With
Historically, EMC has not been the most analytically proficient company when it comes to our internal operations. Sure, we used BI and reporting, and you'd occasionally see a simple analysis using limited data sets, but those tended to be the exception rather than the rule.
Along the way, so many of us became completely enamored of -- and bedazzled by -- the potential that lay at hand. And as we looked inside our own company, we realized we had some heavy digging ahead to transform how we did business internally.
Time to get busy planning our big data garden.
Not ones to shy away from a challenge, we plunged ahead. Big data analytics proficiency was clearly communicated as a new organizational goal. A senior executive was appointed to act as sponsor.
And EMC IT did something exceptional -- they built on their existing ITaaS capabilities to establish a business analytics as-a-service (BAaaS) platform.
The party really got going a bit over a year ago: clear mandate, platform available, resources at hand, etc.
What can we share about learning to grow a corporate big data garden?
A lot, it turns out.
Pragmatic Learning At Hand
These slides come to you courtesy of Malte Bernholz, VP of Corporate Consulting at EMC. Not to be confused with EMC's global services group, the corporate consulting team is our in-house business consulting service.
At the beginning of the year, he was asked to study all of our various internal analytical efforts, spot the trends and make key recommendations.
His presentation was just one of many very interesting talks as part of this all-day event.
As of January 2013 (about a year into the program), he quickly found 17 ongoing initiatives, with 6 more planned. While we've sanitized the actual initiatives, they span multiple parts of EMC's business -- from marketing to HR to IT to finance to engineering to manufacturing to customer service.
The takeaway for me was simple: big data analytics is a toolset and methodology that can be used across the business.
Imagine if instead we had stood up 17 separate analytical puddles across the business. Besides being hideously expensive, it wouldn't be nearly as effective: there wouldn't be the pooling and re-use of data sets, models and expertise.
You can grow all sorts of great things in your big data garden if you plan accordingly :)
Not all of the initiatives are true "big data" ones, either. Three of the projects were mostly about simply sourcing the right data and applying fairly simple analytics.
Five of the projects were mostly about better analytics on existing, structured data sets. The majority (9) used many diverse sets of data, as well as sophisticated analytics.
Over time, projects tend to move up-and-to-the-right: more data, more data sources, more powerful analytics. And there are six new ones that aren't shown here.
This internal population now gives us the opportunity to understand what works -- and what doesn't -- when it comes to growing a big data garden.
The Three Big Data Capabilities
Malte breaks down the observed successful model into three logical components: the team, the process and the platform.
We as technologists inevitably become obsessed with debating the merits of the platform; the real action is elsewhere.
Fail to get the team right, and the best process/platform won't help. If the team fails to use discipline in their process, no platform will help.
It's only when the team and the process are lined up that we can have a legitimate discussion about the platform.
Getting The Roles Right
Malte identified five key roles that lie at the heart of the most successful internal initiatives.
The project lead is a key role. This individual must be exceptionally goal-oriented, as this kind of work inevitably creates all sorts of #brightshinyobject distractions.
While one good project lead might be able to run multiple analytics projects, it's hard to imagine that lead splitting their time across other, unrelated work.
From left to right, the next role is that of "business analytics and project management". Their job is to interact with the business side of things -- extracting knowledge from subject matter experts (SMEs), bringing domain expertise into the picture -- and, once findings are established, driving the operationalization of the results (no small task!).
A relatively new role turns out to be exceptionally important here: the data engineer. Available data must be discovered, determined to be relevant, sourced, cleansed and made available for use by others. No data at hand, no data analytics. And data doesn't magically appear in the analytical environment, it takes hard work.
Based on our EMC internal observed experience, this data engineering effort turns out to be the longest/hardest/most challenging part of any project.
Next, we have our familiar data scientists or perhaps analytical experts. If everything is set up for them (questions well-framed, data is at hand, resources available, an engaged team is involved, etc.) their work proceeds surprisingly quickly. If, however, they have to spend time helping to frame the question, sourcing the data, getting people to listen to their findings, etc. -- well, this drags on very long indeed.
Finally, we have the "IT Rep" role. This is a cross-functional role that masks the complexity of IT realities from the people trying to get something done. In addition to providing the platform and the tools, there are frequently data governance issues which must be navigated, as well as helping the data engineer do their data sourcing job.
Note that the "IT Rep" is a single role, and not a long list of people you can call and ask for favors :)
Getting The Process Right
It looks rather obvious when it's laid out this way, and I guess that's the point: follow the process, and you'll get results.
It starts with defining the business question at hand: very precisely and unambiguously. That's not as easy as it sounds, which is where executive sponsorship comes in.
There is a vast universe of potential questions that are worth answering -- but which one should we tackle first, and why?
Part of the value that Malte's team brings to the picture is helping to frame that all-important first question.
The "science" part of data science comes next.
Collect multiple hypotheses, decide on a few key ones, design experiments, set them up, run them and interpret the results. Iterate until you've got something that's valid, or you've proven to yourself that no valid hypothesis exists.
The hard part here is setting up the experiment, which inevitably involves sourcing data, which may or may not be easily available, and so on. There's an interesting back-and-forth between difficulty-of-data-sourcing and experiment design -- not all experiments in data science are feasible!
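To make the hypothesize-experiment-interpret loop a bit more concrete, here's a minimal sketch in Python. The data and the helper function are my own illustration (not part of EMC's BAaaS platform): a simple permutation test that asks whether two sourced data columns are actually associated -- or whether, sometimes, the world really is a random place.

```python
import numpy as np

def permutation_test_corr(x, y, n_perm=2000, seed=0):
    """Test H0: no linear association between x and y.

    Returns the observed Pearson correlation and a two-sided
    permutation p-value (shuffle y, recompute r, count extremes).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(y)  # break any real x-y pairing
        if abs(np.corrcoef(x, perm)[0, 1]) >= abs(observed):
            count += 1
    # add-one smoothing so the p-value is never exactly zero
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value

# Toy experiment: a planted linear relationship plus noise
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
r, p = permutation_test_corr(x, y)
print(f"observed r = {r:.3f}, permutation p = {p:.4f}")
```

A small p-value here supports keeping the hypothesis; a large one is the "disproven" outcome -- which, as noted below, is a useful result in its own right.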
At some point (usually) you come up with a predictive model that's worth putting into practice. Yes, the model might improve over time, but you want to move ahead with what you've discovered.
Operationalization comes next: putting the learnings into practice. The final difficult hurdle is organizational change management -- getting people to do things differently based on the model, and not their intuition or experience.
The ultimate goal is to increase the velocity of the process end-to-end: from framing the question to measuring results. Currently, the left-hand side of the process runs about 6 to 12 weeks. The stated BHAG (big, hairy, audacious goal) is to engineer this so it can potentially be done in 5 days -- and that goal is driving a lot of hard work in IT around the BAaaS platform: data pre-sourced, previous results already at hand, a community of practitioners to draw on, easy-to-consume resources, etc.
The right-hand part of the diagram is more challenging, of course -- as no one has come up with an effective organizational change management technology yet!
Part of the answer inevitably lies in simply doing more of it. The first behavior change to a data-driven process can be the most difficult; subsequent ones can be incrementally less challenging.
In essence, this chart represents one aspect of a "continual business process improvement machine". And -- of course -- we'd like to scale it as much as possible.
Pruning The Tree
One of the important discipline challenges is what Malte describes as "pruning the tree" during the experiment phase.
Many, many distractions await the team, and it's the role of the project lead to keep everyone on task, and defer the discovered branches (all valid, by the way) until some later time and place.
Keep in mind, this is about Big Answers (not big data), and someone is paying big bucks to answer a very specific question.
Malte's 7 Recommendations
#1 -- Get the team right: the roles and responsibilities.
#2 -- Don't skip the question-framing exercise -- it drives everything else. Ditto for brainstorming hypotheses -- the more diverse, the better.
#3 -- The actual data science happens quickly. Setting up the environment -- and driving the resulting changes -- takes real time and effort.
#4 -- Focus on the data that's at hand, and don't hypothesize the perfect data set. However, if you've got a strong hypothesis that supports gathering new data, don't discount that.
#5 -- Shamelessly re-use other teams' data sets, models and findings. Share your sandbox with others.
#6 -- Disproving a hypothesis is also a useful result, but not nearly as satisfying. Sometimes there's no correlation and the world is a random place -- that's good to know.
#7 -- Put in place a process to deal with the many "tree branches" you'll inevitably encounter along the way.
Acknowledgements are in order: to Vijoo Chacko and SuiLin Yap with EMC Corporate Consulting for assisting with the inventory work, and to Frank Coleman of EMC Global Services for much of the summarization and findings.
I cringe when certain folks describe big data environments as magic black boxes: mountains of data in, predictive models out. Or when they focus on a particular tool (e.g. Hadoop) vs. the entire environment required to get meaningful business results.
I find EMC's internal experience to be very representative of what we might inevitably see across more organizations in the near future: large, shared analytical environments, a "shopping mall" of ready-to-go data sets, plenty of resources and a wide selection of tools, and -- most importantly -- pervasive collaboration across the business. Not to mention a lot of hard work to get going ...
Because, after all, data science is a team sport :)