About two months ago, I had to go around and turn them all off -- just too much pouring through the floodgates on the topic, overwhelming me. There seem to be an awful lot of people out there trying to make sense of it all.
Sort of like the whole topic itself, if you think about it :)
As one of my projects, I tried to create a clear-eyed, non-marketing synopsis of what EMC was doing around the topic: how we saw the opportunity, the challenges and where we were investing as a company.
Much like a previous piece I shared (EMC And Cloud: An Overview), the result doesn't make for a thrilling narrative, but it does serve as a handy summary of just how many of the pieces we've brought together so far.
As with any overview, I will inevitably forget to discuss one aspect or another -- but there's still a surprising amount we've done on the topic.
As the use of advanced technology becomes more globally pervasive, we are collectively generating a veritable ocean of new information. For some, this is a challenge to be dealt with. But for many, this is a unique opportunity like no other.
We have the tools and skills to extract new and unique insights that weren't available before.
And we can put our insights quickly into practice with new classes of applications.
The new-found power of predictive analytics powered by big data appears to be broadly applicable to almost every human endeavor:
- commercial industries: retail, telecommunications, e-commerce, etc.;
- consumer finance: banking, credit cards, consumer credit;
- manufacturing and distribution: supply chain, quality, demand forecasting, etc.;
- basic science: from molecular genetics / biotech (genomics, proteomics, connectome) to astrophysics / cosmology (e.g., sky surveys);
- healthcare: medical diagnostics, medical records, epidemiology, new pharmaceuticals (new molecules for specific drugs), and (eventually) individualized DNA-based diagnostics and personalized treatments;
- social sciences: economics / economic forecasting, sociology, etc.;
- every engineering discipline: mechanical, civil, electrical, etc.;
- government services: electrical grid, environmental protection and pollution control, water conservation, education, poverty alleviation, taxation, and law enforcement;
- IT disciplines: including capacity planning, service level management and advanced security.
Once fully appreciated, the potential can be breathtaking.
The driving force behind big data can also be seen as a perfect storm of massive new data sources, cost-efficient computing and powerful algorithms – collectively causing a fundamental revolution in how we think about extracting value from information.
Through the fledgling discipline of data science, organizations are finding that vast quantities of existing data can be acquired, correlated and analyzed to gain powerful insights into the world around us.
New insights require new applications to monetize them -- triggering a wave of next-generation applications, as well as newer business models and strategies.
The use of data science and big data analytics has now spread rapidly from a few niche applications, and is progressively being applied to newer challenges across business, science and public policy.
Big data is quickly becoming the new ante for global competition, as well as holding out new answers to the world's most challenging problems. Some describe big data as “the new oil”. Others describe big data analytics as “the new R&D”. We agree.
Big Data Analytics
To help visionaries capitalize on this new source of digital wealth, EMC is focusing intently on technology, expertise and partnerships. Although we believe we've carved out an enviable leadership position in big data analytics to date, even more is required.
There is vastly more data, and vastly more data sources. Speed and agility matter greatly. The resulting storage and compute requirements dwarf familiar database and data warehousing approaches, and demand a completely scale-out approach.
Information logistics – the ability to move (or minimize the movement of) vast information bases – becomes paramount. Governing how information is handled -- ascertaining its sources and eventual uses -- becomes a monumental challenge.
Big data analytics workflows must be designed to ingest and process an enormous variety of structured and unstructured data. A broad array of advanced analytical and visualization tools must be made easy to consume by data science professionals. These same experts need platforms to collaborate and share insights, often in a secure and trusted setting.
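To make the ingest challenge concrete, here is a minimal sketch -- with entirely hypothetical names, not an EMC or Greenplum API -- of coercing structured and unstructured records into one common shape before analysis:

```python
import json
import re

def normalize(record):
    """Coerce structured (dict/JSON) and unstructured (free text)
    input into a common {source, text, fields} shape.
    Illustrative only -- real ingest pipelines are far richer."""
    if isinstance(record, dict):                      # already structured
        return {"source": "structured", "text": "", "fields": record}
    try:
        # Maybe it's structured data serialized as JSON text.
        return {"source": "json", "text": "", "fields": json.loads(record)}
    except (ValueError, TypeError):
        # Fall back to free text; salvage simple key=value pairs if present.
        fields = dict(re.findall(r"(\w+)=(\S+)", str(record)))
        return {"source": "text", "text": str(record), "fields": fields}
```

A pipeline built this way can route every record -- database row, log line, or raw text -- into the same downstream analytical tools.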
The resulting actionable insights create an immediate need for new forms of applications that can detect and quickly react to important patterns across a broad range of information sources.
1. EMC Big Data Technologies
EMC has been steadily investing in assembling a wide range of next generation capabilities to support this new method of extracting value from data.
1.1 EMC Greenplum
The centerpiece of our big data technology is EMC’s Greenplum UAP (unified analytics platform).
It is a purpose-built open software environment targeted at the new generation of data science professionals, helping them quickly process, analyze, discover and collaborate freely using the power of modern analytical tools (e.g. Hadoop, SAS, R, et al.) running on large, scale-out architectures.
In its appliance version, EMC’s Greenplum DCA makes it straightforward for progressive organizations to rapidly deploy hundreds of compute nodes accessing petabytes of storage.
The Greenplum big data analytics platform is greatly augmented by other EMC disciplines: virtualization, security, collaboration, application frameworks, converged infrastructure, data protection, storage and more.
Greenplum's capabilities have been widely embraced by a growing community of thousands of data science professionals, and the platform has emerged as a de-facto standard for newer data science teams.
1.2 Storage Infrastructure
Inevitably, big data in all its forms requires storage -- and lots of it.
Indeed, the value of many predictive models is directly proportional to the amount of data that can be brought to bear on the problem at hand, creating a powerful incentive to construct very large storage farms in support of big data analytics capabilities.
In this light, EMC’s traditional competencies around storage become extremely relevant when discussing big data in any form. At a technical level, the challenges associated with storing, managing and providing access to petabytes and exabytes -- at reasonable cost -- are not trivial.
All three dimensions of large-scale storage -- capacity, performance and cost -- offer the potential for order-of-magnitude improvements. Addressing these challenges in turn enables the deeper and wider use of big data in a variety of settings.
Architecturally, EMC invests heavily in scale-out NAS and HDFS architectures through our Isilon product line; it has quickly become a de-facto standard for big data storage environments, such as those found at the Beijing Genomics Institute.
On the HPC (high performance computing) front, EMC has recently received a US DOE (Department of Energy) grant to research newer technologies combining flash memory and traditional disk storage to create “burst” capabilities in support of ultra-high-speed simulations.
Storage costs must also be addressed: not only technology costs, but operational costs as well.
Compression and data deduplication approaches suitable for traditional enterprise IT environments do not always perform well in either big data or high-performance computing environments; new investments are needed to reduce the physical storage actually required. Active tiering (automatic data movement) between different types of storage media (flash, disk, tape) is also required to control costs, enabling more data to be acquired and processed.
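To illustrate the tiering idea, here is a minimal sketch of an age-based placement policy -- the tier names and thresholds are hypothetical, not EMC's actual tiering logic, which weighs far more signals than access recency:

```python
from datetime import datetime, timedelta

# Hypothetical recency thresholds -- purely illustrative.
TIER_RULES = [
    ("flash", timedelta(days=7)),    # hot: accessed within the last week
    ("disk",  timedelta(days=90)),   # warm: accessed within the last quarter
    ("tape",  None),                 # cold: everything else
]

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Return the storage tier for a dataset based on how
    recently it was accessed."""
    age = now - last_access
    for tier, max_age in TIER_RULES:
        if max_age is None or age <= max_age:
            return tier
    return "tape"
```

Running such a policy periodically migrates cooling data toward cheaper media, freeing the fastest (and most expensive) tier for the data being actively analyzed.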
Data has value -- especially in big data environments -- hence it must be protected against corruption and loss of availability. Traditional enterprise-oriented data protection approaches are challenged to offer both the scale and economics required in these ultra-large environments, creating an opportunity for EMC to extend its current data protection capabilities in new directions.
1.3 Big, Fast Data Applications
Once insights are discovered, there is often a strong incentive to operationalize them, usually in the form of a purpose-built application. Dubbed "big, fast data", these applications present unique architectural challenges as compared to the more familiar enterprise or consumer applications.
In-memory processing is quickly becoming the data platform of choice, as evidenced by the popularity of VMware's vFabric GemFire and SQLFire capabilities. Additionally, new capabilities are being developed to virtualize real-time information feeds from legacy applications that weren't originally designed to meet this need.
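The core pattern behind these "big, fast data" platforms can be sketched in a few lines: hold events in memory and fire registered listeners the moment a matching event arrives. The sketch below is illustrative only -- these class and method names are invented, not the vFabric GemFire API:

```python
# Minimal sketch of an in-memory store with continuous-query-style
# listeners. Hypothetical names; not a real product API.

class InMemoryEventStore:
    def __init__(self):
        self.events = []        # all events held in memory
        self.listeners = []     # (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        """Invoke callback whenever an incoming event satisfies predicate."""
        self.listeners.append((predicate, callback))

    def ingest(self, event):
        """Store the event and notify any matching listeners immediately."""
        self.events.append(event)
        for predicate, callback in self.listeners:
            if predicate(event):
                callback(event)

# Example: flag unusually large transactions in real time.
alerts = []
store = InMemoryEventStore()
store.subscribe(lambda e: e["amount"] > 10_000, alerts.append)
store.ingest({"id": 1, "amount": 250})
store.ingest({"id": 2, "amount": 50_000})   # triggers the listener
```

Because both the data and the matching logic live in memory, the reaction happens in-line with ingest -- no round trip to disk-based storage -- which is what lets these applications act on patterns within milliseconds.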
1.4 Enhanced Collaborative Workflows
Not all workflows can be fully automated: very often, human beings must make key decisions, armed with the insights of the newer predictive models. Dubbed "enhanced analytical workflows", these newer workflows are enabled by EMC's IIG group, built on the foundations of Documentum and the newer xCP application generation environment.
As experience is gained from human decision making around analytical insight, the resulting rules are then formalized in automated applications, enabling continuous process improvement.
1.5 Security And Information Governance
A related challenge is information governance and risk management. In most big data models, information is continually sourced from an ever-changing set of sources, and the resulting insights are put to work in an ever-changing number of ways. This, in turn, creates entirely new forms of information-based risk for which there is no simple solution.
EMC believes we have many of the required ingredients to begin to address this new challenge: rich metadata management capabilities, advanced risk management frameworks (e.g. Archer) as well as emerging information forensics capabilities.
2. Achieving Big Data Proficiency
Technology capabilities are ideally coupled with strong executive leadership to create a pattern of increased proficiency across the organization.
Working with customers and partners, EMC has developed a three-phase model to achieve analytics proficiency and hence competitive advantage.
2.1 Make Data Easy To Discover And Experiment With
This takes the form of an “analytics-as-a-service” shared platform that encourages the wider use of predictive analytics. A “shopping mall” of available data sources is created, augmented by a self-service portal that encourages experimentation and research across different data sources and analytical tools.
This platform service typically runs on a scale-out virtualized and secured cloud, and is complemented by skills training, program management and information governance.
To help our customers achieve this, EMC currently offers solution reference architectures along with program consulting to help organizations build an operational analytics-as-a-service platform. These solutions comprise EMC’s Greenplum UAP, Vblock and RSA technologies as needed.
2.2 Add More Data -- And Data Scientists
The second phase requires adding many more diverse sources of external data, while bringing in data science expertise to build successively better predictive models. This activity almost always results in one or more important insights that demand a formal reaction, usually in the form of a specific application or new workflow.
More expertise, more resources and more tools are required at this phase, along with a strong need for workflow and collaboration between data science professionals and those that work alongside them.
This phase inevitably demands unique data science expertise, and a platform to work effectively. Qualified data science professionals are currently in extremely short supply – a condition that is expected to persist for many years.
In addition to supplying EMC-badged data science experts, EMC has created a popular Data Science certification offered through the EMC Academic Alliance. This coursework helps proficient traditional analysts learn and practice the newer data science skills. This offering has been exceptionally popular since its introduction, with over 1,600 students going through the training.
Many of the advanced features of the Greenplum UAP (unified analytics platform) are targeted at teams of data science professionals to expedite key workflows and collaboration. The use of scale-out infrastructure ensures there’s always plenty of horsepower and capacity for even the most sophisticated predictive analytics.
2.3 Operationalize The Insights
The third phase involves creating purpose-built applications that can act on the important predictive insight gained in previous phases. Known as "big, fast data" applications, these are particularly challenging from an architectural perspective, as they must quickly process and act on data from potentially dozens of sources. This new requirement is starting to create a strong demand for better technology approaches.
EMC is working with selected customers and partners to help create this new generation of big, fast data applications, which is in turn driving new areas of EMC innovation. Key areas of focus include the use of in-memory data stores, parallelized workflows, tiered storage architectures as well as virtualizing information access to legacy data-generating applications.
The third phase also needs a rigorous methodology to specify, build, integrate and operationalize large-scale fast data applications powered by big data predictive analytics. EMC’s recent acquisition of Pivotal Labs forms the core of our expertise in this area, using technology sourced from EMC’s divisions: Greenplum, VMware and SpringSource.
Despite these considerable capabilities, there remains a much broader societal challenge of creating a new generation of business, government and academic leaders who are comfortable wielding the powerful new tools. Again, just about every familiar discipline – marketing, research, life sciences, physical sciences, finance, human resources, manufacturing, education, defense, etc. – is now being transformed by big data analytics. New skills -- and new ways of making decisions -- are needed across the board.
Just as leaders have learned to understand the power of the internet, we believe they must now learn to understand the power of big data analytics and advanced predictive models.
3. Investing In The Data Science Community
Data scientists are widely seen as the new magicians of the big data world; currently, they are in extremely short supply. There is near-universal agreement that academic institutions worldwide are currently not keeping up with demand for this new and valuable skill set.
In addition to investing in data science coursework, EMC (through our EMC Academic Alliance) is working actively to encourage universities to add the fledgling discipline of data science to their existing offerings. Going further, EMC has entered into partnerships with selected universities to define separate and distinct advanced degree coursework in data science – independently from established academic disciplines.
EMC also sponsors a series of successful data science summits: opportunities where data science professionals can gather as a community and exchange ideas. Their direct input regarding their challenges and opportunities drives much of EMC’s research in this arena.
As an example, one challenge identified by the community was the lack of large-scale resources to design and test big data analytics software. To help, EMC recently announced the creation of a 1000-node Hadoop analytics cluster which is currently available for software and algorithm research.
4. The Road Ahead
From an IT-centric perspective, big data analytics is one of the few compelling areas where IT can potentially play a leading role in how business is done.
The road is long, the answers aren't always clear -- but significant progress has been made by EMC and others in creating the capabilities needed in this new world.
The journey has just begun.