I've known Charles through many roles over the years (he has a most fascinating career story) and I can always count on him for a fresh, insightful perspective.
The event itself was pretty interesting as well: it was one of EMC's TAP (Technology Advisory Program) sessions.
Various groups at EMC routinely run small, informal forums where customers and partners are invited to come in around a specific topic, listen to what we have to say, and -- most importantly -- share their unique perspectives.
This one did not disappoint -- we were discussing the shifting database landscape, and wanted to really understand what our customers were seeing and experiencing.
Charles was kind enough to lead a discussion on the future state of databases and data management.
After the session, I asked Charles if I could share the essence of his presentation with a wider audience. Graciously, he agreed.
I hope you find this as thought-provoking as I did.
The Core Premises
How we use information is fundamentally changing: call it big data, call it cloud, call it mobile web apps, whatever. In this world, data management and data fabrics become very important as you're essentially using information in new ways.
But as we look across the landscape today, the majority of enterprise IT is deeply invested in traditional relational databases. How will we get to the new world?
Charles presents a case for five forces -- three disruptive forces, two bridging forces -- that will encourage us to collectively let go of the old world, and embrace the new one.
Relational Databases -- The Familiar Starting Point
The vast majority of structured information today lives in the familiar relational databases that we all know and love so well. It's very mature technology -- 40+ years old. It does what it was designed to do very well. And relational databases aren't going away anytime soon.
Today, relational databases are used throughout the workflow of information: from transaction capture, to operational reporting, to data warehousing. Because they must capture transactions, they maintain ACID properties throughout -- either a transaction happened, or it didn't -- no ambiguity allowed!
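To make that "no ambiguity allowed" point concrete, here's a tiny sketch of the all-or-nothing transaction guarantee, using Python's built-in sqlite3 module purely as a stand-in for any relational database -- the two account updates either both commit, or neither does:

```python
import sqlite3

# SQLite here is just a stand-in for any relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # both updates commit together, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on any failure, the transaction rolls back and neither row changes

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 50), ('bob', 50)]
```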
An incredibly massive ecosystem of applications and expertise has been developed on top of relational databases, with Oracle of course being the clear market leader in most enterprise settings.
Because of the massive investments in applications and expertise, the data management layer is perhaps the most "sticky" part of the entire IT stack. The notion of having to switch one database technology out for another makes most people shudder.
During the meeting, Charles asked "How many of you use a lot of Oracle in your environment?" Predictably, all hands went up. Then he asked "How many of you like Oracle, and want to use more of it going forward?" Not a single hand stayed up.
I think Charles did this to illustrate a point: the database layer is so sticky that most users feel trapped. This doesn't seem to hurt Oracle's business model, but it's not ideal from most customers' perspective.
But the world is changing.
Charles then went on to outline the five forces that he believes will cause the majority of data management thinking to embrace a new model going forward -- dubbed the datacloud. Relational databases will undoubtedly be around for generations (like mainframes!), but they will inevitably be seen as less strategic for five distinct reasons.
#1 -- From CRUD to CRAP
You've got to love these acronyms -- but there's a very serious discussion underneath.
His observation was that in the relational model, everything was a transaction. You created a bit of information, it was read back by various applications, it would be updated, and eventually deleted.
Create, Read, Update, Delete == CRUD. There, you're never going to forget that, are you?
But big data and new processing models are changing that model very fast. In addition to the mountains of information we generate as human beings, an even greater tsunami is approaching: machine-generated information.
Not only do the sheer volumes break our traditional relational model, but the usage model is distinctly different as well -- which Charles describes as CRAP.
You create an information object. It gets replicated to different places for reuse and perhaps protection reasons. You never update it; you simply append another, more current bit of information. And, in a big data world, it all gets processed -- you're interested in the entire history, not merely the latest value.
Create, Replicate, Append, Process == CRAP. You won't forget that one either, I'll bet.
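If it helps, here's a toy sketch of the contrast -- the names and data below are made up, but the pattern is the point: CRUD keeps only the latest value of a mutable record, while CRAP keeps appending immutable facts and processes the whole history:

```python
import time

# CRUD style: one mutable record per entity -- the latest value is all you keep.
customer = {"id": 42, "balance": 100}
customer["balance"] = 75          # Update overwrites history
# del customer                    # Delete it, and it's gone for good

# CRAP style: an append-only log of immutable events -- the history *is* the data.
events = []                       # imagine this replicated across many nodes

def append_event(entity_id, reading):
    """Never update in place; just append another, more current fact."""
    events.append({"id": entity_id, "reading": reading, "ts": time.time()})

append_event(42, 100)
append_event(42, 75)

# Process: analytics run over the entire history, not just the latest value.
history = [e["reading"] for e in events if e["id"] == 42]
print(history)                    # [100, 75] -- both facts survive
```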
The assertion is that the majority of valuable applications created in the future will be CRAP-like applications. These new applications will need different capabilities, and strict compliance with the old relational model will be near-impossible to achieve.
The iron grip of the relational database loosens as a result. It has already clearly started in the big data world, and it will inevitably continue.
The CRUD style never goes away, it just becomes comparatively less relevant over time.
#2 -- Data Will Be Everywhere
The previous model assumed that IT owned the stack, the data lived in a limited number of places, standards could be enforced, and so on. All signs show that world is giving way to a hybrid cloud model, with plenty of SaaS applications to go around.
If you're lucky, you get high-level APIs to get information out and do something useful with it. Your newer applications will have to learn RESTful interfaces if they're going to do anything with all that dispersed, SaaSy data.
That part seems to be inevitable.
In the act of creating these newer API-based data fabrics, the dependencies on specific relational database technology are greatly lessened. You're now talking at a much higher level. And that makes it far easier to slide in non-relational data providers underneath.
Nobody's talking directly to the database anymore. As a result, the death grip of the relational database gets weakened again.
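Here's a rough sketch of what "talking at a much higher level" looks like from the application's side. The endpoint and field names are entirely hypothetical, but the pattern is the point: you call a published REST API (using Python's requests library here) and get back plain JSON, with no visibility into -- or dependency on -- whatever database sits underneath:

```python
import requests

# Hypothetical SaaS endpoint -- the application talks to a high-level REST API,
# not to whatever database the provider happens to run underneath.
BASE_URL = "https://api.example-saas.com/v1"

def fetch_open_opportunities(api_token):
    """Pull records through the provider's published API instead of a SQL connection."""
    resp = requests.get(
        f"{BASE_URL}/opportunities",
        params={"status": "open"},
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()   # plain JSON -- no idea (or need to know) what stores it

# opportunities = fetch_open_opportunities("my-token")
```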
#3 -- The Democratization Of Data
In any modern information-driven business model, *everyone* is a voracious consumer of analytical data products: CEOs, truck drivers, sales people -- everyone. Going even farther, many businesses are moving aggressively to offer "information products" to their customers and partners -- the "product" is the information itself.

The familiar spreadsheets and BI reports of today give way to powerful applications where virtually anyone can visualize, probe and collaborate around data insights. Everyone wants to see data their own unique way, nobody wants stale data, and certainly nobody wants to wait for it.
Putting everything into a limited number of relational databases (or even data warehouses) just can't keep up in this world: performance, agility, capacity, cost, etc. The technology simply wasn't designed for this.
The familiar model of running extracts and batch reports and sending copies to everyone will probably never die; it just gives way to an entirely different model -- just-in-time information.
As the business demands more from its information, new technology is inevitably brought in to meet the new requirement. And the grasp of familiar relational databases won't be as strong as it used to be.
#4 -- Virtualization And Automation
The IT world is virtualizing -- and quickly. Familiar server virtualization has now expanded to software-defined data centers, and even production databases are now routinely virtualized.
But something important is also happening -- resources are being delivered as a service in the process. Users inevitably shift their focus to ease of consumption and simplified experiences, and care far less about exactly *how* the service is delivered: e.g. here's your SQL-compatible data service -- go have fun.
In any virtualized ITaaS model, do you as a user really know (or care) what kind of processor your app is running on? Of course not. If I'm consuming data management services, at some point I won't really care what database is behind the scenes, providing the service.
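Here's a small sketch of that idea, assuming SQLAlchemy as the generic SQL interface and purely hypothetical connection URLs -- the application code stays the same no matter which engine the data service swaps in behind it:

```python
from sqlalchemy import create_engine, text

# The consumer codes against a generic SQL interface; the engine behind the
# connection string is the service provider's concern, not the application's.
# (Both URLs below are hypothetical placeholders.)
DATA_SERVICE_URL = "postgresql://analytics:secret@dataservice.internal/sales"
# DATA_SERVICE_URL = "sqlite:///local_copy.db"   # swap providers, keep the code

engine = create_engine(DATA_SERVICE_URL)

with engine.connect() as conn:
    rows = conn.execute(text("SELECT region, SUM(amount) FROM orders GROUP BY region"))
    for region, total in rows:
        print(region, total)
```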
Virtualization goes hand-in-hand with automation. The provisioning, consumption and housekeeping of different databases becomes much less of an effort all around. All of a sudden, the notion of bringing in a newer data management layer to solve newer problems doesn't sound as onerous as it used to -- the IT team isn't as resource-constrained supporting what they already have.
Charles describes this as a "bridging" force -- it helps both sides along. Virtualization and automation can make the legacy world of databases much easier to manage, which frees up resources to look at newer, better ways of doing things.
#5 -- Open Source
Any mature technology is a candidate for commoditization, and one of the many great things that open source does so well is commoditize mature functionality. Relational databases are certainly mature technologies, and -- already -- there are great compatible open-source based alternatives in the market that many are using with great success.
Look what Linux has done to the proprietary UNIX market in less than a decade -- there's just not much money in it any more.
No massive profits -> reduced investment by the big vendors in R+D, sales and support -> the alternatives start looking more attractive to customers as they realize their existing investment isn't going anywhere.
But, as Charles points out, there's another useful effect of open source: it makes newer "datacloud" technologies far easier to evaluate and prototype as the associated costs are so low.
Sure, there's an initial investment of some sort (time, resources, etc.) but we're talking a tiny fraction as compared to evaluating traditional, proprietary technologies. As a result, they find their way into IT settings much more easily than their historical predecessors.
Put together, you've got open source as another "bridging" force: it helps the legacy relational world by applying price pressure, and it facilitates the new datacloud world by making it easier to evaluate and consume.
The Prototype Big Data System
Charles then shared a simplified schematic of how all the pieces work together in a datacloud model to create new forms of information flows.
Starting on the left, there's sourcing data. Much of it will come from traditional relational databases via the familiar ETL process. Still more will come from a stupefying number of potential external sources.
It all gets landed -- in native form -- in a big data "sink", described here as an "unstructured big data file system". Think HDFS if it helps :)
An "information pipeline" is built over that data to extract value, much like a manufacturing line: real-time processing, interactive processing, and -- finally -- batch analysis.
A serious amount of computing power will be deployed around sifting through these data streams in real-time: filtering, analyzing and correlating. When you're looking for critical insights, time-to-decision wants to be measured in seconds, not days or weeks.
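As a toy illustration of that real-time stage (made-up device names and thresholds, nothing more), here's the kind of filter-and-alert logic that sits on the stream, flagging interesting readings in seconds rather than waiting for a batch run:

```python
# Toy stand-in for the real-time stage: filter and flag readings as they arrive,
# so a decision can be made in seconds rather than after the nightly batch run.
THRESHOLD = 90.0   # hypothetical alerting threshold

def sensor_stream():
    """Pretend feed of machine-generated readings (device_id, value)."""
    yield from [("pump-7", 42.0), ("pump-7", 91.5), ("pump-9", 17.2), ("pump-7", 95.1)]

def real_time_alerts(stream, threshold=THRESHOLD):
    for device_id, value in stream:
        if value > threshold:                  # filter as the data flies by
            yield f"ALERT: {device_id} reading {value} exceeds {threshold}"

for alert in real_time_alerts(sensor_stream()):
    print(alert)
```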
Value is extracted.
The same data then gets another set of important people banging on it: the interactive users. You know, those annoying folks who want to ask question after question about the data. They're the power users of tomorrow, so be ready. Their usage pattern is predictable: more data, more data sources, more questions, more tools -- none of which can ever be delivered fast enough for them.
More value is extracted in the process.
That same information is then used for extensive batch processing (think Hadoop if it helps) to do very deep analytics over even larger datasets with perhaps even more processing power.
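And here's an equally toy illustration of the batch stage, shaped like a tiny map/shuffle/reduce over the same made-up readings. The real thing would run on something like Hadoop across the full data sink, but the pattern is the same: aggregate over the entire history, not just the latest arrivals:

```python
from collections import defaultdict

# Toy stand-in for the batch stage: a map/reduce-shaped aggregation over the
# *entire* history landed in the data sink, not just recent arrivals.
records = [
    ("pump-7", 42.0), ("pump-7", 91.5), ("pump-9", 17.2), ("pump-7", 95.1),
]

# "Map": emit (key, value) pairs from every record in the dataset.
mapped = ((device_id, value) for device_id, value in records)

# "Shuffle": group values by key.
grouped = defaultdict(list)
for device_id, value in mapped:
    grouped[device_id].append(value)

# "Reduce": compute a deep aggregate per key over the full history.
averages = {device_id: sum(vals) / len(vals) for device_id, vals in grouped.items()}
print(averages)   # average reading per device, e.g. pump-7 comes out around 76.2
```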
Even more value is extracted as a result.
The consumers of this value would potentially be (a) other applications that act autonomously on the insight, (b) business users to improve decision making, or (c) a value-added information product sold to others outside the organization.
Finally, all of this runs on a cloud of some sort, but a strong argument could be made that it's not a generic cloud: it's one that's been purpose-built to support the unique needs of the next-gen information pipeline -- described here as a "big data" cloud.
And, if you go all the way back to the starting point of this conversation (i.e. the familiar relational database world we know today), notice that you don't see *any* of that technology here. It simply isn't relevant in this model, except maybe as one of many information donors.
Early examples of these datacloud environments are already out there. They hold many orders of magnitude more information than you'd find in most traditional enterprises.
They use aggregated compute resources that dwarf most familiar data centers.
These environments are always growing very fast, and they appear to be delivering enormous value for the people who know how to use them.
They are the information factories of the future, extracting value from mountains of raw data.
Perhaps these same early examples might use familiar relational database technology here or there. But no one is really paying much attention to it anymore.
Because the world has changed.