Public cloud proponents often wonder -- why is it that the enterprise IT world is not beating a path to their door? After all, each of these services provides great functionality coupled with an attractive consumption model.
I patiently explain to anyone who will listen: enterprise IT is different.
And one of the ways enterprise IT is very different is the critical need to manage availability.
Availability is one of those things that everyone takes for granted -- until there's a problem. When there is a problem, a speedy recovery is expected: service resumed, data restored. It doesn't happen by magic -- there's an enormous amount of hard work that goes into creating highly available environments that protect predictably. And there are never enough resources to do it right.
Most people not deeply entrenched in the topic think this must be a simple matter: a public cloud vendor simply establishes an SLA around their services, and incurs penalties if this is not met. Case closed.
The reality is far more complex: effectively managing availability at scale is a difficult and nuanced discipline. It's incredibly hard to get right in a traditional enterprise setting; much more so when external services are part of the landscape.
What stands out for me is that one of the core tenets of virtualization and cloud -- abstraction -- is in direct conflict with what many availability professionals demand: precise knowledge of configurations and associated failure domains.
And, until we solve that one, I don't think we're going to see a lot of critical enterprise workloads using cloud resources.
Down The Rabbit Hole
While it's true that everyone in IT cares deeply about availability, in larger settings you'll find a handful of availability professionals who look at things very holistically.
They have an intimate knowledge of critical business processes, supporting applications and how they touch the IT infrastructure. They help define and implement the organization's availability policies. They help design and rehearse recovery processes to guard against the inevitable.
And they care very much about the underlying configuration: hardware, software, shared services, etc. The foundation of their work is understanding failure domains, and taking explicit steps against potential issues.
A failure domain can be anything that impacts the usability of an application: failed hardware, failed software, failed data, failed data centers, failed power grids, failed network provider, etc. It can be a small thing, or a very big thing -- it's all part of the landscape that must be considered and protected against.
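To make the idea concrete, here's a minimal sketch (in Python, with entirely made-up resource and domain names) of how an availability professional might model failure domains and check whether two supposedly redundant resources actually share one:

```python
# Minimal failure-domain model: each resource maps to the set of
# failure domains it depends on (rack, power feed, data center, network
# provider, ...). All names here are hypothetical, for illustration only.
failure_domains = {
    "app-server-a": {"rack-12", "power-feed-1", "dc-east", "isp-alpha"},
    "app-server-b": {"rack-12", "power-feed-1", "dc-east", "isp-alpha"},
    "app-server-c": {"rack-47", "power-feed-2", "dc-west", "isp-beta"},
}

def shared_domains(resource_a: str, resource_b: str) -> set[str]:
    """Return the failure domains two resources have in common."""
    return failure_domains[resource_a] & failure_domains[resource_b]

# "a" and "b" look redundant at the application layer, but share every
# physical failure domain -- one rack outage takes out both.
print(shared_domains("app-server-a", "app-server-b"))  # {'rack-12', 'power-feed-1', ...}
print(shared_domains("app-server-a", "app-server-c"))  # set() -> truly independent
```

The hard part, of course, isn't the check -- it's knowing that the mapping is accurate and stays current as things change underneath you.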
Deeper We Go
Protecting against potential failures can be an increasingly expensive proposition as more failure modes are considered, so availability professionals are always trying to balance needs vs. resources.
Business processes and supporting applications are not static things, nor is the infrastructure and shared services they reside on. And the smallest change anywhere in the stack can often transform a well-thought-out availability strategy into a useless exercise.
It's not just the IT team that has a stake in availability. Application owners, business leads, risk management officers -- they all want to know that established policies are being complied with, and recoveries can work as expected in the event they're needed. And, of course, there's never enough money to pay for the protection that people ultimately desire.
Here Comes Virtualization, And Cloud
In some ways, virtualization has been a huge boon to availability professionals. Consider a product like VMware's SRM, and its ability to orchestrate both hardware and software to automate a remote recovery in the event of an infrastructure failure. Or to practice such an event repeatedly and non-disruptively. Architecturally, virtualization nicely containerizes applications, and abstracts them from the underlying infrastructure.
All good, right? Well, no.
The more you abstract, the harder it is to discover physical failure domains. Do I really have two physically independent network links, or just the appearance of that? Is this failover resource in a completely separate data center row, or is it actually located in the same rack? Do my backup source and target have enough isolation and separation? What is true for physical assets is also true for shared software services that applications depend on.
Take that same line of thinking to any external IT service (e.g. cloud), and the same concerns pop up, only worse. Maybe you understood how your cloud service did something six months ago, but things have changed since then -- and how would you know? With the abstraction now far higher up the stack, managing and controlling availability becomes even more difficult.
Wait, you say. Don't cloud services like Amazon's AWS provide things like availability zones and regions? Yes, they do. But there's scant detail on how they're actually implemented and managed, and no guarantee whatsoever that things won't change without notification. Besides, any availability management mechanism offered by any cloud provider is likely to be incompatible with, and distinct from, whatever technologies and processes are being used internally.
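To be fair about what's on offer, here's a minimal sketch (Python with the AWS boto3 SDK; the AMI, instance type, and zone names are hypothetical) of a tenant spreading two instances across availability zones. Notice that nothing in the request -- or the response -- tells you how those zones are physically built, or whether that will quietly change tomorrow:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spread one instance into each of two availability zones, and trust
# that the provider keeps those zones on independent failure domains.
for zone in ("us-east-1a", "us-east-1b"):
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```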
Something's got to give ...
Where The Rubber Meets The Road
There's an illustrative conversation that's happened hundreds of times, and is likely to happen many more times.
The business is considering using a public cloud for all or part of a critical business process. There's a meeting that inevitably happens between the various stakeholders, and the person who's responsible for IT availability.
The question gets asked: can you guarantee the required level of availability that we need? And the availability pro inevitably has to answer "no, I can't". He or she has no visibility into physical configurations, no visibility into change control processes, nor into anything else that's essential for managing availability from an enterprise perspective.
Given their need to manage availability, the public cloud is largely opaque: a black box. The only alternative is to extract a high-level SLA guarantee from the service provider, and hope for the best. And that usually won't cut it from a business perspective.
Now, to be clear, many cloud services are highly available -- perhaps more highly available than their enterprise equivalents if they're used as intended. But we're discussing a different concept here -- the enterprise need to manage availability using documented and repeatable processes.
Solving this challenge -- and opening up more public clouds for enterprises that demand compatible availability management -- won't be an easy task, and we're not likely to see workable solutions in the short term.
For starters, each tenant will inevitably need access to their view of an extended CMDB, rooted in physical reality. Understanding the underlying configuration enables intelligent choices around workload placement, and the invocation of whatever services are required for protection.
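Nothing like this exists off the shelf today, but as a rough sketch of what a tenant-facing view might look like (Python, with a wholly hypothetical record layout), imagine being able to see where your virtual resources actually land and to place a failover target accordingly:

```python
from dataclasses import dataclass

# Hypothetical tenant-visible CMDB record: each virtual resource tied
# back to the physical constructs it actually runs on.
@dataclass
class CmdbEntry:
    resource: str       # tenant-visible name, e.g. a VM or volume
    host: str           # physical host
    rack: str
    data_center: str
    last_changed: str   # when the provider last altered this mapping

def pick_failover_target(primary: CmdbEntry,
                         candidates: list[CmdbEntry]) -> CmdbEntry | None:
    """Choose a failover target that shares no rack or data center with the primary."""
    for c in candidates:
        if c.rack != primary.rack and c.data_center != primary.data_center:
            return c
    return None  # no sufficiently isolated candidate -- escalate, don't guess
```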
Each tenant will also want full visibility into change control processes and version control -- and how inevitable changes might end up impacting their environment. Finally, each tenant will want audit-grade monitoring and reporting, not only for internal needs, but to satisfy external compliance requirements.
Not a lot of that out there today, is there?
The Counter View
There's a school of thought that believes that -- in an ideal world -- none of this should be needed. Application developers should anticipate potential failure modes (infrastructure, data corruption, shared services, etc.) and implement the required redundancy as a component of their application design.
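In that model, every application carries something like the following (a minimal Python sketch, with hypothetical service endpoints): detect a failure, fail over to an independent replica, and move on -- the application, not the platform, owns availability.

```python
import requests

# Hypothetical service replicas, deliberately deployed in what the
# developer believes are independent failure domains.
REPLICAS = [
    "https://orders-east.example.com",
    "https://orders-west.example.com",
]

def submit_order(payload: dict) -> dict:
    """Try each replica in turn; failover logic lives in the application itself."""
    last_error = None
    for base_url in REPLICAS:
        try:
            resp = requests.post(f"{base_url}/orders", json=payload, timeout=5)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc  # note the failure, fall through to the next replica
    raise RuntimeError("all replicas failed") from last_error
```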
While this might be a reasonable and logical approach for many of the web-scale players (e.g. Google, Netflix, etc.), it utterly breaks down for the majority of enterprise IT shops. SAP, Oracle and Exchange are going to manage their own availability? Really?
Enterprise IT shops run vast portfolios of applications from many different sources. Applications are often chained together into business processes, which end up being the entity that needs to be protected vs. individual components. There's a strong motivation to use common mechanisms and processes for the entire portfolio vs. a complex array of point approaches.
No, in the enterprise world, availability will be thought of as a service provided to applications, and not likely to be embedded in applications themselves. And I can't see that changing anytime soon.
Where Do We Go From Here?
The shape of today's public cloud market is clear: there's Amazon and then there's everyone else. But what is also clear is that the majority of critical enterprise workloads are staying clear of most public clouds for a variety of reasons, including this important one.
Private clouds (or their hosted equivalent) appear today to be the preferred approach for enterprise workloads: availability can be managed using technologies and processes that meet enterprise requirements, and don't force uncomfortable compromises. I would expect newer hybrid clouds that are process- and technology-compatible with today's private clouds to be the logical beneficiaries of enterprise workloads in the future.
And, yes, as a VMware employee I'd like to see vCHS evolve in this direction.
No surprise, virtualization and cloud have been transformative for our IT industry: lower costs, better agility, and a change in the way IT services are produced and consumed. There's no going back.
But as we look at how availability is practiced in larger enterprises, it's clear to me that we've accumulated a significant amount of technical debt.
And the sooner we pay off this debt, the sooner we can get on with things.