An application sends a perfectly formed write to the storage array, but the bits get scrambled along the way. The array isn't aware of the problem, and dutifully writes the malformed data to disk.
All appears well until sometime later when the data is retrieved and found incorrect, which typically kicks off a lengthy sequence of very unpleasant and disruptive activities: forensics around what happened, a quick restore and roll forward, and so on.
It doesn't happen very often, but -- when it does -- it's a painful experience.
And, as part of Oracle Open World, Oracle, Emulex and EMC are announcing the first availability of our joint work to further minimize this sort of problem.
It's All About Handoffs
Borrowing from Oracle materials, it's easy to see where the problem originates. Reads and writes flow from database to storage through a SAN. While each of the subsystems has good internal protection against corruption, problems can arise at the handoffs between them.
A flaky cable or connection, a dodgy HBA, someone unplugging cables and so on -- any of these can potentially introduce data corruption at the handoffs between servers, SANs and arrays.
The likelihood of a problem occurring is somewhat proportional to the number of writes being done; a heavily write-intensive transactional database such as Oracle is more exposed than, say, a user file system. Exposure to potential problems also increases with the complexity of the I/O subsystem: more ports, more hops, etc. -- think large, complex SANs.
Indeed, in very large transactional database environments, database administrators are quite familiar with various sources of data corruption (including this one) and take extensive measures to recover quickly and effectively after an issue.
IT shops tend to put such demanding workloads on enterprise storage arrays such as EMC's Symmetrix VMAX, so it should come as no surprise that both companies have been working together on this issue for quite some time.
Towards A Better Solution
Other than being able to recover quickly, how would you go about eliminating this form of error by creating a higher level of end-to-end protection?
Many years ago, EMC was working with Oracle on this specific issue, and we both realized that if the database and storage array worked together, they both could provide the required integrity checking in an end-to-end fashion. The database could write logical checksums which the array could validate.
If a specific checksum didn't match, the array could reject the I/O and raise an error condition instead of simply writing incorrect data to disk. Problem solved -- at least in theory.
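The validate-and-reject idea can be sketched in a few lines. This is purely illustrative -- it is not Oracle's actual H.A.R.D. checksum format, and the names (`array_write`, `CorruptWriteError`) are invented for the example -- but it shows the essential contract: the array recomputes the checksum the database attached and refuses the I/O on a mismatch, rather than silently writing bad data.

```python
# Illustrative sketch only -- not the real H.A.R.D. wire format.
# The array recomputes the database-supplied logical checksum and
# rejects the write on a mismatch instead of committing it to disk.
import hashlib

class CorruptWriteError(IOError):
    """Raised instead of silently writing scrambled data."""

def array_write(storage: dict, lba: int, payload: bytes, checksum: bytes) -> None:
    # Recompute the logical checksum the sender attached to the write.
    if hashlib.md5(payload).digest() != checksum:
        raise CorruptWriteError(f"checksum mismatch at LBA {lba}; I/O rejected")
    storage[lba] = payload  # only validated data reaches disk

disk: dict = {}
data = b"row data" * 8
array_write(disk, 100, data, hashlib.md5(data).digest())  # clean write lands

try:
    # Simulate a bit getting scrambled in flight: payload changed,
    # checksum still describes the original data.
    array_write(disk, 101, data[:-1] + b"\x00", hashlib.md5(data).digest())
except CorruptWriteError:
    pass  # the corruption was caught at the handoff, not on disk
```

The key design point is that the error surfaces at write time, when the application can retry, rather than months later at read time.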
This groundbreaking work resulted in Oracle's H.A.R.D. (hardware assisted resilient data) initiative, which EMC was the first to support. The feature was introduced sometime around the Oracle 8i timeframe. It was a good effort, and enjoyed decent support from other storage vendors, but it had some serious drawbacks.
First, the approach was completely proprietary to Oracle -- there was no underlying standard. Second, the checksum processing was quite expensive, and -- as a result -- introduced performance problems at scale. Third, the resulting environment turned out to be very brittle indeed: it was release dependent, worked only with raw I/O devices, was sensitive to things being moved around, etc. Certain important features in ASM (like rebalancing and restriping) weren't supported, and so on.
Despite these limitations, we got H.A.R.D. running at a few very large mutual customer sites that were experiencing this form of data corruption, but -- after gaining some practical experience -- it turned out that the cure was often more painful than the disease itself.
By the time Oracle 10g rolled around, H.A.R.D. support was being phased out. A better answer was still needed.
Introducing the T10 PI (Protection Information) Standard
About this time, a new extension to the SCSI standard became available that provided an attractive path to a new approach.
The standard 512-byte SCSI block could be extended to 520 bytes, and the extra information could be used to provide additional protection -- that is, if all the players in the I/O chain could line up.
Another 8 bytes were now available; how could they be used?
16 bits could be used to provide a CRC (generated by the sending application) to ensure that nothing was being lost along the way. 32 bits could be used to provide a redundant logical block address to ensure that the data was written where intended. And the remaining 16 bits could be used to identify the specific application doing the writing.
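A minimal sketch of that 8-byte tuple helps make the layout concrete. The field order below (16-bit guard CRC, 16-bit application tag, 32-bit reference tag carrying the low bits of the logical block address) follows common descriptions of T10 PI, and the CRC uses the T10-DIF polynomial 0x8BB7; the function names are my own, not any vendor's API.

```python
# Sketch of the T10 PI 8-byte protection field per 512-byte block:
# a 16-bit guard CRC, a 16-bit application tag, and a 32-bit
# reference tag holding the low bits of the logical block address.
import struct

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with the T10-DIF polynomial 0x8BB7 (init 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def make_pi(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8 bytes of protection information for one 512-byte block."""
    assert len(block) == 512
    guard = crc16_t10dif(block)          # catches bits scrambled in flight
    ref_tag = lba & 0xFFFFFFFF           # catches data written to the wrong place
    return struct.pack(">HHI", guard, app_tag, ref_tag)

def verify_pi(block: bytes, pi: bytes, expected_lba: int) -> bool:
    """What a PI-aware target checks before committing the write."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return (guard == crc16_t10dif(block)
            and ref_tag == (expected_lba & 0xFFFFFFFF))

block = bytes(512)
pi = make_pi(block, lba=1234)
assert verify_pi(block, pi, 1234)                     # clean block passes
assert not verify_pi(b"\x01" + block[1:], pi, 1234)   # flipped bit caught
assert not verify_pi(block, pi, 5678)                 # misdirected write caught
```

Note how the two failure assertions correspond to the two corruption modes discussed above: data mangled at a handoff, and data landing at the wrong address.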
We now had a useful standard in hand; but work remained to get all the component pieces lined up to make it useful for customers.
And that is what EMC and Oracle are announcing at Oracle Open World: the first customer-available solutions that support this new end-to-end error protection.
The Current Stack
As you can see, the initial solution components -- although generally available -- are very specific. The engineering teams are at work broadening support for more flavors of operating system, HBA, etc. -- contact your respective vendors for details if you're interested.
Although it's true that the Symmetrix has supported the T10 PI standard since 2009, there wasn't really much around to integrate it with at the time.
With this announcement, the Symmetrix VMAX is the first storage array to support this new end-to-end approach for Oracle.
The processing and interpretation of the Oracle-specific instantiation of the T10 PI standard is done by the Enginuity storage operating system running on the VMAX. There are no observable performance impacts from enabling this feature. Error reporting occurs at both the host and ASM level.
Attending Oracle Open World?
If you're interested in this and related topics, I'd invite you to drop by "CON11574 - Triple Oracle Database Performance with 80 Percent Less Tuning for Oracle DBAs" where our own Daron Yar and other EMC engineers will be presenting their latest work for Oracle environments.
Does This Really Change Anything?
Yes, and no.
I believe that the majority of IT shops are unaware of this specific problem and its likely consequences. From their perspective, an Oracle database occasionally gets corrupted, no one is really sure why, a recovery is made, and life goes on.
Does it make sense for them to reconfigure their environment with specific components to protect against this form of silent corruption? That's not exactly clear in the bigger scheme of things.
However, a smaller group of IT shops are quite familiar with the specific issue, and are looking for reasonable approaches to nail this one, once and for all. Anything that takes their critical Oracle database offline for any reason whatsoever causes great pain, and is a matter of great concern.
And the good news for them is that industry-leading vendors are working closely together to provide better solutions. At some point, we can probably look forward to broad transparent support across multiple application stacks, operating systems and hypervisors, storage arrays, management stacks and so on.
But it's got to start somewhere ...