The next big thing in IT?
Information classification tools, I think.
Tools that can help IT discover information sources, classify them, make a decision, and process them to save money, make money or stay out of trouble.
The stakes are getting higher.
We can't ask users to classify their own information.
And, according to IDC, we'll have six times as much information sloshing around in a few short years.
All signs point to a personal forecast of extreme customer interest in information classification tools sooner rather than later. It's already started.
Here's why ...
The stakes are getting higher
Let's start with costs. Yes, storage is getting cheaper, and more data reduction technologies are starting to appear.
But underneath it all is the persistent notion of service levels for information -- different tiers of service (performance, protection, etc.) with significantly different costs -- and significantly different service levels.
Cheap storage usually means slow (or unprotected) storage, and not everyone can tolerate that service level.
And I think that just about everyone will want to create very different tiers of storage and protection -- at very different cost points -- to meet the needs of their business. Tiering has been going on for several years in the industry and with customers.
But how will you decide what information will need to be on what tier at what time?
Answer: information classification tools
Let's move on to risk.
Certain kinds of information carry different forms of risk -- the risk of not retaining it properly, or of not securing or encrypting it properly, to mention a few. And few things get more executive attention than having a bad information security day.
But how are you going to find the bits that need to be secured and retained somewhere in the multi-hundred terabyte haystack?
Answer: information classification tools
And then there's creating new value.
Somewhere in those bazillion files and emails and reports and powerpoints is something that someone could use in a different context. Call it enterprise search, call it knowledge management, call it content management -- the idea is the same.
It's an opportunity to create new value from information you already have.
So, how are you going to find the useful bits that ought to be saved and re-used somewhere else?
Answer: information classification tools
Users can't classify their own information.
Heck, I can't even classify my own. Every time I try to set up some sort of scheme to organize things, I end up either finding it too much trouble, or I forget my schema, or both. So it all tends to become a random mess until my disk drive fills up, and then I buy a bigger disk drive.
Asking users to classify information for corporate purposes might be workable in a few instances, but certainly not more broadly. We can't get people to purge their email, let alone adhere to a rational classification system.
So IT will end up having to do it. Hence the need for information classification tools.
A short history of information classification at EMC
EMC made an initial run at this when we started promoting the idea of ILM -- information lifecycle management.
The idea was to create different storage service levels, and align application information to service levels.
But it felt like we were going after things with a very big hammer. All we could do at the outset was very coarse-grained classification, e.g. production applications were tier 1, decision support was tier 2, file systems were tier 3 and so on.
Several years ago, there were very few ways to actually get inside these large beasts, and look at different service levels within a specific environment (files, databases, etc.) so things ended up getting classified at a higher service level (and cost!) simply because we couldn't disaggregate the important bits from the less-important bits.
The first round of EMC technology to do this was with email (EmailXtender through the Legato acquisition). It evolved into a very sophisticated email analyzer that could make all sorts of nuanced decisions about the value of an email, and handle it appropriately. Very successful product for us.
But we learned some interesting things along the way.
The first thing we learned is that you could only get so far with classification just by looking at the wrapper. To do a good job, you had to look inside the email, which opened up a huge can of worms.
The second thing we learned is that the same "classify-decide-process" loop was at the heart of all three outcomes: saving money, making money and avoiding risk. We saw customers do one email archiving project after another simply because there was a new concern or requirement. At its core, it's all the same thing, so I would advise thinking of it that way.
Hence describing it as "information classification" rather than "email archiving".
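To make the loop concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration: the labels, the toy rules in `classify`, the `POLICY` table and the action names are assumptions made for the example, not a description of how any EMC product works. The point is only that one loop with one policy table can serve all three outcomes.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A hypothetical email or file, reduced to a name and its text."""
    name: str
    text: str


def classify(item: Item) -> str:
    """Classify: assign a label by looking inside the content (toy rules)."""
    lowered = item.text.lower()
    if "account number" in lowered:
        return "sensitive"
    if "lessons learned" in lowered:
        return "reusable"
    return "routine"


# Decide: a single policy table maps each label to an outcome.
POLICY = {
    "sensitive": "secure",             # stay out of trouble
    "reusable": "index",               # make money
    "routine": "move_to_cheap_tier",   # save money
}


def process(item: Item, action: str) -> str:
    """Process: a real system would move, secure or index; this only reports."""
    return f"{action}: {item.name}"


def run(items):
    """Run the classify-decide-process loop over a batch of items."""
    return [process(item, POLICY[classify(item)]) for item in items]
```

A new concern or requirement then becomes a new label and a new row in the policy table, rather than a new project.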
But files turned out to be an even bigger problem (and opportunity) than emails. For files, we had a nice set of products that could look at files externally (names, usage, etc.) and do a reasonable job of classification (DiskXtender, VisualSRM and others) and move them to cheaper tiers if no one was using them (or if you fell into disfavor with the IT guys).
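As a rough sketch of what "looking at files externally" can mean, the Python below picks a tier from the file name and the time since last access, without ever opening the file. The tier names, the 90-day and 365-day thresholds and the extension list are assumptions chosen for illustration; they are not taken from DiskXtender or VisualSRM.

```python
import os
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def tier_for(days_since_access: float, filename: str) -> str:
    """Pick a storage tier from external attributes only (never opens the file)."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in {".tmp", ".bak"}:
        return "cheap"      # scratch files never earn primary storage
    if days_since_access > 365:
        return "archive"    # untouched for over a year
    if days_since_access > 90:
        return "cheap"      # untouched for over a quarter
    return "primary"


def tier_for_path(path: str, now: Optional[float] = None) -> str:
    """Same decision for a real file, using its last-access time.

    Note: some file systems are mounted so that access times are not
    updated, in which case st_atime is not a reliable usage signal.
    """
    now = time.time() if now is None else now
    age_days = (now - os.stat(path).st_atime) / SECONDS_PER_DAY
    return tier_for(age_days, os.path.basename(path))
```

The limitation is visible in the function signature: nothing here can tell a stale lunch menu from a stale contract.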
But, once again, we quickly realized that to do a really good job in classifying files, we had to open them up, look inside and make some decisions. Maybe look for things like personally identifiable information, or account numbers sitting in files.
We also found out that, unlike email, some IT shops might not be 100% sure of all the file servers in their enterprise, so discovery became important.
Hence EMC Infoscape. More on that later.
We also found something that didn't work out as well as we had thought. Initially, we thought that there would be a huge opportunity for classifying information in databases. We were kind of wrong.
First, databases tend to be much better controlled and managed than file systems and emails. Not that chaos doesn't occasionally creep in, but -- in general -- there's some pretty good discipline about what gets in and who gets in. Second, it turns out that the application and database vendors (e.g. Oracle, SAP et al.) were in a much better position to create the tools that could sift through information and make decisions. Not much role for a third party like EMC.
We have a good product -- DatabaseXtender -- that works as advertised, but compared to the interest in emails and files, it's more of a nice-to-have than a need-to-have, market-wise.
Oh, yes, and I should point out that emails and files (unstructured data) are where all the explosive growth is, and where the challenges appear to be the greatest.
And, not surprisingly, users were extracting all sorts of interesting information from the tightly controlled databases, and putting it in files and emails. Yikes!
So where does information classification go from here?
I see three axes of potential future development.
One axis will be more powerful forms of classification.
When we look at the engines that can make decisions on information today (emails, files), they're good, but they're not perfect. Actually, I think this capability is just where it should be, because IT users aren't really quite sure what they'll need.
As a historical example, I remember the first time I started putting query tools in front of end users. As they became more comfortable with the concept, their requirements in a query tool virtually exploded, which led to more sophisticated tools, which led to more sophisticated requirements, and so on.
I think the decision engines associated with information classification will go through the same sort of explosive evolution in the next few years.
As an example, today Infoscape can do combinations of keywords, and some fuzzy matching, and there are more sophisticated templates coming for things like personally identifiable information, and credit card numbers, and the like -- but there is room to do far more.
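For a feel of what keyword, fuzzy and template matching look like in practice, here is a generic sketch using only the Python standard library. It is not how Infoscape works internally, which I'm not describing here; the function names, the regular expression and the 0.8 similarity threshold are assumptions for illustration. The card-number template pairs a digit pattern with the standard Luhn checksum to cut down on false positives.

```python
import re
from difflib import SequenceMatcher

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Template match: digit runs that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits


def fuzzy_contains(text: str, keyword: str, threshold: float = 0.8) -> bool:
    """Fuzzy keyword match: True if any word is close enough to the keyword."""
    kw = keyword.lower()
    return any(
        SequenceMatcher(None, word, kw).ratio() >= threshold
        for word in re.findall(r"[a-z]+", text.lower())
    )
```

With these definitions, `fuzzy_contains("Strictly confidental draft", "confidential")` catches the misspelling, and `find_card_numbers` accepts the widely used test number 4111 1111 1111 1111 while rejecting the same digits with the last one changed.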
I got to play with neural networks a long time ago. The neat part is that you didn't have to write rules: you simply showed something to the neural network, and it figured out what was important, and what wasn't. I can imagine an organization feeding manually classified documents to a neural network engine, and it "learning" over time how to classify information in a way that procedural approaches just can't match.
And I'm sure that -- over time -- there will be a rich ecosystem of specialty classification engines (voice, video, etc.) that can snap into an overall framework.
But it makes no sense to invest in all that cool stuff until IT organizations start to use the tools that are already there, and figure out what they need next. Just like query tools were many years ago.
And there's more than enough available from EMC and others to get started.
Another axis will be around developing more sophisticated outcomes once a decision is made.
As an example, one outcome today is to copy (or move) a file or email from one place to another. You might do this for cost reasons, or redundancy reasons, or something else. You might want to have it continue to be transparently visible to the creator, or maybe not. Most of that functionality is there today.
A more urgent example is that you find a file or email with something really sensitive in it. You'd like to immediately drop that puppy into an IRM (information rights management) environment that can secure and audit access to it.
When you consider all the data leakage issues looming with files and email, I'm sure this will be a popular outcome of information classification.
Lastly, while you're grinding through an email or file, you could be collecting keywords that could help someone find something at a later point. Maybe make it searchable, or perhaps land the object in a repository of some sort. Not entirely an urgent matter today, but I've seen signs of it in some customers I've talked to.
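Collecting keywords for later retrieval usually amounts to building an inverted index: a map from each keyword to the objects that contain it. The sketch below is a bare-bones version in Python; the stopword list, the minimum word length and the function names are assumptions for illustration, and a real repository would add ranking, access control and persistent storage.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is"}


def keywords(text: str) -> set:
    """Extract lowercase keywords, skipping stopwords and very short tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


def build_index(docs: dict) -> dict:
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in keywords(text):
            index[word].add(doc_id)
    return index


def search(index: dict, *terms: str) -> set:
    """Return the ids of documents containing every search term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()
```

The index costs extra storage, but it is a by-product of a scan you were already running for cost or risk reasons.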
And, finally, there will be a natural trend to shorten the amount of time between when an information object is created, and when it's classified.
Interestingly enough, many of the technology components to do this largely exist today. As an example, EMC's Celerra has an ability to quarantine a file until something has looked at it and released it.
We use it for anti-virus today (it's called CAVA, I think), but -- architecturally -- the same approach could be used to detect insecure information being written to a file share.
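The quarantine-until-released pattern itself is simple enough to sketch. The Python below is a toy, in-memory model of the idea and makes no claim about how Celerra or CAVA are implemented; the class, its method names and the three states are all hypothetical. The checker is pluggable, so the same gate could front an anti-virus scan or a content classifier.

```python
from typing import Callable, Dict


class QuarantineGate:
    """Holds newly written files until a checker releases or blocks them."""

    def __init__(self, checker: Callable[[str], bool]):
        self._checker = checker             # returns True if content may be released
        self._content: Dict[str, str] = {}
        self._state: Dict[str, str] = {}    # "quarantined", "released" or "blocked"

    def write(self, name: str, content: str) -> None:
        """Every write lands in quarantine first."""
        self._content[name] = content
        self._state[name] = "quarantined"

    def scan_pending(self) -> None:
        """Run the checker over everything still in quarantine."""
        for name, state in list(self._state.items()):
            if state == "quarantined":
                ok = self._checker(self._content[name])
                self._state[name] = "released" if ok else "blocked"

    def read(self, name: str) -> str:
        """Only released files are visible to readers."""
        if self._state.get(name) != "released":
            raise PermissionError(f"{name} is not released")
        return self._content[name]
```

In this model a reader simply cannot see a file until the scan has passed it, which is why the speed of the scanning engine matters so much for real-time use.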
And there are agents out there that can intercept I/O to physical devices (like USB ports) -- they just don't have any policy smarts or sophisticated outcomes to go with the intercept. But it's all possible.
Of course, real-time classification and outcome processing will take some pretty scalable engines to ensure that users don't have to wait -- but that's achievable as well.
The simple truth
As I've written before, I believe that -- ultimately -- IT will be responsible for information in much the same way as the CFO is responsible for money. They'll be held accountable to save money, make money and stay out of trouble.
I call them informationists.
I don't think the problem will be with well-structured databases. I think the problem will be with everything else.
And I think that more and more IT organizations will be looking for ways to turn wild information into managed information.
And information classification tools will most likely be at the center of the discussion.
And, truth be told, I see more and more interest in the topic from the customers I talk to with every passing week.

Chuck, I think what you describe sounds like things in an ideal world. Unfortunately in the real world, most of what you describe is not practical or cost effective. For instance, even if it is possible to scan for data that give some kind of financial benefit, if you don't know what you're looking for, how are you going to evaluate what the value of that information will be when you locate it? In addition, hopefully all, or at least 90%+, of relevant corporate data should be in databases, which you've excluded from the discussion. Whilst having the tools will be essential, I think we will also need a new breed of data analysts who know how to use such data and what to look for, rather than relying on speculative data trawling.
Posted by: Chris M Evans | April 21, 2007 at 12:33 PM
Hi Chris -- thanks for the comment, but I disagree with many of your assertions.
Your first assertion is that scanning (or classifying) information won't be practical or cost-effective.
I would offer as counterpoint the thousands of IT shops that I know of that routinely scan email for archiving or other purposes, and the growing number of shops that are starting to scan file systems for similar purposes. Yes, the scanning and classification is somewhat primitive today, but I think that will come in time.
As far as having to know what you're looking for in order to find value, the jury is out on that one, in two regards. First, I know of several customers that have created lists of keywords and templates to scan for (e.g. looks like an account number, or a project name, or maybe a customer ID). Second, it is possible to create indexes of all keywords scanned for (a-la-google) albeit at the additional cost for storage. And both are being done today.
As far as your assertion that 90+% of relevant corporate information should be in structured databases, well, that just doesn't seem to be the case anymore.
Email does not fall into that category, nor do all the reports, spreadsheets, presentations, memos, etc. generated from that database information.
Ask anyone who manages a large shop where the information growth is coming from, and they probably won't say "databases".
Furthermore, the problem is worse for unstructured information, since a DBMS at least has some basic information management concepts that file systems and the like do not.
So, how do you recreate database-like properties from unstructured information? Well, you scan them, and put the metadata in a repository ...
I do agree with your last point -- we will need a new class of specialists who understand what they're looking for ("informationists"?) in the context of the broader corporate agenda. And I'm sure they'll wield big words like "taxonomies" and such ...
Thanks for writing ... I enjoy the discussion!
Posted by: Chuck Hollis | April 23, 2007 at 08:23 AM
Hi Chuck,
Very interesting topic!
I've been working in the Records & Information Management field for over 30 years and I always found that auto classification or auto categorization does not work very well in today’s environment because there are so many variables associated with having a system determine where structured, semi-structured and/or unstructured records are to reside in a corporate file classification scheme. Over the years, I found that organizations that have a sound records management program in place, with a well established file classification system, appropriate policies and training in place, have greater success in having their employees properly file their documents in the appropriate places within the corporate filing scheme than organizations that have no such records management program in place.
For example, our Legal Department deals with contracts and therefore we have a contract cabinet created in Documentum with an alphabetical listing of customers, which includes specific contract type folders under each customer folder. DCO is used by our legal users to file their emails and documents; they themselves know what contracts they are working on, and any other employees working on these contracts are provided with the proper access rights so that they can use the file breakdown as well. Having records management policies and training in place simplifies everyone's life, as they know how the corporate filing scheme works and the importance of properly filing any type of record within our standard corporate file classification system.
I’m not saying that information classification tools are not great, however, I do find that if you know, as an employee, what projects you are working on and that a file classification scheme was created for this project or projects for you, then I see no problems in your filing documents in the appropriate files if the files are displayed to you for quick access for filing documents accordingly.
Those are my thoughts...thanks for reading me!
Posted by: Albert Carriere | April 23, 2007 at 10:14 AM
One of the best ways for users to support automatic classification is to create more structure (context) in their documents. Unfortunately, the majority of knowledge workers fail to do this because of two related institutional behaviors.
Firstly, we don't train kids in school to use document templates when they first learn how to use word processing tools, spreadsheets, etc. Universities don't correct the behavior, because they assume students have either learned how to use the tools in high school, or can pick up the techniques during writing assignments.
Secondly, most businesses don't invest in the creation of relevant structured templates and in training their staff to use them. The result is that most workers create new information content from the blank word document, or the blank Excel spreadsheet. Next generation schema-based tools like InfoPath have seen very slow adoption, and little or no investment in the creation of relevant schemas.
I think it could take another 30 years to flush the unstructured content generation out of the educational and corporate spheres and replace them with a new generation that understands the value of creating or acquiring schemas before embarking on a lifetime of content creation.
Posted by: Peter Quirk | April 25, 2007 at 10:57 AM
Wonderful point, Peter.
And, as I think about it, there is a new skill that we all need to learn about assigning tags, keywords, metadata and so on to the things we create or touch.
And I know that I'm really bad at that particular skill. I'm probably not alone in that regard.
However, I would offer that we'll see activity on this front long before our learning on this improves and is eventually institutionalized.
First, any sort of corporate categorization scheme will have to be owned by a central authority. Individuals and business units can influence and extend, but the core I would offer will have to be centrally governed.
Second, I think that the pressure to classify unstructured information after-the-fact will be so severe that we'll see rapid adoption and extension of these tools in a relatively short period of time.
Email classification, as an example, is not only widely used, it's become very sophisticated as well. I would think files are next in line.
Good thought, Peter. Thanks for the comment!
Posted by: Chuck Hollis | April 25, 2007 at 11:27 AM