Runaway Storage Requirements
In terms of sheer volume, email is one of the fastest-growing storage consumers in the modern enterprise. It’s also one of the most difficult to rein in. "What has happened over the last five years is that the volume of information [handled by email systems] has increased dramatically," says Jens Rabe. "Growth has outstripped the ability of the messaging platforms to keep pace."
Because email has to be fast, it’s typically designed to use high-quality (meaning expensive) storage. A lot of email is kept on Fibre Channel SANs, using high-performance disks. Archives don’t need that kind of performance, so they can use much cheaper storage: anything from SATA RAID arrays to tape libraries.
In a good archiving system, the archiving is invisible to users, who simply click a message in their email client and get the message back no matter where it’s stored. This is a very good thing, because most users want to keep the contents of their mailboxes around for a long time. Like forever.
The usual response to exploding email storage is to impose quotas on users’ mailboxes. Of course, the users’ logical response to the administrator’s logical response is to archive their messages using the built-in feature in Exchange or Domino. From a records-management standpoint this creates an even worse problem, because those archives are usually created on the users’ local disks. That means the email can’t be managed centrally: administrators have no idea who has what, and users can lose their archived mail if a local disk crashes.
Even if the stores are created on better-protected network drives, these user-created archives are still difficult to manage and don’t make efficient use of storage. For one thing, the email programs offer no real mechanism for administrators to keep track of which archive holds which messages.
One of the things that makes archiving so effective is that email messages are usually highly redundant. That’s redundant in the information-theoretic sense, not in the sense that you get 20 copies of the same thing clogging up your mailbox (although that’s a problem, too). As a first approximation, a good compression algorithm can shrink a message archive by 30% to 50%.
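To get a feel for that redundancy, here is a minimal Python sketch that measures how much space a general-purpose compressor saves on a message. It assumes zlib stands in for whatever algorithm an archiving product actually uses; the function name compression_savings and the sample message are made up for illustration. The sample is deliberately repetitive, so its result says nothing about the 30% to 50% figure for real archives.

```python
import zlib


def compression_savings(data: bytes, level: int = 6) -> float:
    """Return the fraction of space saved by compressing data with zlib."""
    if not data:
        return 0.0
    compressed = zlib.compress(data, level)
    return 1 - len(compressed) / len(data)


# Hypothetical message: a reply that quotes the same text several times,
# as long email threads tend to do.
quoted = "> Please review the attached budget figures before Friday.\r\n"
sample = (
    "From: alice@example.com\r\n"
    "To: bob@example.com\r\n"
    "Subject: Re: Re: Re: Budget figures\r\n"
    "\r\n"
    "Sounds good, see the thread below.\r\n"
    + quoted * 3
).encode("utf-8")

savings = compression_savings(sample)
print(f"Space saved on sample message: {savings:.0%}")
```

Running the same function over a real mailbox export, rather than one made-up message, would give a more meaningful number for a particular system.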
Attachments are a particular source of pain in managing email. A report or spreadsheet can run into multiple megabytes of data and go to hundreds or thousands of people in the company. And if the CEO decides to send the company’s annual report, complete with high-resolution pictures, as a .PDF to every employee in the company, well...
One of the tricks employed by archiving programs is to store a single instance of any message or attachment that goes to more than one person. Typically the programs will break the message apart into the header, body, and attachments and assign a unique identifier to each part of each message by using a hashing algorithm. By comparing identifiers, the software can determine which parts are duplicates and only store them once, with references in all the users’ archives.
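The single-instance idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular archiving product works: the class name SingleInstanceStore, its methods, and the choice of SHA-256 as the hashing algorithm are all hypothetical, and the "attachment" is just a block of placeholder bytes.

```python
import hashlib


class SingleInstanceStore:
    """Toy archive that stores each unique message part only once."""

    def __init__(self):
        self.blobs = {}     # identifier -> bytes; one copy per unique part
        self.archives = {}  # user -> list of messages (lists of identifiers)

    def _put(self, part: bytes) -> str:
        # The hash of the content serves as the part's identifier
        # (SHA-256 is an assumption; the article names no algorithm).
        digest = hashlib.sha256(part).hexdigest()
        self.blobs.setdefault(digest, part)  # keep only the first copy
        return digest

    def archive(self, user, header, body, attachments=()):
        """Split a message into parts and record references for the user."""
        refs = [self._put(part) for part in (header, body, *attachments)]
        self.archives.setdefault(user, []).append(refs)
        return refs

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


# The same body and attachment go to three recipients; only headers differ.
store = SingleInstanceStore()
body = b"Annual report attached."
attachment = bytes(1_000_000)  # stand-in for a large PDF
users = ["alice", "bob", "carol"]
naive_bytes = 0
for user in users:
    header = f"To: {user}@example.com".encode("utf-8")
    store.archive(user, header, body, [attachment])
    naive_bytes += len(header) + len(body) + len(attachment)

print(f"Without single-instancing: {naive_bytes} bytes")
print(f"With single-instancing:    {store.stored_bytes()} bytes")
```

Each user’s archive still lists the full message, but the body and the attachment exist only once in the store. Only the per-recipient headers are stored three times.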
Archiving companies like to talk in terms of a 70% reduction in storage requirements. While the savings in capacity and cost depend on the nature of the individual system, that figure isn’t unreasonable.