Understanding Local File System Functionality
Local file systems are the most visible type of file system. Those most often encountered on Solaris systems are UFS and its relatives, VxFS, QFS/local, and SAMFS. VxFS is a good compromise design that does most things fairly well. UFS and QFS/local, in particular, have almost diametrically opposite design centers. SAMFS is essentially QFS without the separation of data and metadata. QFS or SAMFS with the SAM option stands apart from the other three main local file systems because its primary value lies in hierarchical storage management functions that the others do not offer.
Handling General File System Functionality With UFS
Every Solaris system includes UFS. Because it is the most integrated of the file systems, it has received a lot of development attention over the past few years. While it is definitely old and lacking some features, it is also very suitable for a wide variety of applications. The UFS design center handles typical files found in office and business automation systems. The basic I/O characteristics are huge numbers of small, cachable files, accessed randomly by individual processes; bandwidth demand is low. This profile is common in most workloads, such as software development and network services (for example, in name services, web sites, and ftp sites).
In addition to the basic UFS, there are two variants, logging UFS (LUFS) and the metatrans UFS. All three versions share the same basic code for block allocation, directory management, and data organization. In particular, all current versions of UFS have a nominal maximum file system size of 1 terabyte (the limit will be raised to 16 terabytes in the Solaris 10 OS). Obviously, a single file stored in any of them must fit inside a file system, so the maximum file size is slightly smaller, about 1009 gigabytes out of a 1024 gigabyte file system. There is no practical limit to the number of file systems that can be built on a single system; systems have been run with over 2880 UFS file systems.
Benefits of Logging
The major differences between the three UFS variants are in how they handle metadata. Metadata is information that the file system stores about the data, such as the name of the file, ownership and access rights, last modified date, file size, and other similar details. Other, less obvious, but possibly more important metadata describe where the data resides on the disk: the locations of the data blocks and of the indirect blocks that indicate where those data blocks reside.
Getting this metadata wrong would not only mean that the affected file might be lost, but could lead to serious file system-wide problems or even a system crash in the event that live data found itself in the free space list, or worse, that free blocks somehow appeared in the middle of a file. UFS takes the simplest approach to assuring metadata integrity: it writes metadata synchronously and requires an extensive fsck on recovery from a system crash. The time and expense of the fsck operation are proportional to the number of files in the file system being checked. Large file systems with millions of small files can take tens of hours to check. Logging file systems were developed to avoid both the ongoing performance issues associated with synchronous writes and the excessive time for recovery.
Logging uses the two-phase commit technique to ensure that metadata updates are either fully updated on disk, or that they will be fully updated on disk upon crash recovery. Logging implementations store pending metadata in a reserved area, and then update the master file system based on the content of the reserved area, or log. In the event of a crash, metadata integrity is assured by inspecting the log and applying any pending metadata updates to the master file system before accepting any new I/O operations from applications. The size of the log depends on the amount of changing metadata, not on the size of the file system. Because the amount of pending metadata is quite small, usually on the order of a few hundred kilobytes for typical file systems and several tens of megabytes for very busy file systems, replaying the log against the master is a very fast operation. Once metadata integrity is guaranteed, the fsck operation becomes a null operation and crash recovery becomes trivial. Note that for performance reasons, only metadata is logged; user data is not logged.
The metatrans implementation was the first version of UFS to implement logging. It is built into Solstice DiskSuite and Solaris Volume Manager software (the name of the product depends on the version of the code, but otherwise they are the same). When the metatrans implementation was integrated into the Solaris 7 OS foundation, the two versions became the same. The only difference visible to users is that the integrated version stores the log internally rather than on a separate device. Although one would expect performance to be better with separate devices, this did not prove to be the case due to the typical access patterns to UFS files, so the extra administrative overhead of a separate log device was removed.
The Solaris Volume Manager software version was withdrawn when that product was integrated into the Solaris 8 OS. It is recommended only for very old releases (the Solaris 2.5.1 and Solaris 2.6 OSs) in which logging UFS (LUFS) is not available. Although LUFS has been integrated since the Solaris 7 OS, logging is not enabled by default. This is due to performance degradation that typically appears only at artificially high load levels; almost no such cases have been seen in the field.
As of the Solaris 10 OS, logging is enabled by default. In practice, Sun recommends using logging any time that fast crash recovery is required, with releases as early as the Solaris 8 OS first customer shipment (FCS). This is particularly true of root file systems, which do not sustain enough I/O to trip even the rather obscure performance bugs found in the Solaris 7 OS.
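As a concrete illustration, logging is controlled with the logging and nologging options of mount_ufs(1M), either on the command line or in the mount options field of /etc/vfstab. The device and mount point names below are placeholders.

    # mount -o logging /dev/dsk/c0t0d0s6 /export/home
    # mount -o remount,logging /export/home

    /dev/dsk/c0t0d0s6  /dev/rdsk/c0t0d0s6  /export/home  ufs  2  yes  logging

The remount form enables logging on an already mounted file system without taking it offline.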
Performance Impact of Logging
One of the most confusing issues associated with logging file systems (and particularly with logging UFS, for some reason) is the effect that the log has on performance. First, and most importantly, logging has absolutely no impact on user data operations; this is because only metadata operations are logged.
The performance of metadata operations is another story, and it is not as easy to describe. The log works by writing pending changes to the log, then actually applying the changes to the master file system. When the master is safely updated, the log entry is marked as committed, meaning that it does not need to be reapplied to the master in the event of a crash. This algorithm means that metadata changes, which occur primarily when files are created or deleted, might actually require twice as many physical I/O operations as in a non-logging implementation. The net impact of this aspect of logging is that more I/O operations go to storage. Typically, this has no real impact on overall performance, but when the underlying storage is already nearly 100 percent busy, the extra operations associated with logging can tip the balance and produce significantly lower file system throughput. (In this case, throughput is not measured in megabytes per second, but rather in file creations and deletions per second.) If the utilization of the underlying storage is less than approximately 90 percent, the logging overhead is inconsequential.
On the positive side of the ledger, the most common impact on performance has to do with the cancellation of some physical metadata operations. These cases occur only when metadata updates are issued very rapidly, such as when doing a tar(1) extract operation or when removing the entire contents of a directory ("rm -f *"). Without logging, the system is required to force the directory to disk after every file is processed (this is the definition of the phrase "writing metadata synchronously"); the effect is to write 512 or 2048 bytes every time 14 bytes is changed. When the file system is logging, the log record is pushed to disk only when it fills, often when the 512-byte block is completed. This results in a reduction in physical I/O of roughly 512/14, or about 36 times, and obvious performance improvements result.
The following table illustrates these results. The times are given in seconds, and lower scores are better. Times are the average of five runs, and are intended to show relative differences rather than the fastest possible absolute time. These tests were run on Solaris 8 7/01 using a single disk drive.
TABLE 1 Analyzing the Impact of UFS Logging on Performance
Test               No Logging (seconds)   Logging (seconds)   Delta
tar extract        127                    21                  505%
rm -rf *           76                     2                   3700%
Create 1 GB file   35                     34                  2.94%
Read 1 GB file     34                     34                  0.00%
The tar test consists of extracting 7092 files from a 175 megabyte archive (the contents of /usr/openwin). Although a significant amount of data is moved, this test is dominated by metadata updates for creating the files. Logging is five times faster. The rm test removes the 7092 extracted files. It is also dominated by metadata updates and is an astonishing 37 times faster than the non-logging case.
On the other hand, the dd write test creates a single 1 gigabyte file in the file system, and the difference between logging and non-logging is a measurable, but insignificant, three percent. Reading the created file from the file system shows no performance impact from logging. Both tests use large block sizes (1 megabyte per I/O) to optimize throughput of the underlying storage.
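These results can be approximated with ordinary commands. The sketch below assumes a scratch UFS file system mounted at /bench and a tar archive of /usr/openwin at /var/tmp/openwin.tar; both paths are placeholders, and this is only an outline of the kind of measurement involved, not the exact benchmark used above.

    # mount -o remount,nologging /bench
    # cd /bench
    # /usr/bin/time tar xf /var/tmp/openwin.tar
    # /usr/bin/time rm -rf *
    # /usr/bin/time dd if=/dev/zero of=bigfile bs=1024k count=1024
    # /usr/bin/time dd if=bigfile of=/dev/null bs=1024k

Running the same sequence again after a mount -o remount,logging /bench gives the logging column. The dd runs use 1 megabyte I/Os, matching the large-block transfers described above, and a remount between the create and read steps keeps the read from being satisfied out of the page cache.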
UFS Direct I/O
Another feature present in most of the local file systems is the use of direct I/O. UFS, VxFS, and QFS all have forms of this feature, which is primarily intended to avoid the overhead associated with managing cache buffers for large I/O. At first glance, it might seem that caching is a good thing and that it would improve I/O performance.
There is a great deal of reality underlying these expectations. All of the local file systems perform buffer caching by default. The expected improvements occur for typical workloads that are dominated by metadata manipulation and data sets that are very small when compared to main memory sizes. Metadata, in particular, is very small, amounting to less than one kilobyte per file in most UFS applications, and only slightly more in other file systems. Typical user data sets are also quite small; they average about 70 kilobytes. Even the larger files used in everyday work, such as presentations created using StarOffice software, JPEG images, and audio clips, are generally less than 2 megabytes. Compared to typical main memory sizes of 256-2048 megabytes, it is reasonable to expect that these data sets and their attributes can be cached for substantial periods of time. They are reasonably likely to still be in memory when they are accessed again, even if that access comes an hour later.
The situation is quite different with bulk data. Systems that process bulk data tend to have larger memories, up to perhaps 16 gigabytes (that is, 8-64 times larger than typical), but the data sets in these application spaces often exceed 1 gigabyte and sometimes range into the tens or even hundreds of gigabytes. Even if a file literally fits into memory and could theoretically be cached, these data sets are substantially larger than the memory that is consistently available for I/O caching. As a result, the likelihood that the data will still be in cache when it is referenced again is quite low. In practice, cache reuse in these environments is nil.
Direct I/O Performance
Caching data anyway would be fine, except that the process requires effort on the part of the OS and processors. For small files, this overhead is insignificant. However, the overhead becomes not only significant, but excessive when "tidal waves" of data flow through the system. When reading 1 gigabyte of data from a disk in large blocks, throughput is similar for both direct and buffered cases; the buffered case delivers 13 percent greater throughput. The big difference between the two cases is that the buffered process consumes five times as much CPU effort. Because there is so little practical value to caching large data sets, Sun recommends using the forcedirectio option on file systems that operate on large files. In this context, large generally means more than about 15-20 megabytes. Note that the direct I/O recommendation is especially true when the server in question is exporting large files through NFS.
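For example, a UFS file system that holds large files can be mounted with the forcedirectio option, either on the command line or through /etc/vfstab; the noforcedirectio option restores buffered behavior. The device and mount point names are placeholders.

    # mount -o forcedirectio /dev/dsk/c2t1d0s0 /bulkdata
    # mount -o remount,noforcedirectio /bulkdata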
If direct I/O is so much more efficient, why not use it all the time? Direct I/O means that caching is disabled. The impact of standard caching becomes obvious when using a UFS file system in direct I/O mode while doing small file operations. The same tar extraction benchmark used in the logging section above takes over 51 minutes, even with logging enabled, roughly 24 times as long as when using regular caching (2:08)! The benchmark results are summarized in the following table.
TABLE 2 Analyzing the Performance of Direct I/O and Buffered I/O
Test               Direct I/O Throughput (seconds)   CPU %   Buffered I/O Throughput (seconds)   CPU %
Create 1 GB file   36                                5.0%    31                                   25.0%
Read 1 GB file     30                                0.0%    22                                   22.0%
tar extract        3062                              0.0%    128                                  6.0%
rm -rf *           76                                1.2%    65                                   1.0%
In this table, throughput is represented by elapsed times in seconds, and smaller numbers are better. The system in question is running Solaris 9 FCS on a 750-megahertz processor. The tests are disk-bound on a single 10K RPM Fibre Channel disk drive. The differences in throughput are mainly attributable to how the file system makes use of the capabilities of the underlying hardware.
Supercaching and 32-Bit Binaries
A discussion of buffered and direct I/O methodology is incomplete without addressing one particular attribute of the cached I/O strategy. Because file systems are part of the operating system, they can access the entire capability of the hardware. Of particular relevance is that file systems are able to address all of the physical memory, which now regularly exceeds the ability of 32-bit addressing. As a result, the file system is able to function as a kind of memory management unit (MMU) that permits applications that are strictly 32-bit aware to make direct use of physical memories that are far larger than their address pointers.
This technique, known as supercaching, can be particularly useful to provide extended caching for applications that are not 64-bit aware. The best examples of this are the open-source databases MySQL and Postgres. Both of these are compiled in 32-bit mode, leaving their direct addressing capabilities limited to 4 gigabytes. However, when their data tables are hosted on a file system operating in buffered mode, they benefit from cached I/O. This is not as efficient as simply using a 64-bit pointer, because the application must issue I/O system calls instead of merely dereferencing a 64-bit pointer, but the advantages gained by avoiding physical I/O outweigh these considerations by a wide margin.
Handling Very Large Data Sets With QFS/Local
Whereas UFS was designed as a general-purpose file system to handle the prosaic needs of typical users, QFS originates from a completely different design center. QFS is designed with bulk data sets in mind, especially those with high bandwidth requirements. These files are typically accessed sequentially at very high bandwidth. As previously noted, attempts to cache such files are usually futile.
The key design features of QFS/local are the ability to span disk devices, to separate data and metadata, and to explicitly handle underlying data devices and associated explicit user policies for assigning data locations to specific devices. Taken together, the net effect of these features is to create a file system that can handle massive data sets and provide mechanisms for accessing them at very high speeds.
QFS Direct I/O
As one might expect from such design criteria, QFS also offers direct I/O capabilities, and for the same reasons that UFS has them. QFS's massive I/O throughput capability puts an even greater premium on eliminating overhead than UFS does. The major difference between the QFS and UFS direct I/O implementations is that QFS provides a per-file attribute, permitting administrators finer control over which files are accessed without the cache. This attribute can be set with the setfa(1M) command. As with UFS, direct I/O can be selected for an entire file system using a mount option, or for an individual file using a call to ioctl(2).
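A brief sketch of the file system-wide and per-file forms follows. The mount point and file names are placeholders, and the -D flag shown for setfa(1M) is my reading of the per-file direct I/O attribute; confirm the exact option letters against the man page for the release in use.

    # mount -F samfs -o forcedirectio qfs1 /qfs1
    # setfa -D /qfs1/frames/scene042.dpx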
QFS Volume Management
Whereas UFS and VxFS are hosted on a single disk resource, QFS can be hosted on multiple disk resources. In this context, a disk resource is anything that presents the appearance of a single virtual disk. This could be a LUN exported out of a disk array, a slice of a directly attached disk drive, or some synthesis of one or more of these created by a volume manager such as VxVM or Solaris Volume Manager software. The main point is that virtual disk resources have limits. In particular, a file system hosted on a single disk resource is necessarily limited to the size of that resource, and is therefore subject to the 1 terabyte maximum size of a virtual disk.
QFS essentially includes a volume manager in its inner core. A QFS file system is hosted on top of disk groups. (Do not confuse these with the completely unrelated VxVM concept of the same name.) A QFS disk group is a collection of disk resources that QFS binds together internally.
There are two types of disk group: round-robin and striped. The striped disk group is effectively the same thing as a RAID-0 volume built from the underlying disk resources. Blocks are logically dispersed across each of the constituent disk resources according to a RAID-0 organization. One might use this configuration to maximize the available I/O bandwidth from a given underlying storage configuration.
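The disk resources that make up a QFS file system are declared in the master configuration file, /etc/opt/SUNWsamfs/mcf. The sketch below shows an 'ma' file system with a dedicated metadata device (mm) and a two-member striped group (g0); the device names, equipment ordinals, and family set name are placeholders.

    #
    # Equipment           Eq   Eq    Family  Device
    # Identifier          Ord  Type  Set     State
    #
    qfs1                  10   ma    qfs1    on
    /dev/dsk/c1t0d0s0     11   mm    qfs1    on
    /dev/dsk/c2t0d0s0     12   g0    qfs1    on
    /dev/dsk/c3t0d0s0     13   g0    qfs1    on

The file system would then be built and mounted with sammkfs(1M) and mount(1M), for example, sammkfs qfs1 followed by mount -F samfs qfs1 /qfs1.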
Disk Organizations
The round-robin disk group is one of the most interesting features in QFS. Like striped disk groups, round-robin disk groups permit a file system to span disk resources. The difference is that the round-robin disk group is explicitly known to the block allocation procedures in QFS. More specifically, all blocks for a given file are kept in a single disk resource within the disk group. The next file is allocated out of another disk resource, and so on. This has the property of segregating access to the data set to a specific set of disk resources. For typical high-bandwidth, relatively low-user-count applications, this is a major advance over the more common striping mechanisms because it allows bandwidth and disk resource attention to be devoted to servicing access to fewer files. In contrast, striped groups provide greater overall bandwidth, but they also require that every device participate in access to all files.
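In mcf terms, and again with placeholder names, a round-robin organization is expressed by listing individual data devices (here 'mr' devices) instead of the members of a striped group; as I recall, the stripe=0 mount option then selects round-robin allocation across those devices.

    qfs2                  20   ma    qfs2    on
    /dev/dsk/c4t0d0s0     21   mm    qfs2    on
    /dev/dsk/c5t0d0s0     22   mr    qfs2    on
    /dev/dsk/c6t0d0s0     23   mr    qfs2    on

    # mount -F samfs -o stripe=0 qfs2 /qfs2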
To see how round-robin groups can improve performance over striped groups, consider a QFS file system containing two large files and built on two devices. If two processes each access one file, the access pattern to the underlying disks is very different, depending on whether the file system uses striped or round-robin groups.
In the round-robin case, each file is stored on a single device. Therefore, sequential access to each file results in sequential access to the underlying device, with no seeks between disk accesses. If the file system uses striped groups, each file resides on both devices; therefore, each of the two processes sends accesses to both drives, and the access is interleaved on the disk. As a result, virtually every I/O requires a seek to get from one file to the other, dropping the throughput of the underlying disks by a factor of 5-20 times. The following table illustrates these results. Results are given in megabytes per second, and larger numbers are better.
TABLE 3 Analyzing File Access in Round-Robin and Striped Groups
I/O Size (KB)   Segregated (MB/sec)   Interleaved (MB/sec)   Ratio
8               37.0                  3.2                    11.7
32              66.7                  3.4                    19.9
64              71.4                  5.0                    14.4
128             71.4                  8.1                    8.9
512             71.4                  13.2                   5.4
1024            71.4                  14.3                   5.0
In each case, two I/O threads each issue sequential requests to the disks using the listed I/O size. In the segregated case, each thread has a dedicated disk drive; in the interleaved case, the data is striped across both drives, so both threads issue requests to both drives.
The test system uses two disk drives, each delivering about 36 megabytes per second. The segregated case goes about as fast as theory suggests, while the seeks required in the interleaved case result in throughput that is far lower than users would expect.
Segregation of User and Metadata
The principle of segregating I/Os to reduce contention for physical disk resources is why QFS has options for placing metadata on different disk resources than user data. Moving the seek-laden metadata access to dedicated disk resources permits most of the disk resources to concentrate on transferring user data at maximum speed.
Note that even though QFS includes volume manager-like functionality, it is still possible and useful to use other disk aggregation technologies such as volume managers and RAID disk arrays. For example, it is often useful to take the RAID-5 implementations inside disk arrays and aggregate them together to form disk groups, either directly or indirectly through a volume manager. Of course, the complexity of the solution increases with each added step, but sometimes this complexity is appropriate when other goals (performance, reliability, and the like) cannot be met in other ways.
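A hedged sketch of such a configuration follows: the metadata (mm) device is a small dedicated disk slice, while the data (mr) devices are RAID-5 LUNs exported by an array. All device names, ordinals, and the family set name are placeholders.

    samfs3                30   ma    samfs3   on
    /dev/dsk/c7t0d0s0     31   mm    samfs3   on
    /dev/dsk/c8t0d0s0     32   mr    samfs3   on
    /dev/dsk/c8t1d0s0     33   mr    samfs3   on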
One of the obvious consequences of QFS including a volume manager is that it can accommodate file systems that are far larger than a single volume or LUN. The theoretical limit of a single QFS on-disk file system is in excess of 250 terabytes, although no single on-disk file system of such size is currently deployed. When aggregate data sizes get this large, it is far more cost effective to deal with them in a mixed online and nearline environment made possible by hierarchical storage management software such as SAM. Hierarchical storage management (HSM) systems enable the effective management of single file systems extending into multiple petabytes.
Understanding the Differences Between QFS/Local and SAMFS
SAMFS is something of a confusing term. Strictly speaking, it refers to a local file system that is strongly related to QFS/local. More specifically, it is functionally the same as QFS/local, except that it does not offer the ability to place user data and metadata on separate devices. In my opinion, there is so little difference between QFS/local and SAMFS that they can be treated together as QFS/local.
Managing Storage Efficiency With SAM
Unfortunately, the related facility SAM is often confusingly referred to as SAMFS. SAM really refers to storage, archive, and migration; it is a hierarchical storage management facility. SAM is built into the same executables as QFS and SAMFS, but is separately licensed.
The combination of QFS+SAM is, in many ways, almost a different file system. SAM is a tool for minimizing the cost of the storage that supports data. Its primary goal is to find data that is not being productively stored and to migrate it transparently to lower-cost storage. This can result in rather surprising savings.
A number of studies over the years have shown that a huge proportion of data is actually dead, in the sense that it will not be referenced again. Studies have shown that daily access to data is often as little as one percent, and that data not accessed in three days has a 55-85 percent probability of being completely dead in some studied systems. Clearly, moving unused or low-usage data to the least expensive storage helps reserve expensive, high-performance storage for more important, live data.
One of SAM's daemons periodically scans the file system searching for migration candidates. When suitable candidates are found, they are copied to the destination location and the directory entry is updated to reflect the new location. File systems maintain location information about where each file's data is stored. In traditional on-disk file systems such as UFS, the location is restricted to the host disk resource. SAM's host file systems (QFS and SAMFS) extend this notion to include other locations, such as an offset and file number on a tape with a particular serial number. Because mapping the file name to data location is handled inside the file system, the fact that the data has been migrated is completely invisible to the application, with the possible exception of how long it might take to retrieve data.
Partial Migration
To augment these traditional HSM techniques, SAM offers options for migration, including movement of partial data sets and disk-to-disk migration. Partial migration means that the file system has the notion of segmentation; each file is transparently divided into segments that can be staged or destaged independently. In effect, this feature provides each segment with its own location data and last-modified times. The feature is especially useful for very large data sets because applications might not actually reference all of a large data set. In this case, restaging the entire set is both slow and wasteful.
For example, the use of file(1) on an unsegmented file illustrates the extent to which transfer times can be affected by segmentation. In this case, the program reads the first few bytes of a file and compares them to a set of known signatures in an attempt to identify the contents of the file. If file(1) is applied to a stale, unsegmented 10-terabyte file, the entire 10 terabytes must be staged from tape! However, such large files are normally segmented, and only the referenced segment is restaged. Because segments can be as small as 1 megabyte, this represents a substantial savings in data transfer time.
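Segmentation is a per-file attribute in SAM-QFS, set with the segment(1) command. The sketch below assumes a 1 megabyte segment size and a placeholder path; the exact option syntax should be confirmed against the man page for the release in use.

    # segment -l 1m /sam1/projects/big-dataset.dat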
When SAM destages data sets, it usually writes them to a lower-cost, nearline medium such as tape or, in some cases, optical disk. However, these media are generally quite slow compared to disks, especially for recovery scenarios. Tapes are inherently serial access devices, and even tapes with rapid-seek capability, such as the STK 9940B, locate data far more slowly than disks. Tapes take tens of seconds to locate a specific byte in a data set, compared to tens of milliseconds on a disk, a disparity of roughly three orders of magnitude. Traditional transports such as the DLT-8000 can take even longer (several minutes). Note also that locating a cartridge and physically mounting a tape takes a significant amount of time as well, especially if human intervention is required to reload a tape into a library, or if all transports are busy.
Cached Archiving on Disk
In many cases, it might be preferable to archive recently used data on lower-cost disks, rather than on relatively inconvenient tape media. For example, archiving to disk might be preferable to avoid giving users the impression that they are losing access to their data. In these circumstances, SAM can be directed to place the archive files on disk. The destination can be any file system available to the archiver process, including QFS, UFS, PxFS, or even NFS.
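In outline, directing the archiver at disk involves defining a disk volume in /etc/opt/SUNWsamfs/diskvols.conf and then referencing it from an archive set copy in /etc/opt/SUNWsamfs/archiver.cmd. The following is only a sketch with placeholder volume, path, and archive set names; consult diskvols.conf(4) and archiver.cmd(4) for the authoritative syntax.

    disk01    /archive/diskvols/disk01

    fs = samfs1
    userdata .
        1 4h
        2 24h
    vsns
    userdata.1 dk disk01
    userdata.2 li VOL0[0-9]
    endvsns

In this sketch, copy 1 of the 'userdata' archive set goes to the disk volume and copy 2 goes to tape, which also anticipates the multiple-copy behavior described below.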
The archive files placed on the destination disk have the same format as tape archives. This means that small files are packed into large archive files for placement on disk. This is significantly more efficient than simply copying the files because there is no fragmentation within the archive file. Therefore, space is used more efficiently, and the file system itself only needs to manage the name of the archive file, avoiding any hint of problems with directory sizes. Furthermore, access to the archives is a serial I/O workload, which is much more efficient than the random access I/O that typically results when the files are copied wholesale (for example, with cp -pr /from/* /to or similar).
One of the reasons that on-disk archives have the same format as tape archives is that one of SAM's key features is the ability to save multiple copies of a single data set. Multiple copies might be saved to different tapes, or one might be saved on disk, two on tape, and one on optical media. This makes it possible to keep a cached copy of archived data sets on disk while still retaining them in safe off-site storage.
One of the advantages of retaining a cached copy on disk is that the disk underlying that copy does not need to be expensive or particularly reliable. Even if the disks were to fail, the data can be recovered transparently from the tape copies.
Backup Considerations
One unexpected benefit from the use of an on-disk archive is a large improvement in backup speed, with the possible consequence of dramatically shortening backup windows. Such improvements occur only in environments that are dominated by small files. The problem is that backup processes must read small files from the disk in order to copy them to tape. Small file I/O is dominated by random access (seek) time and small physical I/Os. These two factors combine to reduce the effective bandwidth of a disk by approximately 95-97 percent. In environments primarily consisting of small files, this is sometimes enough to extend backup windows unacceptably when the backup is written directly to tape. This is because the backup data is being supplied to the tape at rates far lower than the rated speed of the tape transports. During this time, the tape is opened for exclusive use and is unavailable for any other purpose.
An on-disk archive can be constructed at leisure because the process does not require exclusive access to a physical device. The on-disk archives have a different form; specifically, many small files are coalesced into large archive files. When the on-disk archive is complete, the resulting large archive files can be copied to tape at streaming media speeds. Because the tape transports can easily be driven to their full rated speed without fear of backhitching, they are used efficiently both in time and in capacity. Backhitches, a characteristic of streaming tape transports such as DLT, DAT, and some other technologies, can sometimes cause drives to disable compression, because the higher effective incoming data rate of uncompressed data might be enough to keep the transport streaming. In these cases, the on-disk archive even improves tape density.