- Understanding What a File System Is
- Understanding File System Taxonomy
- Understanding Local File System Functionality
- Understanding Differences Between Types of Shared File Systems
- Understanding How Applications Interact With Different Types of File Systems
- Conclusions
- About the Author
Understanding How Applications Interact With Different Types of File Systems
Although there is a wide variety of ways to categorize applications, one of the more straightforward methods is to consider the size of the data sets involved. This means both the size of individual data sets and the size of the logical groups of data sets; these correspond to the size of the files of significance and to the file systems that will store them together. As we saw in the descriptions of each file system, one of the key engineering design tradeoffs made in the development of each file system is the way it handles scalability in the dimensions of size and number of clients.
In addition to the size of files, there are several other broad application categories that include the most common applications found in the Solaris environment. Database systems such as Oracle, DB2, MySQL, and Postgres contain a large proportion of all data on these servers. The applications that use the databases can be divided into two qualitatively different categories: online transaction processing (OLTP) and decision support (DSS) systems.
OLTP work is primarily concerned with recording and utilizing the transactions that represent the operation of a business. In particular, the dominant operations focus on individual transactions, which are relatively small. Decision support or business intelligence applications are concerned with identifying trends and patterns in transactions captured by OLTP systems. The key distinction is that DSS systems process groups of transactions, while OLTP systems deal with individual transactions.
For the most part, OLTP systems experience activity that is similar to small-file work, while DSS systems experience activity that is similar to large-file workloads. High performance technical computing (HPTC) is a category of its own, independent of the others. This group typically makes use of the largest files and usually also creates the highest demand on I/O bandwidth. The files that support various Internet service daemons, such as ftp servers, web servers, LDAP directories, and the like, typically fall into the small-file category (even ftp servers). Occasionally, concentrated access will create higher than usual I/O rates to these files, but usually they are just more small files to a file system. One reason that ftp servers fall into this category, even if they are actually providing large files (for example, the 150-650 megabyte files associated with large software distributions such as StarOffice, Linux, or Solaris), is that the ftp server's outbound bandwidth constrains the I/O request rate of each individual stream, causing physical I/O to be strongly interleaved. The result is activity that has the same properties as small-file activity rather than the patterns that are more commonly associated with files of this size.
The activities and I/O characteristics associated with these primary workload groups can be used to assign file systems to applications in a fairly straightforward way.
Selecting a Shared File System
Deciding which shared file system to use is fairly straightforward because there is relatively little overlap between the main products in this space. The three primary considerations are as follows:
- The level of cooperation between the clients and server (for example, are they members of a cluster or not)
- The size of the most important data sets
- The sensitivity of the data
PxFS is the primary solution for sharing data within a cluster. Although NFS works within a cluster, it has somewhat higher overhead than PxFS while not providing features that would be useful within a cluster.
QFS/SW is being evaluated in this application today, since the security model is common between Sun Cluster software and QFS/SW. Furthermore, Sun Cluster software and QFS/SW usually use the same interconnectivity model when the nodes are geographically collocated. Early indications suggest that QFS/SW might be able to serve a valuable role in this application with lower overhead, but such configurations are not currently supported because the evaluation is not complete.
For clients that are not clustered together, access to small files should use NFS unless the server that owns the files happens to run Windows. In this context, "small" means files that are less than a few megabytes or so.
The most common examples of small-file access are typical office automation home directories, software distribution servers (such as /usr/dist within Sun), software development build or archive servers, and support of typical Internet services such as web pages, mail, and even typical ftp sites. This last category usually occurs when multiple systems are used to provide service to a large population using a single set of shared data. The clearest example is the use of 10-20 web servers in a load-balanced configuration, all sharing the same highly available and possibly writable data.
CIFS can also be used in this same space, but the lack of an efficient kernel implementation in Solaris, combined with the lack of any CIFS client package, means that CIFS will be used almost exclusively by Windows and Linux systems. Because server hardware and especially file system implementations scale to thousands of clients (NFS) or at least hundreds of clients (CIFS) per server, they are clearly preferred when processing is diverse and divided among many clients.
The other main application space is that of bulk data: applications in which the most interesting data sets are larger than a few megabytes and often vastly larger. Applications such as satellite download, oil and gas exploration, movie manipulation, and other scientific computations often manipulate files that are tens of gigabytes in size. The leading edge of these applications is approaching 100 terabytes per data set. Decision support or business intelligence systems have data set profiles that would place them in this category, but these systems do not typically use shared data.
Because of the efficiencies of handling bulk data through direct storage access, QFS/SW is the primary choice in this space. The nature of bulk data applications generally keeps data storage physically near the processing systems, and these configurations typically enjoy generous bandwidth. However, NFS is required in the uncommon case where Fibre Channel is not the storage interconnection medium or where disk interconnect bandwidth is quite limited. For example, if the distance between processor and storage is more than a few kilometers, NFS is the most suitable alternative.
A consideration that is receiving attention is security. Highly sensitive data might require substantial security measures that are not available in QFS/SW. Data privacy is typically addressed by encryption, and neither QFS/SW nor Fibre Channel has any effective encryption capability. This is hardly surprising given the performance goals associated with each of these technologies. In contrast, NFS can be configured with Kerberos-v5 for more secure authentication and/or privacy. Security can also be addressed at the transport level: the NFS transport is IP, which can optionally be configured with IPsec link-level encryption. No corresponding capability is available in Fibre Channel.
In addition to these implementation differences, there is a subtle architectural difference between QFS/SW and NFS. QFS/SW clients trust both the storage and the other clients. NFS clients only need to trust the server as long as the server is using a secure authentication mechanism.
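Taken together, the considerations above reduce to a short decision sketch. The C fragment below is purely illustrative: the function name, the few-megabyte threshold, and the inputs are assumptions drawn from the preceding discussion, not an official selection tool.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative sketch only: encodes the shared file system guidance
 * discussed above. Names and thresholds are assumptions for clarity.
 */
static const char *
pick_shared_fs(bool in_cluster, double typical_file_mb,
               bool fibre_channel_to_storage, bool sensitive_data)
{
    if (in_cluster)
        return "PxFS";                      /* sharing within a cluster */

    if (sensitive_data)
        return "NFS (Kerberos-v5, IPsec)";  /* QFS/SW and FC lack encryption */

    if (typical_file_mb <= 4.0)             /* "a few megabytes or so" */
        return "NFS (or CIFS for Windows clients)";

    if (fibre_channel_to_storage)
        return "QFS/SW";                    /* bulk data over direct storage access */

    return "NFS";                           /* no FC path or limited disk bandwidth */
}

int
main(void)
{
    /* Unclustered clients, half-gigabyte files, FC-attached storage. */
    printf("%s\n", pick_shared_fs(false, 512.0, true, false));
    return 0;
}
```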
Selecting a Local File System
As with shared file systems, deciding which local file system to apply to given applications centers on the size of the interesting files. The local file systems have the additional consideration of how large the file system is, and what proportion of it might be dead or inactive.
When the size of the important files is relatively small, any of the local file systems will handle the job. For example, files in most office automation home directories average around 70 kilobytes today, despite an interesting proportion of image files, MP3 files, massive StarOffice presentations, and archived mail folders.
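A quick way to test whether a data set fits this small-file profile is simply to measure it. The sketch below, assuming a hypothetical directory such as /export/home, walks a tree with nftw() and reports the average regular-file size.

```c
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Rough sketch: walk a directory tree and report the average regular-file
 * size, a quick check of whether a data set fits the small-file profile
 * described above. The default path is only an example.
 */
static long long total_bytes;
static long long file_count;

static int
visit(const char *path, const struct stat *sb, int type, struct FTW *ftw)
{
    (void)path; (void)ftw;
    if (type == FTW_F) {
        total_bytes += sb->st_size;
        file_count++;
    }
    return 0;
}

int
main(int argc, char **argv)
{
    const char *root = (argc > 1) ? argv[1] : "/export/home";  /* example path */

    if (nftw(root, visit, 32, FTW_PHYS) != 0) {
        perror("nftw");
        return 1;
    }
    if (file_count > 0)
        printf("%lld files, average %lld bytes\n",
               file_count, total_bytes / file_count);
    return 0;
}
```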
How those files will be accessed is not a discriminator. The files might be used on the local server (such as on a time-sharing system) or exported through a file-sharing protocol for use on clients elsewhere in the network. Because logging UFS is a part of the Solaris OS, it is usually recommended for these applications, especially on the Solaris 9 OS and later, which add recent improvements such as hashed DNLC handling, better logging performance, and optional suppression of access time updates.
In the small-file arena, Sun recommends SAM functionality (for example, either QFS or SAMFS with the archiving features enabled) over logging UFS when the size of the file systems becomes truly massive or when the proportion of dead or stale files represents an interesting overhead cost. Although it might seem counterintuitive that small files could result in massive file systems, several fairly common scenarios result in such installations. One is sensor data capture, in which a large number of sensors each report a small amount of data. Another is server consolidation as applied to file servers. This particular application has been implemented within Sun's own infrastructure systems and has resulted in many very large file systems full of small, stale files.
One major exception to the rule of small files is the use of database tables in a file system. Because databases carefully manage the data within their tablespaces, the files visible to the file system have some special characteristics that make them particularly easy for relatively simple file systems to manage. The files tend to be constantly "used" in that the database always opens all of them, and the files are essentially never extended. Finally, the databases have their own caching and consistency mechanisms, so access is almost always quite careful and deliberate.
Operating under these assumptions, UFS is easily able to handle database tables used for OLTP systems. At one time (Solaris 2.3, 40 megahertz SuperSPARC®), the performance difference between raw disk devices and UFS tablespaces was quite substantial: performance was about 40 percent lower on file systems. Since that time, constant evolution of the virtual memory system, file system, and microprocessors has narrowed this gap to about 5 percent, an amount that is well worth the gain in ease of administration.
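The "careful and deliberate" access just described can be sketched as block-aligned reads and writes against a pre-allocated, always-open tablespace file. The path and the 8-kilobyte page size below are illustrative assumptions, not any particular database's implementation.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch of the access pattern described above: the database keeps its
 * tablespace open, never extends it, and reads or writes fixed-size,
 * block-aligned pages at deliberate offsets.
 */
#define DB_BLOCK_SIZE 8192   /* assumed database page size */

int
main(void)
{
    const char *tablespace = "/u01/oradata/example01.dbf";  /* example path */
    char page[DB_BLOCK_SIZE];

    int fd = open(tablespace, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Read page 1000: always a whole block at a block-aligned offset. */
    off_t offset = (off_t)1000 * DB_BLOCK_SIZE;
    if (pread(fd, page, sizeof(page), offset) != (ssize_t)sizeof(page)) {
        perror("pread");
        close(fd);
        return 1;
    }

    /* ... the database modifies the page in its own cache ... */

    /* Write it back to exactly the same block-aligned offset. */
    if (pwrite(fd, page, sizeof(page), offset) != (ssize_t)sizeof(page)) {
        perror("pwrite");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}
```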
Because the databases access many parts of the tablespaces, placing database tables on SAM file systems is rarely worthwhile. The access patterns result in files that are fully staged almost all the time; there is little or nothing to be gained for the complexity.
For a variety of reasons, file systems that contain large files not associated with database tables are best handled with QFS. Probably the most important consideration is linear (streaming) performance because QFS's storage organization is specifically designed to accommodate the needs of large data sets. The performance characteristics of the QFS storage organization also mean that databases hosted on file systems that are primarily used for decision-support applications will perform best on QFS rather than on UFS.
As with small files, the intended use of the file is not particularly material. QFS is usually the most appropriate file system for large, data-intensive applications, whether the local system is doing the processing, as in a high performance computing (HPC) environment, or exporting the data to some other client for processing. In particular, data-intensive NFS applications (such as geological exploration systems) should use QFS.
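The streaming behavior that favors QFS can be illustrated with a simple sequential-read loop that issues large requests, so that throughput is bounded by storage bandwidth rather than per-request overhead. The file path and the 4-megabyte buffer size below are assumptions for the sketch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch of a streaming (linear) workload: large, sequential reads of a
 * bulk data file, the pattern QFS's storage organization is designed for.
 */
#define STREAM_BUF_SIZE (4 * 1024 * 1024)   /* 4 MB per request */

int
main(void)
{
    const char *path = "/qfs1/seismic/shot_0001.dat";  /* example path */
    char *buf = malloc(STREAM_BUF_SIZE);
    if (buf == NULL)
        return 1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        free(buf);
        return 1;
    }

    ssize_t n;
    long long total = 0;
    while ((n = read(fd, buf, STREAM_BUF_SIZE)) > 0)
        total += n;                         /* process each chunk here */

    printf("read %lld bytes sequentially\n", total);
    close(fd);
    free(buf);
    return 0;
}
```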
As one might expect, file systems that operate on very large data sets are prime candidates for SAM archiving. It doesn't take very many idle 50-gigabyte data sets to require the management of a few more disk arrays. This is also the context in which segmented files are most useful. Because most of the overhead is in the manipulation of metadata rather than user data, segmentation adds very little work for the system while substantially smoothing the flow of data through it. In fact, a reasonable default policy might be to segment all large files at 1-gigabyte boundaries. The 1-gigabyte size permits a reasonable tradeoff between consumption of inodes (and especially their corresponding in-memory data structures) and staging mechanism utilization.
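As a rough illustration of that tradeoff, the arithmetic below shows how a 1-gigabyte segment boundary bounds both the number of per-segment inodes a large file consumes and the amount of data a partial stage must move; the 50-gigabyte example size is arbitrary.

```c
#include <stdio.h>

/*
 * Back-of-the-envelope sketch of the segmentation tradeoff discussed
 * above. Sizes are illustrative only.
 */
int
main(void)
{
    const long long GB = 1024LL * 1024 * 1024;
    long long file_size    = 50 * GB;   /* an idle 50-gigabyte data set */
    long long segment_size = 1 * GB;    /* suggested default boundary   */

    long long segments = (file_size + segment_size - 1) / segment_size;

    printf("segments (and per-segment inodes): %lld\n", segments);
    printf("data staged to touch one segment:  %lld bytes\n", segment_size);
    return 0;
}
```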
Accelerating Operations With tmpfs
The special file system whose application requires some explanation is tmpfs. As the name implies, tmpfs stores its data in virtual memory; it is volatile, and data does not persist across a reboot (whether intentional or not). The intent behind tmpfs is to accelerate operations on data that will not be saved permanently; for example, the intermediate files that are passed between the various phases of a compiler, sort temporary files, and editor scratch files. The running instances of these applications are lost in the event of a reboot, so loss of their temporary data is of no consequence.
Even if data is temporary, there is no advantage to putting it in tmpfs if it is large enough to force the VM system to make special efforts to manage it. As a rule of thumb, tmpfs ceases to be productive when the data sets exceed about a third of the physical memory configured in the system. When the data sets get this big, they tend to compete with the applications for physical memory, which ends up lowering overall system performance. (Obviously, a truly massive number of small files that together exceed about a third of memory is also a problem.)
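The rule of thumb can be turned into a quick check. This sketch uses sysconf() to read the installed physical memory and reports one third of it as a suggested ceiling for tmpfs working sets; the one-third ratio is the guideline from the text, not a value the system enforces.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch: compute roughly one third of physical memory as a tmpfs
 * working-set ceiling, per the rule of thumb discussed above.
 */
int
main(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages < 0 || page_size < 0) {
        fprintf(stderr, "sysconf failed\n");
        return 1;
    }

    long long phys_bytes = (long long)pages * page_size;
    long long tmpfs_budget = phys_bytes / 3;

    printf("physical memory: %lld MB\n", phys_bytes >> 20);
    printf("suggested tmpfs working-set ceiling: %lld MB\n", tmpfs_budget >> 20);
    return 0;
}
```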
In a very small number of cases, tmpfs has been used to accelerate some specific database operations. This is a bit complex because it involves having a tablespace created in the temporary file system before the database is brought up. This has been done by creating and initializing the tablespace, then copying it to a reference location on stable storage. When the system comes up, the reference tablespace is copied to a well-known location on a tmpfs, and then the database is started. This is a somewhat extreme solution, and most database systems are now capable of making direct use of very large (64-bit) shared memory pools; the lack of that capability was the original reason the tmpfs technique was developed. However, this might be another useful trick for optimizing the performance of database engines that have not been extended to fully utilize 64-bit address spaces (notably MySQL and Postgres).
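The startup step described above amounts to a plain file copy from stable storage into tmpfs before the database is started. The paths in this sketch are hypothetical, and a real deployment would wrap this step in the system's startup scripts.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch of the startup step described above: copy a pre-initialized
 * reference tablespace from stable storage into tmpfs before the
 * database is started. Both paths are hypothetical examples; because
 * the tablespace is temporary by definition, no reverse copy is needed.
 */
int
main(void)
{
    const char *reference = "/var/db/reference/temp01.dbf";  /* stable copy  */
    const char *scratch   = "/tmp/temp01.dbf";               /* tmpfs target */
    char buf[65536];

    int in = open(reference, O_RDONLY);
    if (in < 0) { perror("open reference"); return 1; }

    int out = open(scratch, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) { perror("open scratch"); close(in); return 1; }

    ssize_t n;
    while ((n = read(in, buf, sizeof(buf))) > 0) {
        if (write(out, buf, (size_t)n) != n) {
            perror("write");
            close(in);
            close(out);
            return 1;
        }
    }

    close(in);
    close(out);
    return 0;   /* now start the database with its temp tablespace on tmpfs */
}
```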