5.6 File System Caches
File systems make extensive use of caches to eliminate physical I/Os where possible. A file system typically uses several different types of cache, including logical metadata caches, physical metadata caches, and block caches. Each file system implementation has its own set of caches, but they are often logically arranged as shown in Figure 5.4.
Figure 5.4 File System Caches
The arrangement of caches for various file systems is shown below:
- UFS. The file data is cached in a block cache, implemented with the VM system page cache (see Section 14.7 in Solaris™ Internals). The physical metadata (information about block placement in the file system structure) is cached in the buffer cache in 512-byte blocks. Logical metadata is cached in the UFS inode cache, which is private to UFS. File-name-to-vnode translations are cached in the central directory name lookup cache (DNLC).
- NFS. The file data is cached in a block cache, implemented with the VM system page cache (see Section 14.7 in Solaris™ Internals). The physical metadata (information about block placement in the file system structure) is cached in the buffer cache in 512-byte blocks. Logical metadata is cached in the NFS attribute cache, and NFS rnodes are cached in the NFS rnode cache; both caches are private to NFS. File-name-to-vnode translations are cached in the central DNLC.
- ZFS. The file data is cached in ZFS's adaptive replacement cache (ARC), rather than in the page cache as is the case for almost all other file systems.
5.6.1 Page Cache
File and directory data for traditional Solaris file systems, including UFS, NFS, and others, are cached in the page cache. The virtual memory system implements a page cache, and the file system uses this facility to cache files. This means that to understand file system caching behavior, we need to look at how the virtual memory system implements the page cache.
The virtual memory system divides physical memory into chunks known as pages; on UltraSPARC systems, a page is 8 kilobytes. To read data from a file into memory, the virtual memory system reads in one page at a time, or "pages in" a file. The page-in operation is initiated in the virtual memory system, which requests the file's file system to page in a page from storage to memory. Every time we read in data from disk to memory, we cause paging to occur. We see the tally when we look at the virtual memory statistics. For example, reading a file will be reflected in vmstat as page-ins.
In our example, we can see that by starting a program that does random reads of a file, we cause a number of page-ins to occur, as indicated by the numbers in the pi column of vmstat.
There is no parameter equivalent to bufhwm to limit or control the size of the page cache. The page cache simply grows to consume available free memory. See Section 14.8 in Solaris™ Internals for a complete description of how the page cache is managed in Solaris.
# ./rreadtest testfile &
# vmstat
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf  pi  po fr de sr s0 -- -- --   in   sy  cs us sy id
 0 0 0  50436  2064   5   0  81   0  0  0  0 15  0  0  0  168  361  69  1 25 74
 0 0 0  50508  1336  14   0 222   0  0  0  0 35  0  0  0  210  902 130  2 51 47
 0 0 0  50508   648  10   0 177   0  0  0  0 27  0  0  0  168  850 121  1 60 39
 0 0 0  50508   584  29  57  88 109  0  0  6 14  0  0  0  108 5284 120  7 72 20
 0 0 0  50508   484   0  50 249  96  0  0 18 33  0  0  0  199  542 124  0 50 50
 0 0 0  50508   492   0  41 260  70  0  0 56 34  0  0  0  209  649 128  1 49 50
 0 0 0  50508   472   0  58 253 116  0  0 45 33  0  0  0  198  566 122  1 46 53
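The rreadtest program used above is simply a program that issues random reads against a file; its source is not shown in this text. A minimal C sketch of such a load generator follows. It is an illustration only, not the actual rreadtest implementation, and it assumes the target file is larger than one 8-Kbyte read.

/*
 * Hypothetical random-read load generator, in the spirit of rreadtest.
 * Not the actual rreadtest source; the 8-Kbyte read size is an assumption.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        char buf[8192];                 /* one 8-Kbyte read per iteration */
        struct stat st;
        int fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) == -1) {
                fprintf(stderr, "usage: rreadtest file\n");
                return (1);
        }
        (void) fstat(fd, &st);          /* file size; assumed > 8 Kbytes */

        for (;;) {
                /* Pick a random 8-Kbyte-aligned offset within the file. */
                off_t off = (rand() % (st.st_size / sizeof (buf))) *
                    sizeof (buf);

                if (pread(fd, buf, sizeof (buf), off) == -1) {
                        perror("pread");
                        return (1);
                }
        }
        /* NOTREACHED */
}

Running such a program in the background while watching vmstat produces page-in activity like that shown above.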
You can use the MDB ::memstat dcmd to view the size of the page cache. The dcmd is included with Solaris 9 and later.
sol9# mdb -k
Loading modules: [ unix krtld genunix ip ufs_log logindmux ptm cpc sppp ipc random nfs ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      53444               208   10%
Anon                       119088               465   23%
Exec and libs                2299                 8    0%
Page cache                  29185               114    6%
Free (cachelist)              347                 1    0%
Free (freelist)            317909              1241   61%

Total                      522272              2040
Physical                   512136              2000
The page-cache-related categories are described as follows:
- Exec and libs. The amount of memory used for mapped files interpreted as binaries or libraries. This is typically the sum of memory used for user binaries and shared libraries. Technically, this memory is part of the page cache, but it is tagged as executable when a file is mapped with PROT_EXEC and the file permissions include execute permission.
- Page cache. The amount of unmapped page cache, that is, page cache not on the cache list. This category includes the segmap portion of the page cache and any memory-mapped files. If the applications on the system are solely using a read/write path, then we would expect the size of this category not to exceed segmap_percent (which defaults to 12% of physical memory). Files in /tmp are also included in this category.
- Free (cache list). The amount of page cache on the free list. The free list contains unmapped file pages and is typically where the majority of the file system cache resides. Expect to see a large cache list on a system that has large file sets and sufficient memory for file caching. Beginning with Solaris 8, the file system cycles its pages through the cache list, preventing it from stealing memory from other applications unless a true memory shortage occurs.
The complete list of categories is described in Section 6.4.3 and further in Section 14.8 in Solaris™ Internals.
With DTrace, we now have a method of collecting one of the most significant performance statistics for a file system in Solaris—the cache hit ratio in the file system page cache. By using DTrace with probes at the entry and exit to the file system, we can collect the logical I/O events into the file system and physical I/O events from the file system into the device I/O subsystem.
#!/usr/sbin/dtrace -s

#pragma D option quiet

::fop_read:entry
/self->trace == 0 && (((vnode_t *)arg0)->v_vfsp)->vfs_vnodecovered/
{
        vp = (vnode_t *)arg0;
        vfs = (vfs_t *)vp->v_vfsp;
        mountvp = vfs->vfs_vnodecovered;
        uio = (uio_t *)arg1;
        self->path = stringof(mountvp->v_path);
        @rio[stringof(mountvp->v_path), "logical"] = count();
        @rbytes[stringof(mountvp->v_path), "logical"] = sum(uio->uio_resid);
        self->trace = 1;
}

::fop_read:entry
/self->trace == 0 && (((vnode_t *)arg0)->v_vfsp == `rootvfs)/
{
        vp = (vnode_t *)arg0;
        vfs = (vfs_t *)vp->v_vfsp;
        mountvp = vfs->vfs_vnodecovered;
        uio = (uio_t *)arg1;
        self->path = "/";
        @rio[stringof("/"), "logical"] = count();
        @rbytes[stringof("/"), "logical"] = sum(uio->uio_resid);
        self->trace = 1;
}

::fop_read:return
/self->trace == 1/
{
        self->trace = 0;
}

io::bdev_strategy:start
/self->trace/
{
        @rio[self->path, "physical"] = count();
        @rbytes[self->path, "physical"] = sum(args[0]->b_bcount);
}

tick-5s
{
        trunc (@rio, 20);
        trunc (@rbytes, 20);
        printf("\033[H\033[2J");
        printf ("\nRead IOPS\n");
        printa ("%-60s %10s %10@d\n", @rio);
        printf ("\nRead Bandwidth\n");
        printa ("%-60s %10s %10@d\n", @rbytes);
        trunc (@rbytes);
        trunc (@rio);
}
These two statistics give us insight into how effective the file system cache is, and whether adding physical memory could increase the amount of file-system-level caching.
Using this script, we can probe for the number of logical bytes in the file system through the new Solaris 10 file system fop layer. We count the physical bytes by using the io provider. Running the script, we can see the number of logical and physical bytes for a file system, and we can use these numbers to calculate the hit ratio.
Read IOPS
/data1                                                       physical        287
/data1                                                       logical        2401

Read Bandwidth
/data1                                                       physical    2351104
/data1                                                       logical     5101240
The /data1 file system on this server is doing 2401 logical IOPS and 287 physical—that is, a hit ratio of 2401 ÷ (2401 + 287) = 89%. It is also doing 5.1 Mbytes/sec logical and 2.3 Mbytes/sec physical.
We can also do this at the file level.
#!/usr/sbin/dtrace -s

#pragma D option quiet

::fop_read:entry
/self->trace == 0 && (((vnode_t *)arg0)->v_path)/
{
        vp = (vnode_t *)arg0;
        uio = (uio_t *)arg1;
        self->path = stringof(vp->v_path);
        self->trace = 1;
        @rio[stringof(vp->v_path), "logical"] = count();
        @rbytes[stringof(vp->v_path), "logical"] = sum(uio->uio_resid);
}

::fop_read:return
/self->trace == 1/
{
        self->trace = 0;
}

io::bdev_strategy:start
/self->trace/
{
        @rio[self->path, "physical"] = count();
        @rbytes[self->path, "physical"] = sum(args[0]->b_bcount);
}

tick-5s
{
        trunc (@rio, 20);
        trunc (@rbytes, 20);
        printf("\033[H\033[2J");
        printf ("\nRead IOPS\n");
        printa ("%-60s %10s %10@d\n", @rio);
        printf ("\nRead Bandwidth\n");
        printa ("%-60s %10s %10@d\n", @rbytes);
        trunc (@rbytes);
        trunc (@rio);
}
5.6.2 Bypassing the Page Cache with Direct I/O
In some cases we may want to do completely unbuffered I/O to a file. A direct I/O facility in most file systems allows a direct file read or write to completely bypass the file system page cache. Direct I/O is supported on the following file systems:
- UFS. Support for direct I/O was added to UFS starting with Solaris 2.6. Direct I/O allows reads and writes to files in a regular file system to bypass the page cache and access the file at near raw disk performance. Direct I/O can be advantageous when you are accessing a file in a manner where caching is of no benefit. For example, if you are copying a very large file from one disk to another, then it is likely that the file will not fit in memory and you will just cause the system to page heavily. By using direct I/O, you can copy the file through the file system without reading through the page cache and thereby eliminate both the memory pressure caused by the file system and the additional CPU cost of the layers of cache.
Direct I/O also eliminates the double copy that is performed when the read and write system calls are used. When we read a file through normal buffered I/O, the file system takes two steps: (1) It uses a DMA transfer from the disk controller into the kernel's address space and (2) it copies the data into the buffer supplied by the user in the read system call. Direct I/O eliminates the second step by arranging for the DMA transfer to occur directly into the user's address space.
Direct I/O bypasses the buffer cache only if all the following are true:
  - The file is not memory mapped.
  - The file does not have holes.
  - The read/write is sector aligned (512 byte).
- QFS. Support for direct I/O is the same as with UFS.
- NFS. NFS also supports direct I/O. With direct I/O enabled, NFS bypasses client-side caching and passes all requests directly to the NFS server. Both reads and writes are uncached and become synchronous (they need to wait for the server to complete). Unlike disk-based direct I/O support, NFS's support imposes no restrictions on I/O size or alignment; all requests are made directly to the server.
You enable direct I/O by mounting an entire file system with the forcedirectio mount option, as shown below.
# mount -o forcedirectio /dev/dsk/c0t0d0s6 /u1
You can also enable direct I/O for any file with the directio system call. Note that the change is file based, and every reader and writer of the file will be forced to use directio once it's enabled.
int directio(int fildes, DIRECTIO_ON | DIRECTIO_OFF);

See sys/fcntl.h
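For example, a program that copies a very large file could request direct I/O on just the files it has opened rather than forcing it for an entire file system. The sketch below is illustrative only; the path /u1/bigfile is a hypothetical example.

/* Minimal sketch: enable direct I/O on one open file (the path is hypothetical). */
#include <fcntl.h>      /* open(); on Solaris also directio() and DIRECTIO_ON/OFF */
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        char buf[65536];        /* large reads suit direct I/O best */
        int fd = open("/u1/bigfile", O_RDONLY);

        if (fd == -1) {
                perror("open");
                return (1);
        }

        /* File-based setting: every reader and writer of this file is affected. */
        if (directio(fd, DIRECTIO_ON) == -1)
                perror("directio");

        while (read(fd, buf, sizeof (buf)) > 0)
                ;                               /* sequential read loop */

        (void) directio(fd, DIRECTIO_OFF);      /* restore buffered behavior */
        (void) close(fd);
        return (0);
}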
Direct I/O can provide extremely fast transfers when moving data with big block sizes (>64 kilobytes), but it can be a significant performance limitation for smaller sizes. If an application reads and writes in small sizes, then its performance may suffer since there is no read-ahead or write clustering and no caching.
Databases are a good candidate for direct I/O since they cache their own blocks in a shared global buffer and can cluster their own reads and writes into larger operations.
A set of direct I/O statistics is provided with the ufs implementation by means of the kstat interface. The structure exported by ufs_directio_kstats is shown below. Note that this structure may change, and performance tools should not rely on the format of the direct I/O statistics.
struct ufs_directio_kstats {
        uint_t  logical_reads;   /* Number of fs read operations */
        uint_t  phys_reads;      /* Number of physical reads */
        uint_t  hole_reads;      /* Number of reads from holes */
        uint_t  nread;           /* Physical bytes read */
        uint_t  logical_writes;  /* Number of fs write operations */
        uint_t  phys_writes;     /* Number of physical writes */
        uint_t  nwritten;        /* Physical bytes written */
        uint_t  nflushes;        /* Number of times cache was cleared */
} ufs_directio_kstats;
You can inspect the direct I/O statistics with a utility from our Web site at http://www.solarisinternals.com.
# directiostat 3
 lreads lwrites  preads pwrites     Krd     Kwr holdrds  nflush
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
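If you want to read these counters programmatically rather than through the directiostat script, libkstat can be used. The sketch below is a hedged example: the kstat module and name strings, and the raw-kstat layout matching the structure shown above, are assumptions that should be verified on your release (for example, with kstat -m "ufs directio" or by browsing kstat -c output). Link with -lkstat.

/*
 * Sketch: read the UFS direct I/O counters through libkstat.
 * The module/name pair ("ufs directio", "UFS DirectIO Stats") and the raw
 * layout are assumptions -- verify the exact names on your system.
 * Compile with:  cc -o dio dio.c -lkstat
 */
#include <sys/types.h>
#include <kstat.h>
#include <stdio.h>

struct ufs_directio_kstats {            /* layout as shown in the text above */
        uint_t logical_reads, phys_reads, hole_reads, nread;
        uint_t logical_writes, phys_writes, nwritten, nflushes;
};

int
main(void)
{
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;
        struct ufs_directio_kstats *dio;

        if (kc == NULL ||
            (ksp = kstat_lookup(kc, "ufs directio", -1,
            "UFS DirectIO Stats")) == NULL ||
            kstat_read(kc, ksp, NULL) == -1) {
                fprintf(stderr, "direct I/O kstat not found\n");
                return (1);
        }
        dio = (struct ufs_directio_kstats *)ksp->ks_data;
        printf("logical reads  %u\n", dio->logical_reads);
        printf("physical reads %u\n", dio->phys_reads);
        printf("bytes read     %u\n", dio->nread);
        (void) kstat_close(kc);
        return (0);
}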
5.6.3 The Directory Name Lookup Cache
The directory name cache caches path names for vnodes, so when we open a file that has been opened recently, we don't need to rescan the directory to find the file name. Each time we find the path name for a vnode, we store it in the directory name cache. (See Section 14.10 in Solaris™ Internals for further information on DNLC operation.) The number of entries in the DNLC is set by the system-tunable parameter ncsize, which is set at boot time by the calculations shown in Table 5.1. The ncsize parameter is calculated in proportion to the maxusers parameter, which is equal to the number of megabytes of memory installed in the system, capped at a maximum of 1024. The maxusers parameter can also be overridden in /etc/system to a maximum of 2048.
Table 5.1. DNLC Default Sizes
Solaris Version              Default ncsize Calculation
Solaris 2.4, 2.5, 2.5.1      ncsize = (17 * maxusers) + 90
Solaris 2.6 onwards          ncsize = (68 * maxusers) + 360
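For example, a system with 1 Gbyte of physical memory has maxusers set to 1024, so under the Solaris 2.6 formula ncsize = (68 x 1024) + 360 = 69992 entries.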
The size of the DNLC rarely needs to be adjusted, because the size scales with the amount of memory installed in the system. Earlier Solaris versions had a default maximum of 17498 (34906 with maxusers set to 2048), and later Solaris versions have a maximum of 69992 (139624 with maxusers set to 2048).
Use MDB to determine the size of the DNLC.
# mdb -k
> ncsize/D
ncsize:
ncsize:         25520
The DNLC maintains its housekeeping through a task queue. The dnlc_reduce_cache() function activates the task queue when the number of name cache entries reaches ncsize, and it reduces the cache to dnlc_nentries_low_water, which by default is 99% of ncsize (ncsize less one hundredth of ncsize). If dnlc_nentries reaches dnlc_max_nentries (twice ncsize), then we know that dnlc_reduce_cache() is failing to keep up; in this case, we refuse to add new entries to the DNLC until the task queue catches up. The DNLC hit rate is reported by vmstat -s, as shown below.
# vmstat -s
        0 swap ins
        0 swap outs
        0 pages swapped in
        0 pages swapped out
   405332 total address trans. faults taken
  1015894 page ins
      353 page outs
  4156331 pages paged in
     1579 pages paged out
  3600535 total reclaims
  3600510 reclaims from free list
        0 micro (hat) faults
   405332 minor (as) faults
   645073 major faults
    85298 copy-on-write faults
   117161 zero fill page faults
        0 pages examined by the clock daemon
        0 revolutions of the clock hand
  4492478 pages freed by the clock daemon
     3205 forks
       88 vforks
     3203 execs
 33830316 cpu context switches
 58808541 device interrupts
   928719 traps
214191600 system calls
 14408382 total name lookups (cache hits 90%)
   263756 user   cpu
   462843 system cpu
 14728521 idle   cpu
  2335699 wait   cpu
The hit ratio of the directory name cache shows the number of times a name was looked up and found in the name cache. A high hit ratio (>90%) typically shows that the DNLC is working well. A low hit ratio does not necessarily mean that the DNLC is undersized; it simply means that we are not always finding the names we want in the name cache. This situation can occur if we are creating a large number of files. The reason is that a create operation checks to see if a file exists before it creates the file, causing a large number of cache misses.
The DNLC statistics are also available with kstat.
$ kstat -n dnlcstats
module: unix                            instance: 0
name:   dnlcstats                       class:    misc
        crtime                          208.832373709
        dir_add_abort                   0
        dir_add_max                     0
        dir_add_no_memory               0
        dir_cached_current              1
        dir_cached_total                13
        dir_entries_cached_current      880
        dir_fini_purge                  0
        dir_hits                        463
        dir_misses                      11240
        dir_reclaim_any                 8
        dir_reclaim_last                3
        dir_remove_entry_fail           0
        dir_remove_space_fail           0
        dir_start_no_memory             0
        dir_update_fail                 0
        double_enters                   6
        enters                          11618
        hits                            1347693
        misses                          10787
        negative_cache_hits             76686
        pick_free                       0
        pick_heuristic                  0
        pick_last                       0
        purge_all                       1
        purge_fs1                       0
        purge_total_entries             3013
        purge_vfs                       158
        purge_vp                        31
        snaptime                        94467.490008162
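From the counters above, a simple estimate of the DNLC hit ratio is hits ÷ (hits + misses) = 1347693 ÷ (1347693 + 10787), or about 99% over the lifetime of this kstat snapshot.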
5.6.4 Block Buffer Cache
The buffer cache used in Solaris for caching of inodes and file metadata is now also dynamically sized. In old versions of UNIX, the buffer cache was fixed in size by the nbuf kernel parameter, which specified the number of 512-byte buffers. We now allow the buffer cache to grow by nbuf, as needed, until it reaches a ceiling specified by the bufhwm kernel parameter. By default, the buffer cache is allowed to grow until it uses 2% of physical memory. We can look at the upper limit for the buffer cache by using the sysdef command.
# sysdef
*
* Tunable Parameters
*
  7757824   maximum memory allowed in buffer cache (bufhwm)
     5930   maximum number of processes (v.v_proc)
       99   maximum global priority in sys class (MAXCLSYSPRI)
     5925   maximum processes per user id (v.v_maxup)
       30   auto update time limit in seconds (NAUTOUP)
       25   page stealing low water mark (GPGSLO)
        5   fsflush run rate (FSFLUSHR)
       25   minimum resident memory for avoiding deadlock (MINARMEM)
       25   minimum swapable memory for avoiding deadlock (MINASMEM)
Now that only inodes and metadata are kept in the buffer cache, we don't need a very large buffer cache. In fact, we need only 300 bytes per inode and about 1 megabyte per 2 gigabytes of files that we expect to be accessed concurrently. (Note that this rule of thumb applies to UFS file systems.)
For example, if we have a database system with 100 files totaling 100 gigabytes of storage space and we estimate that we will access only 50 gigabytes of those files at the same time, then at most we would need 100 x 300 bytes = 30 kilobytes for the inodes and about 50 ÷ 2 x 1 megabyte = 25 megabytes for the metadata (direct and indirect blocks). On a system with 5 gigabytes of physical memory, the defaults for bufhwm would provide us with a bufhwm of 102 megabytes, which is more than sufficient for the buffer cache. If we are really memory misers, we could limit bufhwm to 30 megabytes (specified in kilobytes) by setting the bufhwm parameter in the /etc/system file. To set bufhwm smaller for this example, we would put the following line into the /etc/system file.
*
* Limit size of bufhwm
*
set bufhwm=30000
You can monitor the buffer cache hit statistics by using sar -b. The statistics for the buffer cache show the number of logical reads and writes into the buffer cache, the number of physical reads and writes out of the buffer cache, and the read/write hit ratios.
# sar -b 3 333

SunOS zangief 5.7 Generic sun4u    06/27/99

22:01:51 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
22:01:54       0    7118     100       0       0     100       0       0
22:01:57       0    7863     100       0       0     100       0       0
22:02:00       0    7931     100       0       0     100       0       0
22:02:03       0    7736     100       0       0     100       0       0
22:02:06       0    7643     100       0       0     100       0       0
22:02:09       0    7165     100       0       0     100       0       0
22:02:12       0    6306     100       8      25      68       0       0
22:02:15       0    8152     100       0       0     100       0       0
22:02:18       0    7893     100       0       0     100       0       0
On this system we can see that the buffer cache is caching 100% of the reads and that the number of writes is small. This measurement was taken on a machine with 100 gigabytes of files that were being read in a random pattern. You should aim for a read cache hit ratio of 100% on systems with only a few, but very large, files (for example, database systems) and a hit ratio of 90% or better for systems with many files.
5.6.5 UFS Inode Cache
UFS uses the ufs_ninode parameter to size the file system tables for the expected number of inodes. To understand how ufs_ninode affects the number of inodes in memory, we need to look at how UFS maintains inodes. Inodes are created in memory when a file is first referenced, and they can remain in memory long after the file was last referenced because an inode can be in one of two states: either the inode is referenced, or it is no longer referenced but is held on an idle queue. Inodes are eventually destroyed when they are pushed off the end of the inode idle queue. Refer to Section 15.3.2 in Solaris™ Internals for a description of how UFS inodes are maintained on the idle queue.
The number of inodes in memory is dynamic. Inodes will continue to be allocated as new files are referenced. There is no upper bound to the number of inodes open at a time; if one million inodes are opened concurrently, then a little over one million inodes will be in memory at that point. A file is referenced when its reference count is non-zero, which means that either the file is open for a process or another subsystem such as the directory name lookup cache is referring to the file.
When inodes are no longer referenced (the file is closed and no other subsystem is referring to the file), the inode is placed on the idle queue and eventually freed. The size of the idle queue is controlled by the ufs_ninode parameter and is limited to one-fourth of ufs_ninode. The maximum number of inodes in memory at a given point is the number of active referenced inodes plus the size of the idle queue (typically, one-fourth of ufs_ninode). Figure 5.5 illustrates the inode cache.
Figure 5.5 In-Memory Inodes (Referred to as the "Inode Cache")
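As an illustration of the sizing rule above: if ufs_ninode is set to 40000, the idle queue is limited to roughly 10000 inodes, so if 30000 inodes are actively referenced at the same time, about 40000 inodes will be held in memory. (These numbers are hypothetical and chosen only to show the arithmetic.)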
We can use the sar command and the inode kernel memory statistics to determine the number of inodes currently in memory. sar reports the number of in-memory inodes and the number of inode structures in the inode slab cache; similar information is available from the buf_inuse and buf_total fields in the inode kernel memory statistics.
# sar -v 3 3

SunOS devhome 5.7 Generic sun4u    08/01/99

11:38:09  proc-sz    ov  inod-sz      ov  file-sz  ov  lock-sz
11:38:12  100/5930    0  37181/37181   0  603/603   0  0/0
11:38:15  100/5930    0  37181/37181   0  603/603   0  0/0
11:38:18  101/5930    0  37181/37181   0  607/607   0  0/0

# kstat -n ufs_inode_cache
ufs_inode_cache:
        buf_size 440             align 8                  chunk_size 440
        slab_size 8192           alloc 1221573            alloc_fail 0
        free 1188468             depot_alloc 19957        depot_free 21230
        depot_contention 18      global_alloc 48330       global_free 7823
        buf_constructed 3325     buf_avail 3678           buf_inuse 37182
        buf_total 40860          buf_max 40860            slab_create 2270
        slab_destroy 0           memory_class 0           hash_size 0
        hash_lookup_depth 0      hash_rescale 0           full_magazines 219
        empty_magazines 332      magazine_size 15         alloc_from_cpu0 579706
        free_to_cpu0 588106      buf_avail_cpu0 15        alloc_from_cpu1 573580
        free_to_cpu1 571309      buf_avail_cpu1 25
The buf_inuse field in the inode kernel memory statistics shows us how many inodes are allocated. We can also see from the UFS inode memory statistics that the size of each inode is 440 bytes on this system. See below to find out the size of an inode on different architectures.
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ...]
> a$d
radix = 10 base ten
> ::sizeof inode_t
sizeof (inode_t) = 0t276
> $q

$ kstat unix::ufs_inode_cache:chunk_size
module: unix                            instance: 0
name:   ufs_inode_cache                 class:    kmem_cache
        chunk_size                      280
We can use this value to calculate the amount of kernel memory required for the desired number of inodes when setting ufs_ninode and the directory name cache size.
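For example, the kstat output above shows 37182 in-use inodes of 440 bytes each, or roughly 16 Mbytes of kernel memory; at the 280-byte chunk size shown immediately above, the same number of inodes would need only about 10 Mbytes.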
The ufs_ninode parameter controls the size of the hash table that is used for inode lookup and indirectly sizes the inode idle queue (ufs_ninode ÷ 4). The inode hash table is ideally sized to match the total number of inodes expected to be in memory—a number that is influenced by the size of the directory name cache. By default, ufs_ninode is set to the size of the directory name cache, which is approximately the correct size for the inode hash table. In an ideal world, we could set ufs_ninode to four-thirds the size of the DNLC, to take into account the size of the idle queue, but practice has shown this to be unnecessary.
We typically set ufs_ninode indirectly by setting the directory name cache size (ncsize) to the expected number of files accessed concurrently, but it is possible to set ufs_ninode separately in /etc/system.
*
* Set number of inodes stored in UFS inode cache
*
set ufs_ninode = new_value
5.6.6 Monitoring UFS Caches with fcachestat
We can monitor all four key UFS caches by using a single Perl tool: fcachestat. This tool measures the DNLC, inode, UFS buffer cache (metadata), and page cache by means of segmap.
$ ./fcachestat 5
   --- dnlc ---    -- inode ---    -- ufsbuf --    -- segmap --
   %hit    total   %hit    total   %hit    total   %hit    total
  99.64   693.4M  59.46     4.9M  99.80    94.0M  81.39   118.6M
  66.84    15772  28.30     6371  98.44     3472  82.97     9529
  63.72    27624  21.13    12482  98.37     7435  74.70    14699
  10.79    14874   5.64    16980  98.45    12349  93.44    11984
  11.96    13312  11.89    14881  98.37    10004  93.53    10478
   4.08    20139   5.71    25152  98.42    17917  97.47    16729
   8.25    17171   3.57    20737  98.38    15054  93.64    11154
  15.40    12151   6.89    13393  98.37     9403  93.14    11941
   8.26     9047   4.51    10899  98.26     7861  94.70     7186
  66.67        6   0.00        3  95.45       44  44.44       18