More About Filesystems
If we consider filesystems as a mechanism for both storing and locating data, then the two key elements for any filesystem are the items being stored and the list of where those items are. The deeper details of how a given filesystem manipulates its data and meta-information go beyond the scope of this chapter but are addressed further in Appendix B, "Anatomy of a Filesystem."
Filesystem Components That the Admin Needs to Know About
As always, we need to get a handle on the vocabulary before we can understand how the elements of a filesystem work together. The next three sections describe the basic components with which you, as a sysadmin, need to be familiar.
Files
The most intuitively obvious components of a filesystem are, of course, its files. Because everything in UNIX is a file, special functions are differentiated by file type. There are fewer file types than you might imagine, as Table 3.2 shows.
Table 3.2 File Types and Purposes, with Examples
File Type | Purpose/Contents | Examples
Directory | Maintains information for directory structure | /, /usr, /etc
Block special | Buffered device file | Linux: /dev/hda1; Solaris: /dev/dsk/c0t0d0s0
Character special | Raw device file | Linux: /dev/tty0; Solaris: /dev/rdsk/c0t0d0s0
UNIX domain socket | Interprocess communication (IPC) | See output of commands for files. Linux: netstat -x; Solaris: netstat -f unix
Named pipe special (FIFO device) | First-in-first-out IPC mechanism, invoked by name | Linux: /dev/initctl; Solaris: /etc/utmppipe, /etc/cron.d/FIFO
Symbolic link | Pointer to another file (any type) | /usr/tmp -> ../var/tmp
Regular | All other files; holds data of all other types | Text files, object files, database files, executables/binaries
Notice that directories are a type of file. The key is that they have a specific type of format and contents (see Appendix B for more details). A directory holds the filenames and index numbers (see the following section, "Inodes") of all its constituent files, including subdirectories.
Directory files are not flat (or regular) files, but are indexed (like a database), so that you can still locate a file quickly when you have a large number of files in the same directory.
Even though file handling is generally transparent, it is important to remember that a file's data blocks may not be stored sequentially (or even in the same general disk region). When data blocks are widely scattered in an uncoordinated manner, it can affect access times and increase I/O overhead.
Inodes
Meta-information about files is stored in structures called index nodes, or inodes. Their contents vary based on the particular filesystem in use, but all inodes hold the following information about the file they index:
Inode identification number
File type
Owners: user and group
UNIX permissions
File size
Timestamps
ctime: Last file status change time
mtime: Last data modification time
atime: Last access time
Reference/link count
Physical location information for data blocks
Notice that the filename is not stored in the inode, but as an entry in the file's closest parent directory.
All other information about a file that ls displays is stored in the file's inode. With a few handy options, you can pull out lots of useful information. Let's say that you want to know the inode number of the Solaris kernel. You just give the -i option, and voilà:
[sun:10 ~]ls -i /kernel/genunix
264206 genunix
Of course, ls -l is an old friend, telling you most everything that you want to know. Looking at the Solaris kernel again, you get the output in Figure 3.4.
Figure 3.4 Diagrammed Output of ls on a File
Notice that the timestamp shown by default in a long listing is mtime. You can pass options to ls to view ctime (ls -lc) or atime (ls -lu) instead. For other nifty permutations, see the ls man page.
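The same inode fields can be read from a program. The following is a minimal sketch in Python, assuming a UNIX-like system whose temporary directory supports hard links; the function names inode_summary and demo_hard_link are made up for illustration. It shows that two names for one file share a single inode and that the link count reflects both names.

```python
import os
import tempfile


def inode_summary(path):
    """Return a few inode fields for path, without following symlinks."""
    st = os.lstat(path)
    return {
        "inode": st.st_ino,    # inode identification number
        "links": st.st_nlink,  # reference/link count
        "size": st.st_size,    # file size in bytes
        "mtime": st.st_mtime,  # last data modification time
        "atime": st.st_atime,  # last access time
        "ctime": st.st_ctime,  # last status change time
    }


def demo_hard_link():
    """Create a file, give it a second name, and summarize both names."""
    with tempfile.TemporaryDirectory() as scratch:
        first = os.path.join(scratch, "first.txt")
        second = os.path.join(scratch, "second.txt")
        with open(first, "w") as handle:
            handle.write("hello\n")
        os.link(first, second)  # second directory entry, same inode
        return inode_summary(first), inode_summary(second)


if __name__ == "__main__":
    a, b = demo_hard_link()
    print("same inode:", a["inode"] == b["inode"])
    print("link count:", a["links"])
```

Because the filename lives in the directory rather than in the inode, neither name is more "real" than the other; the data is released only when the link count drops to zero.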
File Permissions and Ownership Refresher
Because UNIX was designed to support many users, the question naturally arises of how to know who can see what files. The first and simplest answer is to permit users to examine only their own files. This, of course, would make it difficult, if not impossible, to share files, creating great difficulties in collaborative environments and causing a string of other problems. The most obvious example: Why can't I run ls? Because the system created it, not you.
Users and Groups
UNIX uses a three-part system to determine file access: There's what you, as the file owner, are allowed to do; there's what the group is allowed to do; and there's what other people are allowed to do. Let's see what Elvis's permissions look like:
[ elvis@frogbog elvis ]$ ls -l
total 36
drwxr-xr-x 5 elvis users 4096 Dec 9 21:55 Desktop
drwxr-xr-x 2 elvis users 4096 Dec 9 22:00 Mail
-rw-r--r-- 1 elvis users 36 Dec 9 22:00 README
-rw-r--r-- 1 elvis users 22 Dec 9 21:59 ThisFile
drwxr-xr-x 2 elvis users 4096 Dec 12 19:57 arc
drwxr-xr-x 2 elvis users 4096 Dec 10 00:40 songs
-rw-r--r-- 1 elvis users 46 Dec 12 19:52 tao.txt
-rw-r--r-- 1 elvis users 21 Dec 9 21:59 thisfile
-rw-r--r-- 1 elvis users 45 Dec 12 19:52 west.txt
As long as we're here, let's break down exactly what's being displayed. First, we have a 10-character string of letters and hyphens. This is the representation of permissions, which I'll break down in a minute. The second item is a number, usually a single digit. This is the number of hard links to that file or directory. I'll discuss this later in this chapter. The third thing is the username of the file owner, and the fourth is the name of the file's group. The fifth column is a number representing the size of the file, in bytes. The sixth contains the date and time of last modification for the file, and the final column shows the filename.
Every user on the system has a username and a number that is associated with that user. This number generally is referred to as the UID, short for user ID. If a user has been deleted but, for some reason, his files remain, the username is replaced with that user's UID. Similarly, if a group is deleted but still owns files, the GID (group number) shows up instead of a name in the group field. There are also other circumstances in which the system can't correlate the name and the number, but these should be relatively rare occurrences.
As a user, you can't change the owner of your files: This would open up some serious security holes on the system. Only root can chown files, so if a file ends up with the wrong owner, you can ask root to chown it to you. As a user, you can chgrp a file to a different group of which you are a member. That is, if Elvis is a member of a group named users and a group named elvis, he can chgrp elvis west.txt or chgrp users west.txt, but because he's not a member of the group beatles, he can't chgrp beatles west.txt. A user can belong to any number of groups. Which group a newly created file belongs to varies somewhat by flavor: On some systems, new files belong to the group to which the directory belongs. On most modern UNIX variants, new files belong to whatever group is listed as your primary group by the system in the /etc/passwd file, and this can be changed for the current session via the newgrp command. On these systems, Elvis can run newgrp users if he wants his new files to belong to the users group, or newgrp elvis if he wants them to belong to the elvis group.
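The fallback from name to number that a long listing performs can be sketched in a few lines of Python. This is an illustration only, not how ls itself is implemented; the function names owner_name and group_name and the lookup parameter are assumptions introduced here so that the fallback can be exercised without deleting a real account.

```python
import grp
import pwd


def owner_name(uid, lookup=pwd.getpwuid):
    """Return the username for uid, or the number itself if no user matches."""
    try:
        return lookup(uid).pw_name
    except KeyError:
        # No passwd entry (for example, the account was deleted).
        return str(uid)


def group_name(gid, lookup=grp.getgrgid):
    """Return the group name for gid, or the number itself if no group matches."""
    try:
        return lookup(gid).gr_name
    except KeyError:
        return str(gid)
```

Calling owner_name(st.st_uid) on the result of os.stat gives the same owner column that the listing shows, bare UID included when the account is gone.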
Reading Permissions
So, what were those funny strings of letters and hyphens at the beginning of each long directory listing? I already said that they represented the permissions of the file, but that's not especially helpful. The 10 characters of that string represent the permission bits for each file. The first character is separate, and the last nine are three very similar groups of three characters. I'll explain each of these in turn.
If you look back to Elvis's long listing of his directory, you'll see that most of the files simply have a hyphen as the first character, whereas several possess a d in this field. The more astute reader might note that the files with a d in that first field all happen to be directories. There's a good reason for this: The first permissions character denotes whether that file is a special file of one sort or another.
What's a special file? It's either something that isn't really a file (in the sense of a sequential stream of bytes on a disk) but that UNIX treats as a file, such as a disk or a video display, or something that is really a file but that is treated differently. A directory, by necessity, is a stream of bytes on disk, but that d means that it's treated differently.
The next three characters represent what the user who owns the file can do with it. From left to right, these permissions are read, write, and execute. Read permission is just that: the capability to see the contents of a file. Write permission is the right to change the contents of a file. Deleting a file is a different matter: It is governed by write permission on the directory that holds the file, not on the file itself, although rm asks for confirmation before removing a file to which you lack write permission.
Execute permission determines whether the file is also a command that can be run on the system. Because UNIX sees everything as a file, all commands are stored in files that can be created, modified, and deleted like any other file. The computer then needs a way to tell what can and can't be run. The execute bit does this.
Another important reason that you need to worry about whether a file is executable is that some programs are designed to be run only by the system administrator: These programs can modify the computer's configuration or can be dangerous in some other way. Because UNIX enables you to specify permissions for the owner, the group, and other users, the execute bit enables the administrator to restrict the use of dangerous programs.
Directories treat the execute permission differently. If a directory does not have execute permissions, that user (or group, or other users on the system) can't cd into that directory and can't look at information about the files in that directory. (You usually can find the names of the files, however.) Even if you have permissions for the files in that directory, you generally can't look at them. (This varies somewhat by platform.)
The second set of three characters is the group permissions (read, write, and execute, in that order), and the final set of three characters is what other users on the system are permitted to do with that file. Because of security concerns (either due to other users on your system or due to pervasive networks such as the Internet), giving write access to other users is highly discouraged.
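To make the layout of the 10-character string concrete, here is a small Python sketch that splits it into the four parts just described. It relies on the standard library's stat.filemode, which renders a numeric mode the way ls does; the function name describe_mode is made up for this example.

```python
import stat


def describe_mode(mode):
    """Split an ls-style mode string into its file type and three triads."""
    text = stat.filemode(mode)  # for example, 'drwxr-xr-x'
    return {
        "type": text[0],     # '-' regular, 'd' directory, 'l' symlink, ...
        "user": text[1:4],   # what the owner may do
        "group": text[4:7],  # what members of the file's group may do
        "other": text[7:10], # what everyone else may do
    }
```

In practice the mode argument would come from os.lstat(path).st_mode, so that describe_mode(os.lstat("songs").st_mode) reports on Elvis's songs directory.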
Changing Permissions
Great, you can now read the permissions in the directory listing, but what can you do with them? Let's say that Elvis wants to make his directory readable only by himself. He can chmod go-rwx ~/songs: That means remove the read, write, and execute permissions for the group and others on the system. If Elvis decides to let Nashville artists take a look at his material but not change it (and if there's a group nashville on the system), he can first chgrp nashville songs and then chmod g+r songs.
If Elvis does this, however, he'll find that (at least, on some platforms) members of group nashville can't look at them. Oops! With a simple chmod g+x songs, the problem is solved:
[ elvis@frogbog elvis ]$ ls -l
total 36
drwxr-xr-x 5 elvis users 4096 Dec 9 21:55 Desktop
drwxr-xr-x 2 elvis users 4096 Dec 9 22:00 Mail
-rw-r--r-- 1 elvis users 36 Dec 9 22:00 README
-rw-r--r-- 1 elvis users 22 Dec 9 21:59 ThisFile
drwxr-xr-x 2 elvis users 4096 Dec 12 19:57 arc
drwxr-x--- 2 elvis nashvill 4096 Dec 15 14:21 songs
-rw-r--r-- 1 elvis users 46 Dec 12 19:52 tao.txt
-rw-r--r-- 1 elvis users 21 Dec 9 21:59 thisfile
-rw-r--r-- 1 elvis users 45 Dec 12 19:52 west.txt
Special Permissions
In addition to the read, write, and execute bits, there exist special permissions used by the system to determine how and when to suspend the normal permission rules. Any thorough understanding of UNIX requires an understanding of the setuid, setgid, and sticky bits. For normal system users, only a general understanding of these is necessary, and this discussion is thus brief. Good documentation on this subject exists elsewhere for budding system administrators and programmers.
setuid
The setuid bit applies only to executable files and directories. In the case of executable programs, it means that the given program runs as though the file owner were running it. That is, xhextris, a variant on Tetris, has the following permissions on my system:
-rwsr-xr-x 1 games games 32516 May 18 1999 /usr/X11R6/bin/xhextris
There's a pseudouser called games on the system, which can't be logged into and has no home directory. When the xhextris program executes, it can read and write to files that only the games pseudouser normally would be permitted to use. In this case, there's a high-score file stored on the system that is writeable only by that user. When Elvis runs the game, the system acts as though he were the user games, and thus he is able to store the high-score file. To set the setuid bit on a file, you can tell chmod to give it mode u+s. (You can think of this as uid set, although this isn't technically accurate.)
setgid
The setgid bit, which stands for "set group id," works almost identically to setuid, except that the system acts as though the user's group is that of the given file. If xhextris had used setgid games instead of setuid games, the game could write its high-score file anywhere that the group games has write permission. It is used by the system administrator in ways fundamentally similar to the setuid permission.
When applied to directories on Linux, Irix, and Solaris (and probably most other POSIX-compliant UNIX flavors as well), the setgid bit means that new files are given the parent directory's group rather than the user's primary or current group. This can be useful for, say, a directory for fonts built by (and for) a given program. Any user might generate the fonts via a setgid command that writes to a setgid directory. setgid on directories varies by platform; check your documentation. To set the setgid bit, you can tell chmod to use g+s (gid set).
sticky
Although a file in a group or world-writeable directory without the sticky bit can be deleted by anyone with write permission for that directory (user, group, or other), a file in a directory with the sticky bit set can be deleted only by either the file's owner or root. This is particularly useful for creating temporary directories or scratch space that can be used by anyone without one's files being deleted by others. You can set permission +t in chmod to give something the sticky bit.
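The three special bits can be tested for directly in a file's mode. The following Python sketch uses the standard stat constants S_ISUID, S_ISGID, and S_ISVTX; the function name special_bits is hypothetical, chosen for this example.

```python
import stat


def special_bits(mode):
    """List which of the setuid, setgid, and sticky bits are set in mode."""
    names = []
    if mode & stat.S_ISUID:
        names.append("setuid")
    if mode & stat.S_ISGID:
        names.append("setgid")
    if mode & stat.S_ISVTX:
        names.append("sticky")
    return names
```

Passing os.stat("/tmp").st_mode would typically report the sticky bit on a system whose temporary directory is set up as described above, though that depends on local configuration.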
Numeric Permissions
Like almost everything else on UNIX, permissions have a number associated with them. It's generally considered that permissions are a group of four digits, each between 0 and 7. Each of those digits represents a group of three permissions, each of which is a yes/no answer. From left to right, those digits represent special permissions, user permissions, group permissions, and other permissions.
So, About Those Permission Bits...
Most programs reading permission bits expect four digits, although often only three are given. Shorter numbers are filled in with leading zeros: 222 is treated as 0222, and 5 is treated as 0005. The three rightmost digits are, as previously mentioned, user (owner) permissions, group permissions, and other permissions, from left to right.
Each of these digits is calculated in the following manner: read permission has a value of 4, write permission has a value of 2, and execute permission has a value of 1. Simply add these values together, and you've got that permission value. Read, write, and execute would be 7, read and write without execute would be 6, and no permission to do anything would be 0. Read, write, and execute for the file owner, with read and execute for the group and nothing at all for anyone else, would be 750. Read and write for the user and group, but only read for others, would be 664.
The special permissions are 4 for setuid, 2 for setgid, and 1 for sticky. This digit is prepended to the three-digit numeric permission: A temporary directory with sticky read, write, and execute permission for everyone would be mode 1777. A setuid root directory writeable by nobody else would be 4700. You can use chmod to set numeric permissions directly, as in chmod 1777 /tmp.
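The arithmetic above is simple enough to sketch in Python. The helper names triad_value and numeric_mode are invented for illustration, and the sketch assumes plain r, w, x, and hyphen characters only; it does not interpret the s and t letters that ls shows for special bits, which are passed separately as a digit.

```python
def triad_value(triad):
    """Value of one rwx-style group: read is 4, write is 2, execute is 1."""
    weights = {"r": 4, "w": 2, "x": 1}
    return sum(weights[letter] for letter in triad if letter != "-")


def numeric_mode(user, group, other, special=0):
    """Build the four-digit mode string from three triads and a special digit."""
    return "%d%d%d%d" % (
        special,             # 4 setuid, 2 setgid, 1 sticky, added together
        triad_value(user),
        triad_value(group),
        triad_value(other),
    )
```

For instance, numeric_mode("rwx", "r-x", "---") gives the string 0750, the same value worked out in the text, and adding special=1 to a fully open directory gives 1777.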
umask
In addition to a more precise use of chmod, numeric permissions are used with the umask command, which sets the default permissions. More precisely, it "masks" the default permissions: The umask value is subtracted from the maximum possible settings.* umask deals only with the three-digit permission, not the full-fledged four-digit value. A umask of 002 or 022 is most commonly the default. 022, subtracted from 777, is 755: read, write, and execute for the user, and read and execute for the group and others. 002 from 777 is 775: read, write, and execute for the user and group, and read and execute for others. I tend to set my umask to 077: read, write, and execute for myself, and nothing for my group or others. (Of course, when working on a group project, I set my umask to 007: My group and I can read, write, or execute anything, but others can't do anything with our files.)
You should note that the umask assumes that the execute bit on the file will be set. All umasks are subtracted from 777 rather than 666, and those extra ones are subtracted later, if necessary. (See Appendix B for more details on permission bits and umask workings.)
*Actually, the complement of the umask is ANDed with the maximum possible settings, if you're a computer science type.
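In bit terms, the umask clears permission bits rather than performing decimal subtraction, which matters when the starting point is 666 instead of 777. A minimal Python sketch, with apply_umask as a made-up name:

```python
def apply_umask(requested, umask):
    """Clear every permission bit in requested that is set in umask."""
    return requested & ~umask & 0o777


if __name__ == "__main__":
    # A directory is typically requested with 777, a plain file with 666.
    for mask in (0o022, 0o002, 0o077, 0o007):
        print("umask %03o: directory %03o, file %03o"
              % (mask, apply_umask(0o777, mask), apply_umask(0o666, mask)))
```

With a umask of 077, a file requested at 666 comes out as 600; naive subtraction would not give a sensible answer there, which is why the mask is best thought of as a set of bits to switch off.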
Also notice that the first bit of output prepended to the permissions string indicates the file type. This is one handy way of identifying a file's type. Another is the file command, as shown in Table 3.3.
Table 3.3 ls File Types and file Output Sample
File Type | ls File Type Character | file Display Example
Directory | d | [either:1 ~]file /usr
 | | /usr: directory
Block special device | b | [linux:10 ~] file /dev/hda1
 | | /dev/hda1: block special (3/1)
 | | [sun:10 root ~]file /dev/dsk/c0t0d0s0
 | | /dev/dsk/c0t0d0s0: block special (136/0)
Character special device | c | [linux:11 ~] file /dev/tty0
 | | /dev/tty0: character special (4/0)
 | | [ensis:11 ~]file /dev/rdsk/c0t0d0s0
 | | /dev/rdsk/c0t0d0s0: character special (136/0)
UNIX domain socket | s | [linux:12 ~] file /dev/log
 | | /dev/log: socket
 | | [sun:12 ~]file /dev/ccv
 | | /dev/ccv: socket
Named pipe special (FIFO device) | p | [linux:13 ~] file /dev/initctl
 | | /dev/initctl: fifo (named pipe)
 | | [sun:13 ~]file /etc/utmppipe
 | | /etc/utmppipe: fifo
Symbolic link | l | [linux:14 ~] file /usr/tmp
 | | /usr/tmp: symbolic link to ../var/tmp
 | | [sun:14 ~]file -h /usr/tmp
 | | /usr/tmp: symbolic link to ../var/tmp
Regular | - | [linux:15 ~] file /etc/passwd
 | | /etc/passwd: ASCII text
 | | [linux:15 ~] file /boot/vmlinux-2.4.2-2
 | | /boot/vmlinux-2.4.2-2: ELF 32-bit LSB executable, Intel 80386, version 1, statically linked, not stripped
 | | [linux:15 ~] file /etc/rc.d/init.d/sshd
 | | /etc/rc.d/init.d/sshd: Bourne-Again shell script text executable
 | | [sun:15 ~]file /etc/passwd
 | | /etc/passwd: ascii text
 | | [sun:15 ~]file /kernel/genunix
 | | /kernel/genunix: ELF 32-bit MSB relocatable SPARC Version 1
 | | [sun:15 ~]file /etc/init.d/sshd
 | | /etc/init.d/sshd: executable /sbin/sh script
Notice the in-depth information that file gives—in many cases, it shows details about the file that no other command will readily display (such as what kind of executable the file is). These low-level details are beyond the scope of our discussion, but the man page has more information.
Important Points about the file Command
file tries to figure out what type a file is based on three types of test:
The file type that the ls -l command returns.
The presence of a magic number at the beginning of the file identifying the file type. These numbers are defined in the file /usr/share/magic in Red Hat Linux 7.1 and /usr/lib/locale/locale/LC_MESSAGES/magic (or /etc/magic) in Solaris 8. Typically, only binary files will have magic numbers.
In the case of a regular/text file, the first few bytes are tested to determine the type of text representation and then to determine whether the file has a recognized purpose, such as C code or a Perl script.
file actually opens the file and changes the atime in the inode.
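The magic-number and text tests can be imitated in miniature. The Python sketch below is a toy under stated assumptions, not the real file command: it recognizes only the ELF magic number (the byte 0x7f followed by the letters ELF) and the #! prefix of scripts, and it calls everything else either ASCII text or data. The function names sniff and sniff_file are made up.

```python
def sniff(first_bytes):
    """Guess a file's kind from its leading bytes (a tiny subset of tests)."""
    if not first_bytes:
        return "empty"
    if first_bytes.startswith(b"\x7fELF"):
        return "ELF"        # magic number test
    if first_bytes.startswith(b"#!"):
        return "script"     # interpreter line at the top of a text file
    try:
        first_bytes.decode("ascii")
    except UnicodeDecodeError:
        return "data"       # not plain ASCII and no magic number we know
    return "ASCII text"


def sniff_file(path, count=64):
    """Read the first count bytes of path and classify them."""
    with open(path, "rb") as handle:  # opening and reading updates atime
        return sniff(handle.read(count))
```

Note that sniff_file has to open and read the file, which is exactly why the real command touches the atime as mentioned above.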
Inode lists are maintained by the filesystem itself, including which ones are free for use. Inode allocation and manipulation is all transparent to both sysadmins and users.
Inodes become significant at two times for the sysadmin: at filesystem creation time and when the filesystem runs out of free inodes. At filesystem creation time, the total number of inodes for the filesystem is allocated. Although they are not in use, space is set aside for them. You cannot add any more inodes to a filesystem after it has been created. When you run out of inodes, you must either free some up (by deleting or moving files) or migrate to another, larger filesystem.
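To see how close a filesystem is to running out of inodes, a program can ask the kernel for the totals. This Python sketch uses os.statvfs, which is available on UNIX-like systems; the function name inode_usage is an assumption for this example, and some filesystem types report zero for both figures.

```python
import os


def inode_usage(mount_point):
    """Return (total inodes, free inodes) for the filesystem at mount_point."""
    vfs = os.statvfs(mount_point)
    return vfs.f_files, vfs.f_ffree


if __name__ == "__main__":
    total, free = inode_usage("/")
    print("inodes: %d total, %d free" % (total, free))
```

The df command reports the same figures on many systems (for example, df -i on Linux), which is usually the quicker check from a shell.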
Without inodes, files are just a random assortment of ones and zeros on the disk. There is no guarantee that the file will be stored sequentially within a sector or track, so without an inode to point the way to the data blocks, the file is lost. In fact, every file is uniquely identified by the combination of its filesystem name and inode number.
See Appendix B for more detailed information on the exact content of inodes and their structure.
Linux has a very useful command called stat that dumps the contents of an inode in a tidy format:
[linux:9 ~]stat .
File: "."
Size: 16384 Filetype: Directory
Mode: (0755/drwxr-xr-x) Uid: (19529/ robin) Gid: (20/users)
Device: 0,4 Inode: 153288707 Links: 78
Access: Sun Jul 22 13:58:29 2001(00009.04:37:59)
Modify: Sun Jul 22 13:58:29 2001(00009.04:37:59)
Change: Sun Jul 22 13:58:29 2001(00009.04:37:59)
Boot Block and Superblock
When a filesystem is created, two structures are automatically created, whether they are immediately used or not. The first is called the boot block, where boot-time information is stored. Because a partition may be made bootable at will, this structure needs to be available at all times.
The other structure, of more interest here, is the superblock. Just as an inode contains meta-information about a file, a superblock contains meta-information about a filesystem. Some of the more critical contents are listed here:
Filesystem name
Filesystem size
Timestamp: last update
Superblock state flag
Filesystem state flag: clean, stable, active
Number of free blocks
List of free blocks
Pointer to next free block
Size of inode list
Number of free inodes
List of free inodes
Pointer to next free inode
Lock fields for free blocks and inodes
Summary data block
And you thought inodes were complex.
The superblock keeps track of free file blocks and free inodes so that the filesystem can store new files. Without these lists and pointers, a long, sequential search would have to be performed to find free space every time a file was created.
In much the same way that files without inodes are lost, filesystems without intact superblocks are inaccessible. That's why there is a superblock state flag—to indicate whether the superblock was properly and completely updated before the disk (or system) was last taken offline. If it was not, then a consistency check must be performed for the whole filesystem and the results stored back in the superblock.
Again, more detailed information about the superblock and its role in UNIX filesystems may be found in Appendix B.
Filesystem Types
Both Red Hat and Solaris recognize a multitude of different filesystem types, although you will generally end up using and supporting just a few. There are three standard types of filesystem—local, network, and pseudo—and a fourth "super-filesystem" type that is actually losing ground, given the size of modern disks.
Local Filesystems
Local filesystems are common to every system that has its own local disk. Although there are many instances of this type of filesystem, they are all designed to work within a system, managing the components discussed in the last section and interfacing with the physical drive(s).
Only a few local filesystems are specifically designed to be cross-platform (and sometimes even cross-OS-type). They come in handy, though, when you have a nondisk hardware failure; you can just take the disk and put it into another machine to retrieve the data. The UNIX File System, or ufs, was designed for this; both Solaris and Red Hat Linux machines can use disks with this filesystem. Note that Solaris uses ufs filesystems by default. Red Hat's default local filesystem is ext2.
Another local, cross-platform filesystem is ISO9660, the CD-ROM standard. This is why you can read your Solaris CD in a Red Hat box's reader.
Local filesystems come in two related but distinct flavors. The original, standard model filesystem is still in broad use today. The newer journaling filesystem type is just beginning to really come into its own. The major difference between the two types is the way they track changes and do integrity checks.
Standard Filesystems
Standard, nonjournaling filesystems rely on flags in the superblock for consistency regulation. If the superblock flag is not set to "clean," then the filesystem knows that it was not shut down properly: not all write buffers were flushed to disk, and so on. Inconsistency in a filesystem means that allocated inodes could be overwritten; free inodes could be counted as in use—in short, rampant file corruption, mass hysteria.
But there is a filesystem integrity checker to save the day: fsck. This command is usually invoked automatically at boot-time to verify that all filesystems are clean and stable. If the / or /usr filesystems are inconsistent, the system might prompt you to start up a miniroot shell and manually run fsck. A few of the more critical items checked and corrected are listed here:
Unclaimed blocks and inodes (not in free list or in use)
Unreferenced but allocated blocks and inodes
Multiply claimed blocks and inodes
Bad inode formats
Bad directory formats
Bad free block or inode list formats
Incorrect free block or inode counts
Superblock counts and flags
Note that a filesystem should be unmounted before running fsck (see the later section "Administering Local Filesystems"). Running fsck on a mounted filesystem might cause a system panic and crash, or it might simply refuse to run at all. It's also best, though not required, that you run fsck on the raw device, when possible. See the man page for more details and options.
So where does fsck put orphans, the blocks and inodes that are clearly in use but aren't referenced anywhere? Enter the lost+found directories. There is always a /lost+found directory on every system; other directories accrue them as fsck finds orphans in their purview. fsck automatically creates the directories as needed and relinks the orphaned files there, naming each one by its inode number. See the man pages "mklost+found" on Red Hat and "fsck_ufs" on Solaris.
Journaling Filesystems
Journaling filesystems do away with fsck and its concomitant superblock structures. All filesystem state information is internally tracked and monitored, in much the same way that database systems set up checkpoints and self-verifications.
With journaling filesystems, you have a better chance of full data recovery in the event of a system crash. Even unsaved data in buffers can be recovered thanks to the internal log. This kind of fault tolerance makes journaling filesystems useful in high-availability environments.
The drawback, of course, is that when a filesystem like this gets corrupted somehow, it presents major difficulties for recovery. Most journaling filesystems provide their own salvaging programs for use in case of emergency. This underscores how critical backups are, no matter what kind of filesystem software you've invested in. See Chapter 16, "Backups," for more information.
One of the earliest journaling filesystems is still a commercial venture: VxFS by Veritas. Another pioneer has decided to release its software as open source under GPL licensing: JFS by IBM. SGI's xfs journaling filesystem has been freely available under GPL since about 1999, although it is designed to work only under IRIX and Linux.
Maintenance of filesystem state incurs an overhead when using journaling filesystems. As a result, these filesystems perform suboptimally for small filesystem sizes. Generally, journaling filesystems are appropriate for filesystem sizes of 500MB or more.
Network Filesystems
Network-based filesystems are really add-ons to local filesystems because the file server must have the actual data stored in one of its own local filesystems. Network filesystems have both a server and a client program.
The server usually runs as a daemon on the system that is sharing disk space. The server's local filesystems are unaffected by this extra process. In fact, the daemon generally only puts a few messages in the syslog and is otherwise only visible through ps.
The system that wants to access the server's disk space runs the client program to mount the shared filesystems across the network. The client program handles all the I/O so that the network filesystem behaves just like a local filesystem toward the client machine.
The old standby for network-based filesystems is the Network File System (NFS). The NFS standard is currently up to revision 3, though there are quite a number of implementations with their own version numbers. Both Red Hat and Solaris come standard with NFS client and server packages. For more details on the inner workings and configuration of NFS, see Chapter 13, "File Sharing."
Other network-based filesystems include AFS (IBM's Andrew File System) and DFS/DCE (Distributed File System, part of the Open Group's Distributed Computing Environment). The mechanisms of these advanced filesystems go beyond the scope of this book, although their goal is still the same: to efficiently share files across the network transparently to the user.
Pseudo Filesystems
Pseudofilesystems are an interesting development in that they are not actually related to disk-based partitions. They are instead purely logical constructs that represent information and meta-information in a hierarchical structure. Because of this structure and because they can be manipulated with the mount command, they are still referred to as filesystems.
The best example of pseudofilesystems exists on both Red Hat and Solaris systems: /proc. Under Solaris, /proc is restricted to just managing process information:
[sun:1 ~]ls /proc
0 145 162 195 206 230 262 265 272 286 299 303 342 370 403 408 672 752
1 155 185 198 214 243 263 266 278 292 3 318 360 371 404 52 674
142 157 192 2 224 252 264 268 280 298 302 319 364 400 406 58 678
Note that these directories are all named according to the process numbers corresponding to what you would find in the output of ps. The contents of each directory are the various meta-information that the system needs to manage the process.
Under Red Hat, /proc provides information about processes as well as about various system components and statistics:
[linux:1 ~] ls /proc
1 18767 23156 24484 25567 28163 4 493 674 8453 ksyms stat
13557 18933 23157 24486 25600 3 405 5 675 9833 loadavg swaps
13560 18934 23158 24487 25602 3050 418 5037 676 9834 locks sys
13561 18937 23180 24512 25603 3051 427 5038 7386 9835 mdstat tty
1647 19709 23902 24541 25771 3052 441 5054 7387 bus meminfo uptime
1648 19730 23903 24775 25772 30709 455 5082 7388 cmdline misc version
1649 19732 23936 25494 25773 30710 473 510 7414 cpuinfo modules
16553 19733 24118 25503 25824 30712 485 5101 7636 devices mounts
18658 2 24119 25504 25882 30729 486 524 7637 dma mtrr
18660 21450 24120 25527 25920 320 487 558 7638 filesystems net
18661 21462 24144 25533 26070 335 488 6 7662 fs partitions
18684 21866 24274 25534 26071 337 489 670 8426 interrupts pci
18685 21869 24276 25541 26072 338 490 671 8427 ioports scsi
18686 21870 24277 25542 28161 339 491 672 8428 kcore self
18691 21954 24458 25543 28162 365 492 673 8429 kmsg slabinfo
Again we see the directories named for process numbers, but we also see directories with indicative names such as cpuinfo and loadavg. Because this is a hierarchical filesystem, you can cd into these directories and read the various files for their system information.
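To make the layout concrete, here is a minimal Python sketch that separates the process-number directories from the named entries and reads one of the named files. The function names (list_proc, read_entry, make_fake_proc) are our own inventions for illustration, and the stand-in tree exists only so that the sketch runs on a system without a Linux-style /proc.

```python
import os
import tempfile


def list_proc(root="/proc"):
    """Split the entries under root into process IDs and named entries."""
    pids, named = [], []
    for name in os.listdir(root):
        if name.isdigit():
            pids.append(int(name))   # directories such as 1, 493, 674
        else:
            named.append(name)       # entries such as cpuinfo, loadavg
    return sorted(pids), sorted(named)


def read_entry(root, name):
    """Return the text of one file under root, for example loadavg."""
    with open(os.path.join(root, name)) as handle:
        return handle.read().strip()


def make_fake_proc():
    """Build a tiny stand-in tree (hypothetical contents) for testing."""
    root = tempfile.mkdtemp()
    for pid in ("1", "493", "674"):
        os.mkdir(os.path.join(root, pid))
    with open(os.path.join(root, "loadavg"), "w") as handle:
        handle.write("0.10 0.05 0.01 1/42 674\n")
    return root
```

On a Red Hat system, calling list_proc() with no argument should return the live process numbers and names like those in the listing above; passing the path returned by make_fake_proc() exercises the same code anywhere.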
The most interesting thing about /proc is that it allows even processes to be treated like files.26 This means that pretty much everything in UNIX, whether it is something that just exists or something that actually happens, can now be considered a file.
For more information under Red Hat, type man proc. For more information under Solaris, type man -s 4 proc.
Logical Volumes
Finally, there are the "super-filesystems" or logical volumes that do what the other major types of filesystem cannot: surmount the barriers of partitions. You may well ask why anyone would want to do that. There are two reasons. First, because disks used to be a lot smaller and more costly, you used what you had at hand. If you needed a large pool of disk space, logical volumes allowed you to aggregate remnants into something useable. Second, even with larger disks, you still might not be able to achieve the kind of disk space required by a particular researcher or program. Once again, logical volumes allow you to aggregate partitions across disks to form one large filesystem.
Crossing disk boundaries with a logical volume is referred to as disk spanning. Once you have logical volumes, you can also have some fairly complex data management methods and performance-enhancing techniques. Disk striping, for example, is a performance booster. Instead of sequentially filling one disk and then the next in series, it spreads the data in discrete chunks across disks, allowing better I/O response through parallel operations.
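The difference between spanning and striping comes down to how a logical block number is mapped to a physical disk. The following Python sketch illustrates the two mappings; it is a simplification under our own assumptions (block-granular addressing, equal-sized disks for striping, and the hypothetical names spanned_location and striped_location), not the algorithm of any particular volume manager.

```python
def spanned_location(block, disk_sizes):
    """Spanning: fill each disk completely before moving to the next.

    Returns (disk index, block offset on that disk).
    """
    for disk, size in enumerate(disk_sizes):
        if block < size:
            return disk, block
        block -= size
    raise ValueError("block is beyond the end of the volume")


def striped_location(block, disk_count, chunk_blocks=1):
    """Striping: deal fixed-size chunks out to the disks in rotation.

    Returns (disk index, block offset on that disk).
    """
    chunk, within = divmod(block, chunk_blocks)   # which chunk, where in it
    stripe, disk = divmod(chunk, disk_count)      # which row, which disk
    return disk, stripe * chunk_blocks + within
```

With two four-block disks, spanning sends logical blocks 0 through 7 to disks 0, 0, 0, 0, 1, 1, 1, 1, whereas striping across four disks sends them to disks 0, 1, 2, 3, 0, 1, 2, 3. The second pattern is what lets several disks service one large request in parallel.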
RAID27 implements logical volumes at a number of distinct levels, with various features at each level (see Table 3.4). This implementation can be done either in hardware or in software, although the nomenclature for both is the same.28
Table 3.4 RAID Levels
| RAID Level | Features | Implications |
|---|---|---|
| 0 | Disk striping | Fastest; not self-repairing |
| 1 | Disk mirroring | Fast; self-repairing; requires extra drives for data duplication |
| 2 | Disk striping; error correction | Fast; self-repairing (very similar to RAID-3) |
| 3 | Disk striping; parity disk; error correction | Slower; self-repairing; requires separate parity disk |
| 4 | Disk striping; parity disk | Slower; self-repairing; requires separate parity disk (very similar to RAID-5) |
| 5 | Disk striping; rotating parity array | Slowest for writes, but good for reads; self-repairing; requires at least three disks, with parity rotated across all of them rather than held on a separate parity disk; reconstruction by parity data (not duplication) |
| 6 | RAID-5 + secondary parity scheme | Not in broad use |
| 7 | RAID-5 + real-time embedded controller | Not in broad use |
| 0+1 | Mirrored striping | RAID-0 array duplicated (mirrored) |
| 1+0 | Striped mirroring | Each stripe is RAID-1 (mirrored) array; high cost |
| 0+3 | Array of parity stripes | Each stripe is RAID-3 array; high cost |
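The phrase "reconstruction by parity data (not duplication)" deserves a quick illustration. The parity-based levels store the exclusive-OR (XOR) of the data blocks in a stripe; if any one block is lost, XORing the survivors with the parity yields the missing block. The Python sketch below shows only that arithmetic, with hypothetical function names and four-byte "blocks"; a real array works on much larger blocks and handles placement, rotation, and rebuild scheduling as well.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one lost block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data blocks in one stripe, plus the parity computed from them.
data = [b"unix", b"file", b"disk"]
parity = xor_blocks(data)

# Pretend the disk holding data[1] has failed, then rebuild its block.
recovered = rebuild_missing([data[0], data[2]], parity)
```

Here recovered is equal to b"file" even though that block was never duplicated anywhere, which is why the parity levels need far less extra disk space than mirroring does.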
Clearly, the kind of complexity inherent in all logical volume systems requires some kind of back-end management system. Red Hat offers the Logical Volume Manager (LVM) as a kernel module. While the details of LVM are beyond the scope of this book, it is interesting to note that you can put any filesystem that you want on top of the logical volume. Start at http://www.linuxdoc.org/HOWTO/LVM-HOWTO.html for more details.
Although Sun offers logical volume management, it is through a for-pay program called "Solstice DiskSuite." The filesystem on DiskSuite logical volumes must be ufs. For more information, start at http://docs.sun.com/ab2/coll.260.2/DISKSUITEREF.
Another commercial logical volume manager for Solaris comes from Veritas; see: http://www.veritas.com/us/products/volumemanager/faq.html#a24
The beauty of all logical volumes is that they appear to be just another local filesystem and are completely transparent to the user. However, logical volumes do add some complexity for the systems administrator, and the schema should be carefully documented on paper, in case it needs to be re-created.
NAS
Normally, a file server's disks are directly attached to the file server. With network-attached storage (NAS), the file server and the disks that it serves are separate entities, communicating over the local network. The storage disks require an aggregate controller that arbitrates file I/O requests from the external server(s). The server(s) and the aggregate controller each have distinct network IP addresses. To serve the files to clients, a file (or application) server sends file I/O requests to the NAS aggregate controller and relays the results back to client systems.
NAS is touched on here for completeness—entire books can be written about NAS design and implementation. NAS does not really represent a type of filesystem, but rather it is a mechanism to relieve the file server from the details of hardware disk access by isolating them in the network-attached storage unit.
Red Hat Filesystem Reference Table
Table 3.5 lists major filesystems that currently support (or are supported by) Red Hat.29 The filesystem types that are currently natively supported are listed in /usr/src/linux/fs/filesystems.c.
Table 3.5 Filesystem Types and Purposes, with Examples (Red Hat)
| Filesystem Type | Specific Instances (as Used in /etc/fstab) | Purpose |
|---|---|---|
| Local | ext2 | Red Hat default filesystem |
| | ufs | Solaris compatibility |
| | jfs | Journaling filesystem from IBM |
| | xfs | Journaling filesystem from SGI |
| | msdos | Windows compatibility: DOS |
| | ntfs | Windows compatibility: NT |
| | vfat | Windows compatibility: FAT-32 |
| | sysv | SYS-V compatibility |
| | iso9660 | CD-ROM |
| | adfs, affs, coda, devpts, efs, hfs, hpfs, minix, ncpfs, qnx4, romfs, smbfs, udf, umsdos | Others |
| | coherent, ext, xenix, xiafs | Deprecated, pre-kernel 2.1.21 |
| Network | afs, autofs, nfs | Network-based remote communication |
| Pseudo | proc | Store process (and other system) meta-information |
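To see how these type names show up in practice, here is a small Python sketch that reads fstab-style text and labels each entry with a category from Table 3.5. The sample entries (including the server:/export NFS source), the CATEGORIES dictionary, and the function names are all assumptions made for illustration; only a handful of types from the table are included, and anything else is reported as "other".

```python
# Subset of Table 3.5, keyed by category (illustrative, not exhaustive).
CATEGORIES = {
    "local": {"ext2", "ufs", "jfs", "xfs", "msdos", "ntfs", "vfat",
              "sysv", "iso9660"},
    "network": {"afs", "autofs", "nfs"},
    "pseudo": {"proc"},
}

# Hypothetical fstab contents: device, mount point, type, options, dump, pass.
SAMPLE_FSTAB = """\
# device        mount point  type     options    dump pass
/dev/hda1       /            ext2     defaults   1    1
/dev/cdrom      /mnt/cdrom   iso9660  noauto,ro  0    0
none            /proc        proc     defaults   0    0
server:/export  /home        nfs      defaults   0    0
"""


def classify(fs_type):
    """Return the Table 3.5 category for a type name, or 'other'."""
    for category, names in CATEGORIES.items():
        if fs_type in names:
            return category
    return "other"


def fstab_types(text):
    """Return (mount point, type, category) for each fstab entry."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        fields = line.split()
        if len(fields) < 3:
            continue                      # not a complete entry
        results.append((fields[1], fields[2], classify(fields[2])))
    return results
```

To try it against a real system, you could pass in the contents of /etc/fstab instead of SAMPLE_FSTAB.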
Solaris Filesystem Reference Table
Table 3.6 lists major filesystems that currently support (or are supported by) Solaris. The filesystem types that currently are natively supported are listed as directories under /usr/lib/fs.
Table 3.6 Filesystem Types and Purposes, with Examples (Solaris)
| Filesystem Type | Specific Instances (as Used in /etc/vfstab) | Purpose |
|---|---|---|
| Local | ufs | Solaris default filesystem; Red Hat-compatible |
| | pcfs | PC filesystem |
| | hsfs | CD-ROM |
| | jfs | Journaling filesystem from IBM |
| Network | afs, nfs | Network-based remote communication |
| Pseudo | procfs | Store process meta-information |
| | fdfs, swapfs, tmpfs, mntfs, cachefs, lofs, fifofs, specfs, udfs, namefs | Mount meta-information areas as filesystems |
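The Solaris /etc/vfstab file differs from the Red Hat /etc/fstab in that each entry has seven fields (device to mount, device to fsck, mount point, filesystem type, fsck pass, mount at boot, and mount options), with a lone hyphen marking a field that does not apply. The Python sketch below parses that layout. The sample entries, the VfstabEntry name, and the parse_vfstab function are hypothetical, chosen only to illustrate where the filesystem type sits in each line.

```python
from collections import namedtuple

VfstabEntry = namedtuple(
    "VfstabEntry",
    "device fsck_device mount_point fs_type fsck_pass mount_at_boot options",
)

# Hypothetical vfstab contents for illustration.
SAMPLE_VFSTAB = """\
#device            device              mount  FS     fsck  mount    mount
#to mount          to fsck             point  type   pass  at boot  options
/dev/dsk/c0t0d0s0  /dev/rdsk/c0t0d0s0  /      ufs    1     no       -
swap               -                   /tmp   tmpfs  -     yes      -
server:/export     -                   /home  nfs    -     yes      -
"""


def parse_vfstab(text):
    """Return a VfstabEntry for each complete, non-comment line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 7:
            continue                      # ignore malformed lines
        # A lone hyphen means "not applicable"; store it as None.
        entries.append(
            VfstabEntry(*(None if field == "-" else field for field in fields))
        )
    return entries
```

For example, [entry.fs_type for entry in parse_vfstab(SAMPLE_VFSTAB)] gives the filesystem types in the order they appear, which you can then compare against Table 3.6.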