Linux System Calls
- Using strace
- access: Testing File Permissions
- fcntl: Locks and Other File Operations
- fsync and fdatasync: Flushing Disk Buffers
- getrlimit and setrlimit: Resource Limits
- getrusage: Process Statistics
- gettimeofday: Wall-Clock Time
- The mlock Family: Locking Physical Memory
- mprotect: Setting Memory Permissions
- nanosleep: High-Precision Sleeping
- readlink: Reading Symbolic Links
- sendfile: Fast Data Transfers
- setitimer: Setting Interval Timers
- sysinfo: Obtaining System Statistics
- uname
So far, we've presented a variety of functions that your program can invoke to perform system-related tasks, such as parsing command-line options, manipulating processes, and mapping memory. If you look under the hood, you'll find that these functions fall into two categories, based on how they are implemented.
A library function is an ordinary function that resides in a library external to your program. Most of the library functions we've presented so far are in the standard C library, libc. For example, getopt_long and mkstemp are functions provided in the C library.
A call to a library function is just like any other function call. The arguments are placed in processor registers or onto the stack, and execution is transferred to the start of the function's code, which typically resides in a loaded shared library.
A system call is implemented in the Linux kernel. When a program makes a system call, the arguments are packaged up and handed to the kernel, which takes over execution of the program until the call completes. A system call isn't an ordinary function call, and a special procedure is required to transfer control to the kernel. However, the GNU C library (the implementation of the standard C library provided with GNU/Linux systems) wraps Linux system calls with functions so that you can call them easily. Low-level I/O functions such as open and read are examples of system calls on Linux.
The set of Linux system calls forms the most basic interface between programs and the Linux kernel. Each call presents a basic operation or capability.
Some system calls are very powerful and can exert great influence on the system. For instance, some system calls enable you to shut down the Linux system or to allocate system resources and prevent other users from accessing them. These calls have the restriction that only processes running with superuser privilege (programs run by the root account) can invoke them. These calls fail if invoked by a nonsuperuser process.
Note that a library function may invoke one or more other library functions or system calls as part of its implementation.
Linux provides a few hundred different system calls; the exact set depends on the kernel version and the processor architecture. A listing of system calls for your version of the Linux kernel is in /usr/include/asm/unistd.h. Some of these are for internal use by the system, and others are used only in implementing specialized library functions. In this chapter, we'll present a selection of system calls that are likely to be the most useful to application and system programmers.
Most of these system calls are declared in <unistd.h>.
8.1 Using strace
Before we start discussing system calls, it will be useful to present a command with which you can learn about and debug system calls. The strace command traces the execution of another program, listing any system calls the program makes and any signals it receives.
To watch the system calls and signals in a program, simply invoke strace, followed by the program and its command-line arguments. For example, to watch the system calls that are invoked by the hostname command, use this command:
% strace hostname
This produces a couple of screens of output. Each line corresponds to a single system call. For each call, the system call's name is listed, followed by its arguments (or abbreviated arguments, if they are very long) and its return value. Where possible, strace conveniently displays symbolic names instead of numerical values for arguments and return values, and it displays the fields of structures passed by a pointer into the system call. Note that strace does not show ordinary function calls.
In the output from strace hostname, the first line shows the execve system call that invokes the hostname program:
execve("/bin/hostname", ["hostname"], [/* 49 vars */]) = 0
The first argument is the name of the program to run; the second is its argument list, consisting of only a single element; and the third is its environment list, which strace omits for brevity. The next 30 or so lines are part of the mechanism that loads the standard C library from a shared library file.
Toward the end are system calls that actually help do the program's work. The uname system call is used to obtain the system's hostname from the kernel:
uname({sys="Linux", node="myhostname", ...}) = 0
Observe that strace helpfully labels the fields (sys and node) of the structure argument. This structure is filled in by the system call—Linux sets the sys field to the operating system name and the node field to the system's hostname. The uname call is discussed further in Section 8.15, "uname."
Finally, the write system call produces output. Recall that file descriptor 1 corresponds to standard output. The second argument is the data to write, the third argument is the number of bytes to write, and the return value is the number of bytes that were actually written.
write(1, "myhostname\n", 11) = 11
This line may appear garbled when you run strace because the output from the hostname program itself is mixed in with the output from strace.
If the program you're tracing produces lots of output, it is sometimes more convenient to redirect the output from strace into a file. Use the option -o filename to do this.
Understanding all the output from strace requires detailed familiarity with the design of the Linux kernel and execution environment. Much of this is of limited interest to application programmers. However, some understanding is useful for debugging tricky problems or understanding how other programs work.