Processing Text
Course Objectives Covered
-
Executing Commands at the Command Line (3036)
-
Common Command Line Tasks (3036)
-
Piping and Redirection (3036)
-
Creating, Viewing, and Appending Files (3036)
The simplest text processing utility of all is cat, a derivative of the word concatenate. By default, it will display the entire contents of a file on the screen (standard output). However, a number of useful options can be used with it, including the following:
-
-b to number nonblank lines (use -n to number every line)
-
-E to show a dollar sign ($) at the end of each line (the newline character)
-
-T to show all tabs as "^I"
-
-v to show nonprinting characters except tabs and newlines
-
-A to show the same as -v combined with -E and -T
To illustrate the uses of cat, assume that there is a four-line file named example with the following contents:
How much wood
could a woodchuck chuck
if a woodchuck
could chuck wood?
To view the contents of the file on the screen, exactly as they appear in the preceding example, the command is
cat example
To view the file with lines numbered, the command, and the output generated, will be
cat -b example
     1  How much wood
     2  could a woodchuck chuck
     3  if a woodchuck
     4  could chuck wood?
Note the inclusion of the tab characters that were not there before, but were added by the numbering process. They are not truly in the file, but only added to the display, as can be witnessed with the following command:
cat -Ab example
     1  How much wood$
     2  could a woodchuck chuck$
     3  if a woodchuck$
     4  could chuck wood?$
The only nonprintable characters within the file are the newlines at the end of each line, which appear as dollar signs.
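The options above can be tried on a small throwaway file. The following is a minimal sketch that assumes GNU cat (the -A option is a GNU extension); the file name sample and its contents are made up for illustration:

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Hypothetical sample: line one contains a real tab, line two is blank.
printf 'wood\tchuck\n\nlast line\n' > "$workdir/sample"

# -b numbers only the nonblank lines.
numbered=$(cat -b "$workdir/sample")
# -A marks tabs as ^I and the end of every line with $.
marked=$(cat -A "$workdir/sample")

printf '%s\n' "$numbered"
printf '%s\n' "$marked"

rm -rf "$workdir"
```

The tab stored in the file shows up as ^I, and the blank line appears as a lone dollar sign.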
One of the most common uses of the cat utility is to quickly create a text file. From the command line, you can specify no file at all to display and redirect the output to a given filename. This then accepts keyboard input and places it in the new file until the end-of-file character is received (the key sequence is Ctrl+D, by default).
The following example includes a dollar sign ($) prompt to show this operation in process:
$ cat > example
Peter Piper picked a peck of pickled peppers
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?
{press Ctrl+D}
$
The Ctrl+D sequence is pressed on a line by itself and signifies the end of the file. Viewing the contents of the directory (via the ls utility) will show that the file has now been created, and its contents can be viewed like this:
cat example
Note that the single redirection (>) creates a file named example if it did not exist before, and overwrites it if it did. To add to an existing file, use the append operator (>>).
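The difference between the two operators is easy to see without typing at the keyboard. This sketch uses printf in place of interactive input, and the path is a temporary one chosen for illustration:

```shell
workdir=$(mktemp -d)
notes="$workdir/example"

# > creates the file, or empties it first if it already exists.
printf 'first line\n' > "$notes"
printf 'replacement line\n' > "$notes"

# >> keeps what is already there and adds to the end.
printf 'appended line\n' >> "$notes"

contents=$(cat "$notes")
printf '%s\n' "$contents"

rm -rf "$workdir"
```

Only the replacement line and the appended line remain; the first line was lost to the second use of >.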
NOTE
The Ctrl+D keyboard sequence is the typical default for specifying an end-of-file operation. Like almost everything in Linux, this can be changed, customized, and so on. To see the settings for your session, use this command:
stty -a
and look for "eof = ".
There is a utility of use in limited circumstances, tac, which will display the contents of files in reverse order (tac is cat spelled in reverse). Instead of displaying a file from line 1 to the end of the file, it shows the file from the end of the file to line 1, as illustrated in the following example:
$ tac example
Where's the peck of pickled peppers Peter Piper picked?
If Peter Piper picked a peck of pickled peppers,
A peck of pickled peppers Peter Piper picked.
Peter Piper picked a peck of pickled peppers
$
nl, head, and tail
Three simple commands can be used to view all or parts of files: nl, head, and tail. The first, nl, is used to number the lines, and is similar to cat -b. Both will number the lines of display, and by default, neither will number blank lines. There are certain options that nl can utilize to alter the display:
-
-i allows you to change the increment (default is 1).
-
-v allows you to change the starting number (default 1).
-
-n changes the alignment of the display:
-
-nln aligns the display on the left.
-
-nrn aligns the display on the right.
-
-nrz uses leading zeros.
-
-s uses a specified string between the line number and the text (default is a tab).
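The options listed above can be combined. The following sketch numbers a made-up three-line input starting at 10, in steps of 5, with leading zeros and a colon separator:

```shell
# Hypothetical input: two text lines with a blank line between them.
numbered=$(printf 'alpha\n\nbeta\n' | nl -v 10 -i 5 -n rz -s ': ')

printf '%s\n' "$numbered"
```

The first line is numbered 000010 and the third 000015; the blank line in between is not numbered and does not use up a number.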
The second utility to examine is head. As the name implies, this utility is used to look at the top portion of a file: by default, the first 10 lines. You can change the number of lines displayed by using a dash followed by the number of lines to display (the -n option does the same thing, as in head -n 3). The following examples assume there is a text file named numbers with 200 lines in it counting from "one" to "two hundred":
$ head numbers
one
two
three
four
five
six
seven
eight
nine
ten
$
$ head -3 numbers
one
two
three
$
$ head -50 numbers
one
two
three
{skipping for space purposes}
forty-eight
forty-nine
fifty
$
NOTE
When printing multiple files, head places a header before each listing identifying what file it is displaying. The -q option suppresses the headers.
The tail command has several modes in which it can operate. By default, it is the opposite of head, and shows the end of file rather than the beginning. Once again, it defaults to the number 10 to display, but that can be changed by using the dash and a number:
$ tail numbers
one hundred ninety-one
one hundred ninety-two
one hundred ninety-three
one hundred ninety-four
one hundred ninety-five
one hundred ninety-six
one hundred ninety-seven
one hundred ninety-eight
one hundred ninety-nine
two hundred
$
$ tail -3 numbers
one hundred ninety-eight
one hundred ninety-nine
two hundred
$
$ tail -50 numbers
one hundred fifty-one
one hundred fifty-two
one hundred fifty-three
{skipping for space purposes}
one hundred ninety-eight
one hundred ninety-nine
two hundred
$
The tail utility goes beyond this functionality, however, by accepting a plus sign (+) in front of the number. This allows you to specify a starting line from which you will see the rest of the file. Older versions accepted the form tail +50 numbers; current versions expect the number to follow the -n option. For example
$ tail -n +50 numbers
This will start with line 50 (skipping the first 49) and display all the rest of the file, 151 lines in this case. Another useful option is -f, which allows you to follow a file. The command
$ tail -f numbers
will display the last 10 lines of the file, but then stay open (following the file) and display any new lines that are appended to the file. To break out of the endless monitoring loop, you must press the interrupt key sequence, which is Ctrl+C by default on most systems.
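The line arithmetic described above can be checked with a generated file. In this sketch, seq stands in for the numbers file by writing the digits 1 to 200 rather than the words, which is an assumption made to keep the example short:

```shell
workdir=$(mktemp -d)

# Stand-in for the numbers file: 200 lines holding 1 to 200.
seq 1 200 > "$workdir/numbers"

first_three=$(head -n 3 "$workdir/numbers")
last_three=$(tail -n 3 "$workdir/numbers")

# -n +50 starts the output at line 50, so 200 - 49 = 151 lines remain.
from_fifty=$(tail -n +50 "$workdir/numbers" | wc -l)

printf '%s\n' "$first_three" "$last_three"
printf 'lines from 50 onward: %s\n' "$from_fifty"

rm -rf "$workdir"
```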
NOTE
To find the interrupt key sequence for your session, use the command
stty -a
and look for "intr = ".
cut, paste, and join
The ability to separate columns that could constitute data fields from a file is provided by the cut utility. The default delimiter used is the tab, and the -f option is used to specify the desired field. For example, suppose there is a text file named august with three columns, looking like this:
one     two     three
four    five    six
seven   eight   nine
ten     eleven  twelve
Then the following command
cut -f2 august
will return
two
five
eight
eleven
However, the following example
cut -f1,3 august
will return the opposite:
one     three
four    six
seven   nine
ten     twelve
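To try the field selection above, the file must contain real tab characters between the columns. The following sketch builds a shortened, two-line version of august with printf (the file location is temporary and chosen for illustration):

```shell
workdir=$(mktemp -d)

# \t writes a literal tab, the default delimiter for cut.
printf 'one\ttwo\tthree\nfour\tfive\tsix\n' > "$workdir/august"

second=$(cut -f2 "$workdir/august")    # middle column only
outer=$(cut -f1,3 "$workdir/august")   # first and third columns

printf '%s\n' "$second"
printf '%s\n' "$outer"

rm -rf "$workdir"
```

When more than one field is selected, cut puts the delimiter (a tab here) back between them in the output.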
A number of options are available with this command; the two to be familiar with (besides -f) are -c and -d:
-
-c allows you to specify characters instead of fields.
-
-d allows you to specify a delimiter other than the tab.
To illustrate how to use the other options, the ls -l command will show: permissions, number of links, owner, group, size, date, and filename, all separated by whitespace, with three characters between the permissions and links. If you only want to see who is saving files in the directory, and are not interested in the other data, you can use
ls -l | cut -d" " -f5
This will ignore the permissions (first field), the two empty fields produced by the run of spaces (second and third fields), number of links (fourth field), and display the owner (fifth field), ignoring everything following. Because cut treats every single space as a delimiter, the field number depends on how ls pads its columns on your system. Another way to look at this is that with ls -l the permissions always take up 10 characters, followed by whitespace of 3 characters, then the number of links, and whitespace that follows. In that layout, the owner begins with the 16th character and continues for the length of the name. The command
ls -l | cut -c16
will return the 16th character, the first letter of the owner's name. If an assumption is made that most users will use eight characters or fewer for their name, the command
ls -l | cut -c16-24
will return those entries in the name field.
The name of the file begins with the 55th character, but it can be impossible to determine how many characters after that to take because some filenames will be considerably longer than others. A solution to this is to begin with the 55th character, and not specify an ending character (meaning that the entire rest of the line is taken), as in this example:
ls -l | cut -c55-
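Because the column positions in ls -l output differ from system to system, the -d and -c options are easier to study on fixed input. The sample strings in this sketch are made up for illustration:

```shell
# Hypothetical colon-delimited record, in the style of /etc/passwd.
record='jsmith:x:1001:1001:Jane Smith:/home/jsmith:/bin/bash'
letters='abcdefgh'

login=$(printf '%s\n' "$record" | cut -d: -f1)     # first colon-separated field
middle=$(printf '%s\n' "$letters" | cut -c3-5)     # characters 3 through 5
rest=$(printf '%s\n' "$letters" | cut -c6-)        # character 6 to end of line

# Every single space is a delimiter, so two spaces in a row create an
# empty second field and push "b" into the third.
third=$(printf 'a  b\n' | cut -d' ' -f3)

printf '%s\n' "$login" "$middle" "$rest" "$third"
```

The last command shows why counting fields in space-padded output such as ls -l is fragile: each extra space adds another empty field.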
Paste
Whereas the cut utility extracts fields from a file, they can be combined using either paste or join. The simplest of the two is paste: it has no great feature sets at all and merely takes one line from one source and combines it with another line from another source. For example, if the contents of fileone are
Indianapolis
Columbus
Peoria
Livingston
Scottsdale
And the contents of filetwo are
Indiana
Ohio
Illinois
Montana
Arizona
Then the following (including prompts) would be the display generated:
$ paste fileone filetwo
Indianapolis    Indiana
Columbus        Ohio
Peoria          Illinois
Livingston      Montana
Scottsdale      Arizona
$
If there were more lines in fileone than filetwo, the pasting would continue, but with blank entries following the tab. The tab character is always the default delimiter, but that can be changed to anything by using the -d option:
$ paste -d"," fileone filetwo
Indianapolis,Indiana
Columbus,Ohio
Peoria,Illinois
Livingston,Montana
Scottsdale,Arizona
$
You can also use the -s option to output all of fileone on a single line, followed by a newline and then all of filetwo:
$ paste -s fileone filetwo
Indianapolis    Columbus        Peoria  Livingston      Scottsdale
Indiana Ohio    Illinois        Montana Arizona
$
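The three forms of paste shown above can be reproduced with shortened, two-line versions of the files. This is a minimal sketch; the files are created in a temporary directory chosen for illustration:

```shell
workdir=$(mktemp -d)
printf 'Indianapolis\nColumbus\n' > "$workdir/fileone"
printf 'Indiana\nOhio\n' > "$workdir/filetwo"

# Default: matching lines side by side, separated by a tab.
side_by_side=$(paste "$workdir/fileone" "$workdir/filetwo")
# -d changes the separator.
with_commas=$(paste -d, "$workdir/fileone" "$workdir/filetwo")
# -s puts each whole file on one line of its own.
serial=$(paste -s "$workdir/fileone" "$workdir/filetwo")

printf '%s\n' "$side_by_side" "$with_commas" "$serial"

rm -rf "$workdir"
```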
Join
You can think of the join utility as a greatly enhanced version of paste. It is critically important, however, to know that the utility can only work if the files being joined share a common field. For example, if join were used in the same example as paste was earlier, the result would be
$ join fileone filetwo
$
In other words, there is no display. join must find a common field between the files in question and, by default, expects that common field to be the first. For example, assume that fileone now contains these entries:
11111 Indianapolis
22222 Columbus
33333 Peoria
44444 Livingston
55555 Scottsdale
And the contents of filetwo are
11111 Indiana 500 race
22222 Ohio Buckeye State
33333 Illinois Wrigley Field
44444 Montana Yellowstone Park
55555 Arizona Grand Canyon
Then the following (including prompts) would be the display generated:
$ join fileone filetwo
11111 Indianapolis Indiana 500 race
22222 Columbus Ohio Buckeye State
33333 Peoria Illinois Wrigley Field
44444 Livingston Montana Yellowstone Park
55555 Scottsdale Arizona Grand Canyon
$
The commonality of the first field was identified and the matching entries were combined. Whereas paste blindly took from each file to create the display, join will only combine lines whose join fields match exactly and, of critical importance, it reads both files in step, from top to bottom, and does not search back through the other file for a match. This point cannot be illustrated enough; for example, suppose filetwo had an additional line in the middle:
11111 Indiana 500 race
22222 Ohio Buckeye State
66666 Tennessee Smokey Mountains
33333 Illinois Wrigley Field
44444 Montana Yellowstone Park
55555 Arizona Grand Canyon
Then the following (including prompts) would be the display generated:
$ join fileone filetwo
11111 Indianapolis Indiana 500 race
22222 Columbus Ohio Buckeye State
$
As soon as the files fall out of order, matches are missed. join expects both files to be sorted on the join field and moves through them together; when it meets 66666 in filetwo, it skips past the smaller values remaining in fileone and reaches the end of that file without finding another match. Lines whose fields pair up along the way are incorporated in the display; otherwise they are not. To illustrate one more time, using the original filetwo:
$ tac filetwo > filethree
$ join fileone filethree
55555 Scottsdale Arizona Grand Canyon
$
Even though a match exists for every line in both files, only one match is found.
NOTE
It is highly recommended that you overcome problems with join by first sorting each of the files on the join field, so that both are in the same order.
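The recommendation above might look like the following sketch. The file names cities and states and their contents are made up for illustration, and LC_ALL=C is set on every command as an assumption, so that sort and join agree on the ordering:

```shell
workdir=$(mktemp -d)

# Two hypothetical files whose lines are deliberately out of order.
printf '33333 Peoria\n11111 Indianapolis\n22222 Columbus\n' > "$workdir/cities"
printf '22222 Ohio\n33333 Illinois\n11111 Indiana\n' > "$workdir/states"

# Sort each file on its first field, the default join field.
LC_ALL=C sort -k 1,1 "$workdir/cities" > "$workdir/cities.sorted"
LC_ALL=C sort -k 1,1 "$workdir/states" > "$workdir/states.sorted"

joined=$(LC_ALL=C join "$workdir/cities.sorted" "$workdir/states.sorted")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

With both files sorted, every line finds its partner.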
You don't have to keep the join defaults of matching on only the first field and of outputting all columns. The -1 option lets you specify what field to use as the matching field in fileone, whereas the -2 option lets you specify what field to use as the matching field in filetwo. For example, if the second field of fileone were to match with the third field of filetwo, the syntax would be
$ join -1 2 -2 3 fileone filetwo
The -o option is used to specify output fields in the format {file.field}. Thus to only print the second field of fileone and the third field of filetwo on matching lines, the syntax would be
$ join -o 1.2,2.3 fileone filetwo
Indianapolis 500
Columbus Buckeye
Peoria Wrigley
Livingston Yellowstone
Scottsdale Grand
$
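Both options can be tried on two-line versions of the files. In this sketch, bycity is a hypothetical third file that holds the same data as fileone with the columns swapped, so that the matching field is its second one:

```shell
workdir=$(mktemp -d)
printf '11111 Indianapolis\n22222 Columbus\n' > "$workdir/fileone"
printf '11111 Indiana 500 race\n22222 Ohio Buckeye State\n' > "$workdir/filetwo"
printf 'Indianapolis 11111\nColumbus 22222\n' > "$workdir/bycity"

# -o picks the output fields: field 2 of file 1 and field 3 of file 2.
picked=$(join -o 1.2,2.3 "$workdir/fileone" "$workdir/filetwo")

# -1 2 -2 1 matches field 2 of bycity against field 1 of filetwo.
# The join field is printed first, then the remaining fields of each file.
matched=$(join -1 2 -2 1 "$workdir/bycity" "$workdir/filetwo")

printf '%s\n' "$picked" "$matched"

rm -rf "$workdir"
```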
Sort, Count, Format, and Translate
It is often necessary to not only display text, but to manipulate and modify it a bit before the output is shown, or simply gather information on it. Four utilities are examined in this section: sort, wc, fmt, and tr.
sort
The sort utility sorts the lines of a file in alphabetical order, and displays the output. The importance of alphabetical order, versus any other, cannot be overstated. For example, assume that the fileone file contains the following lines:
Indianapolis Indiana
Columbus
Peoria
Livingston
Scottsdale
1
2
3
4
5
6
7
8
9
10
11
12
When a sort is done on the file, the result becomes
$ sort fileone
1
10
11
12
2
3
4
5
6
7
8
9
Columbus
Indianapolis Indiana
Livingston
Peoria
Scottsdale
$
The cities are "correctly" sorted in alphabetical order. The numbers, however, are also in alphabetical order, which puts every number starting with "1" before every number starting with "2," and then every number starting with "3," and so on.
Thankfully, the sort utility includes some options to add a great deal of flexibility to the output. Among those options are the following:
-
-d to sort in phone directory order (the same as that shown in the preceding example)
-
-f to sort lowercase letters the same as uppercase
-
-i to ignore nonprinting characters
-
-n to sort in numerical order versus alphabetical
-
-r to reverse the order of the output
Thus the display can be changed to
$ sort -n fileone
Columbus
Indianapolis Indiana
Livingston
Peoria
Scottsdale
1
2
3
4
5
6
7
8
9
10
11
12
$
NOTE
The sort utility treats all blank lines as part of the input and always places them at the beginning of the output. The -b option does not remove them; it only tells sort to ignore leading blanks (spaces and tabs) at the start of each line when comparing.
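The difference between alphabetical and numerical order can be seen on a few lines of made-up input. This sketch sets LC_ALL=C as an assumption, so that the ordering is plain byte order regardless of the local language settings:

```shell
# Hypothetical input: three numbers and one word.
sample=$(printf '10\n9\n2\nColumbus\n')

alphabetical=$(printf '%s\n' "$sample" | LC_ALL=C sort)
numeric=$(printf '%s\n' "$sample" | LC_ALL=C sort -n)
reversed=$(printf '%s\n' "$sample" | LC_ALL=C sort -rn)

printf '%s\n' "$alphabetical" "---" "$numeric" "---" "$reversed"
```

With -n, a line that does not start with a number is treated as zero, which is why Columbus moves ahead of the numbers.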
wc
The wc utility (named for "word count") displays information about the file in terms of three values: number of lines, words, and characters. The last entry in the output is the name of the file, thus the output would be
$ wc fileone
 17  18  86 fileone
$
You can choose to see only some of the output by using the following options:
-
-c to show only the number of bytes/characters
-
-l to see only the number of lines
-
-w to see only the number of words
In all cases, the name of the file still appears, for example
$ wc -l fileone
17 fileone
$
The only way to override the name appearing is by using the standard input redirection:
$ wc -l < fileone
17
$
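Redirecting standard input is what makes wc convenient inside scripts, because the result is then a bare number. A minimal sketch, using a two-line stand-in for fileone created in a temporary directory:

```shell
workdir=$(mktemp -d)
printf 'Indianapolis Indiana\nColumbus\n' > "$workdir/fileone"

# Given a file name, wc prints the name after the count.
with_name=$(wc -l "$workdir/fileone")

# Reading from standard input leaves only the number.
lines=$(wc -l < "$workdir/fileone")
words=$(wc -w < "$workdir/fileone")
bytes=$(wc -c < "$workdir/fileone")

printf '%s\n' "$with_name"
printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"

rm -rf "$workdir"
```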
fmt
The fmt utility formats the text by creating output to a specific width. The default width is 75 characters, but a different value can be specified with the -w option. Short lines are combined to create longer ones unless the -s option is used, and the existing spacing between words is kept unless -u is used. The -u option enforces uniformity and places one space between words and two spaces at the end of each sentence.
The following example shows how the fileone lines are combined to create a 75-character display:
$ fmt fileone Indianapolis Indiana Columbus Peoria Livingston Scottsdale 1 2 3 4 5 6 7 8 9 10 11 12 $
To change the output to 60 characters, use this example:
$ fmt -w60 fileone Indianapolis Indiana Columbus Peoria Livingston Scottsdale 1 2 3 4 5 6 7 8 9 10 11 12 $
NOTE
A bare number given as the first option to fmt is taken as the width, thus fmt -60 fileone will give the same result as fmt -w60 fileone.
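Where fmt breaks each line depends on the text, so this sketch checks only the guarantee that matters: no output line is wider than the requested width. The sample sentence is made up for illustration:

```shell
text='The fmt utility joins short lines and breaks long ones so that no output line is wider than the requested width.'

# Reflow the single long line to at most 30 characters per line.
wrapped=$(printf '%s\n' "$text" | fmt -w 30)

# Measure the longest line that fmt produced.
widest=$(printf '%s\n' "$wrapped" |
  awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }')

printf '%s\n' "$wrapped"
printf 'widest line: %s characters\n' "$widest"
```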
tr
The tr (translate) utility can convert one set of characters to another. Use the following example to change all lowercase characters to uppercase:
$ tr '[a-z]' '[A-Z]' < fileone
INDIANAPOLIS INDIANA
COLUMBUS
PEORIA
LIVINGSTON
SCOTTSDALE
$
NOTE
It is extremely important to realize that the syntax of tr only accepts two character sets, not the name of the file. You must feed the name of the file into the utility by directing input (as in the example given), by piping to it (|), or using a similar operation.
Not only can you give character sets as string options, but you can also specify a number of unique values, including
-
lower: All lowercase characters
-
upper: All uppercase characters
-
print: All printable characters
-
punct: Punctuation characters
-
space: All whitespace (blank can be used for horizontal whitespace only)
-
alnum: Alpha characters and numbers
-
digit: Numbers only
-
cntrl: Control characters
-
alpha: Letters only
-
graph: Printable characters but not whitespace
For example, the output shown earlier can also be obtained like this:
$ tr '[:lower:]' '[:upper:]' < fileone
INDIANAPOLIS INDIANA
COLUMBUS
PEORIA
LIVINGSTON
SCOTTSDALE
$
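Since tr reads only standard input, the two usual ways of feeding it are a pipe and input redirection. This sketch shows both on a made-up one-line file:

```shell
workdir=$(mktemp -d)
printf 'Peoria\n' > "$workdir/city"

# Input redirection: the shell opens the file and tr reads standard input.
from_file=$(tr '[:lower:]' '[:upper:]' < "$workdir/city")

# Pipe: the output of another command becomes the input of tr.
from_pipe=$(printf 'Livingston\n' | tr 'a-z' 'A-Z')

printf '%s\n' "$from_file" "$from_pipe"

rm -rf "$workdir"
```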
Other Useful Utilities
A number of other useful text utilities are included with Linux. Some of these have limited usefulness and are intended only for a specific purpose, but are given because knowing of their existence and purpose can make your life with Linux considerably easier.
In alphabetical order, the additional utilities are as follows:
-
expand: Allows you to expand tab characters into spaces. The default number of spaces per tab is 8, but this can be changed using the -t option. The opposite of this utility is unexpand.
-
file: This utility will look at an entry's signature and report what type of file it is (ASCII text, GIF image, and so on). The definitions it returns (and thus the files it can correctly identify) are defined in a file called magic. This file typically resides in /usr/share/misc or /etc.
-
more: Used to display only one screen of output at a time.
-
od: Can perform an octal dump to show the contents of files other than ASCII text files. Used with the -x option, it does a hexadecimal dump, and with the -c option, it shows only recognizable ASCII characters.
-
pr: Converts the file into a format suitable for printed pages including a default header with date and time of last modification, filename, and page numbers. The default header can be overwritten with the -h option, and the -l option allows you to specify the number of lines to include on each page (the default is 66). Default page width is 72 characters, but a different value can be specified with the -w option. The -d option can be used to double-space the output, and -m can be used to print numerous files in column format.
-
split: Chops a single file into multiple files. The default is that a new file is created for every 1,000 lines of the original file. Using the -b option, you can avoid the thousand-line splitting and specify a number of bytes to be put into each output file, or use -l to specify a number of lines.
-
uniq: This utility will examine entries in a file, comparing the current line with the one directly preceding it, to find lines that are unique. Because only adjacent lines are compared, the input is usually sorted first.
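Three of the utilities above lend themselves to a quick trial. The inputs in this sketch are made up for illustration, and part. is a hypothetical prefix chosen for the files that split writes:

```shell
workdir=$(mktemp -d)

# expand: the tab after "a" becomes spaces up to the next 4-column stop.
expanded=$(printf 'a\tb\n' | expand -t 4)

# uniq compares neighbouring lines only, so the last "a" survives here...
adjacent=$(printf 'a\na\nb\na\n' | uniq)
# ...but not when the input is sorted first.
sorted_unique=$(printf 'a\na\nb\na\n' | sort | uniq)

# split: five lines in pieces of two lines gives three output files.
seq 1 5 > "$workdir/five"
split -l 2 "$workdir/five" "$workdir/part."
pieces=$(ls "$workdir" | grep -c '^part\.')

printf '%s\n' "$expanded" "$adjacent" "$sorted_unique" "$pieces"

rm -rf "$workdir"
```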