1.6. Americanise—Files, Maps, and Closures
To have any practical use a programming language must provide some means of reading and writing external data. In previous sections we had a glimpse of Go’s versatile and powerful print functions from its fmt package; in this section we will look at Go’s basic file handling facilities. We will also look at some more advanced features such as Go’s treatment of functions and methods as first-class values which makes it possible to pass them as parameters. And in addition we will make use of Go’s map type (also known as a data dictionary or hash).
This section provides enough of the basics so that programs that read and write text files can be written—thus making the examples and exercises more interesting. Chapter 8 provides much more coverage of Go’s file handling facilities.
By about the middle of the twentieth century, American English surpassed British English as the most widely used form of English. In this section’s example we will review a program that reads a text file and writes out a copy of the file into a new file with any words using British spellings replaced with their U.S. counterparts. (This doesn’t help with differences in semantics or idioms, of course.) The program is in the file americanise/americanise.go, and we will review it top-down, starting with its imports, then its main() function, then the functions that main() calls, and so on.
All the americanise program’s imports are from Go’s standard library. Packages can be nested inside one another without formality, as the io package’s ioutil package and the path package’s filepath package illustrate.
The bufio package provides functions for buffered I/O, including ones for reading and writing strings from and to UTF-8 encoded text files. The io package provides low-level I/O functions, and the io.Reader and io.Writer interfaces we need for the americanise() function. The io/ioutil package provides high-level file handling functions. The regexp package provides powerful regular expression support. The other packages (fmt, log, filepath, and strings) have been mentioned in earlier sections.
The main() function gets the input and output filenames from the command line, creates corresponding file values, and then passes the files to the americanise() function to do the work.
The function begins by retrieving the names of the files to read and write and an error value. If there was a problem parsing the command line we print the error (which contains the program’s usage message), and terminate the program. Some of Go’s print functions use reflection (introspection) to print a value using the value’s Error() string method if it has one, or its String() string method if it has one, or as best they can otherwise. If we provide our own custom types with one of these methods, Go’s print functions will automatically be able to print values of our custom types, as we will see in Chapter 6.
If err is nil, we have inFilename and outFilename strings (which may be empty), and we can continue. Files in Go are represented by pointers to values of type os.File, and so we create two such variables initialized to the standard input and output streams (which are both of type *os.File). Since Go functions and methods can return multiple values it follows that Go supports multiple assignments such as the ones we have used here (30 ←, ➊, ➌).
Each filename is handled in essentially the same way. If the filename is empty the file has already been correctly set to os.Stdin or os.Stdout (both of which are of type *os.File, i.e., a pointer to an os.File value representing the file); but if the filename is nonempty we create a new *os.File to read from or write to the file as appropriate.
The os.Open() function takes a filename and returns an *os.File value that can be used for reading the file. Correspondingly, the os.Create() function takes a filename and returns an *os.File value that can be used for reading or writing the file, creating the file if it doesn’t exist and truncating it to zero length if it does exist. (Go also provides the os.OpenFile() function that can be used to exercise complete control over the mode and permissions used to open a file.)
In fact, the os.Open(), os.Create(), and os.OpenFile() functions return two values: an *os.File and nil if the file was opened successfully, or nil and an error if an error occurred.
If err is nil we know that the file was successfully opened so we immediately execute a defer statement to close the file. Any function that is the subject of a defer statement (§5.5, → 210) must be called—hence the parentheses after the functions’ names (30 ←, ➍, →)—but the calls only actually occur when the function in which the defer statements are written returns. So the defer statement “captures” the function call and sets it aside for later. This means that the defer statement itself takes almost no time at all and control immediately passes to the following statement. Thus, the deferred os.File.Close() method won’t actually be called until the enclosing function—in this case, main()—returns (whether normally or due to a panic, discussed in a moment), so the file is open to be worked on and yet guaranteed to be closed when we are finished with it, or if a panic occurs.
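The open, check, and defer pattern described above can be sketched as follows. This is a minimal illustration and not the americanise program’s own code: the helper names writeGreeting(), readFirstLine(), and roundTrip(), and the temporary filename, are assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

// writeGreeting is a hypothetical helper: it creates (or truncates) the
// named file and writes a single line to it.
func writeGreeting(filename string) error {
	file, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer file.Close() // registered now, executed when writeGreeting returns
	_, err = file.WriteString("hello\n")
	return err
}

// readFirstLine is a hypothetical helper: it opens the named file for
// reading and returns its first line, including the trailing newline.
func readFirstLine(filename string) (string, error) {
	file, err := os.Open(filename)
	if err != nil {
		return "", err
	}
	defer file.Close() // the file stays open until this function returns
	return bufio.NewReader(file).ReadString('\n')
}

// roundTrip writes a temporary file, reads it back, and removes it.
func roundTrip() (string, error) {
	filename := filepath.Join(os.TempDir(), "americanise-defer-demo.txt")
	defer os.Remove(filename)
	if err := writeGreeting(filename); err != nil {
		return "", err
	}
	return readFirstLine(filename)
}

func main() {
	text, err := roundTrip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(text)
}
```

Each defer statement sits immediately after the successful open, so the call to Close() cannot be forgotten on any later return path.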
If we fail to open the file we call log.Fatal() with the error. As we noted in a previous section, this function logs the date, time, and error (to os.Stderr unless another log destination is specified), and calls os.Exit() to terminate the program. When os.Exit() is called (directly, or by log.Fatal()), the program is terminated immediately—and any pending deferred statements are lost. This is not a problem, though, since Go’s runtime system will close any open files, the garbage collector will release the program’s memory, and any decent database or network that the application might have been talking to will detect the application’s demise and respond gracefully. Just the same as with the bigdigits example, we don’t use log.Fatal() in the first if statement (30 ←, ➋), because the err contains the program’s usage message and we want to print this without the date and time that the log.Fatal() function normally outputs.
In Go a panic is a runtime error (rather like an exception in other languages). We can cause panics ourselves using the built-in panic() function, and can stop a panic in its tracks using the recover() function (§5.5, → 210). In theory, Go’s panic/recover functionality can be used to provide a general-purpose exception handling mechanism, but doing so is considered to be poor Go practice. The Go way to handle errors is for functions and methods to return an error value as their sole or last return value (or nil if no error occurred), and for callers to always check the error they receive. The purpose of panic/recover is to deal with genuinely exceptional (i.e., unexpected) problems and not with normal errors.
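Purely to illustrate the mechanics of panic and recover, here is a small sketch. The safeIndex() function is hypothetical; ordinary Go code would check the index and return an error rather than rely on recover().

```go
package main

import "fmt"

// safeIndex is a hypothetical helper that turns an out-of-range panic
// into an ordinary error value by recovering in a deferred function.
func safeIndex(items []string, i int) (item string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return items[i], nil // panics if i is out of range
}

func main() {
	words := []string{"colour", "grey"}
	fmt.Println(safeIndex(words, 1))
	fmt.Println(safeIndex(words, 7))
}
```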
With both files successfully opened (the os.Stdin, os.Stdout, and os.Stderr files are automatically opened by the Go runtime system), we call the americanise() function to do the processing, passing it the files on which to work. If americanise() returns nil the main() function terminates normally and any deferred statements (in this case, ones that close the inFile and outFile if they are not os.Stdin and os.Stdout) are executed. And if err is not nil, the error is printed, the program is exited, and Go’s runtime system closes any open files.
The americanise() function accepts an io.Reader and an io.Writer, not *os.Files, but this doesn’t matter since the os.File type supports the io.ReadWriter interface (which simply aggregates the io.Reader and io.Writer interfaces) and can therefore be used wherever an io.Reader or an io.Writer is required. This is an example of duck typing in action—the americanise() function’s parameters are interfaces, so the function will accept any values—no matter what their types—that satisfy the interfaces, that is, any values that have the methods the interfaces specify. The americanise() function returns nil, or an error if an error occurred.
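As a small sketch of this flexibility, the hypothetical copyAll() function below is declared in terms of io.Reader and io.Writer, yet it works unchanged with a strings.Reader, a bytes.Buffer, and os.Stdout. The function and helper names are assumptions for illustration only.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"strings"
)

// copyAll is a hypothetical function whose parameters are interfaces, so
// it accepts any values that provide suitable Read() and Write() methods.
func copyAll(reader io.Reader, writer io.Writer) (int64, error) {
	return io.Copy(writer, reader)
}

// copyToString copies text through copyAll into an in-memory buffer.
func copyToString(text string) (string, error) {
	var buffer bytes.Buffer
	if _, err := copyAll(strings.NewReader(text), &buffer); err != nil {
		return "", err
	}
	return buffer.String(), nil
}

func main() {
	// An *os.File satisfies io.Writer, so os.Stdout can be passed directly.
	if _, err := copyAll(strings.NewReader("to stdout\n"), os.Stdout); err != nil {
		fmt.Println("error:", err)
	}
	text, err := copyToString("to a buffer")
	fmt.Println(text, err)
}
```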
The filenamesFromCommandLine() function returns two strings and an error value—and unlike the functions we have seen so far, here the return values are given variable names, not just types. Return variables are set to their zero values (empty strings and nil for err in this case) when the function is entered, and keep their zero values unless explicitly assigned to in the body of the function. (We will say a bit more on this topic when we discuss the americanise() function, next.)
The function begins by seeing if the user has asked for usage help. If they have, we create a new error value using the fmt.Errorf() function with a suitable usage string, and return immediately. As usual with Go code, the caller is expected to check the returned error and behave accordingly (and this is exactly what main() does). The fmt.Errorf() function is like the fmt.Printf() function we saw earlier, except that it returns an error value containing a string using the given format string and arguments rather than writing a string to os.Stdout. (The errors.New() function is used to create an error given a literal string.)
If the user did not request usage information we check to see if they entered any command-line arguments, and if they did we set the inFilename return variable to their first command-line argument and the outFilename return variable to their second command-line argument. Of course, they may have given no command-line arguments, in which case both inFilename and outFilename remain empty strings; or they may have entered just one, in which case inFilename will have a filename and outFilename will be empty.
At the end we do a simple sanity check to make sure that the user doesn’t overwrite the input file with the output file, exiting if necessary; but if all is well, we return. Functions or methods that return one or more values must have at least one return statement. It can be useful for clarity, and for godoc-generated documentation, to give variable names for return types, as we have done in this function. If a function or method has variable names as well as types listed for its return values, then a bare return is legal (i.e., a return statement that does not specify any variables). In such cases, the listed variables’ values are returned. We do not use bare returns in this book because they are considered to be poor Go style.
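The logic just described might look roughly like the sketch below. It is not the book’s listing: filenamesFromArgs() is a hypothetical variant that takes the argument slice as a parameter (which makes it easy to test), the help flags and usage text are assumptions, and it returns an error for the overwrite check instead of exiting.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// filenamesFromArgs is a hypothetical variant of the function discussed
// above. The named return values start out as "", "", and nil.
func filenamesFromArgs(args []string) (inFilename, outFilename string, err error) {
	if len(args) > 1 && (args[1] == "-h" || args[1] == "--help") {
		// The usage text is an assumption made for this sketch.
		err = fmt.Errorf("usage: %s [infile.txt [outfile.txt]]",
			filepath.Base(args[0]))
		return "", "", err
	}
	if len(args) > 1 {
		inFilename = args[1]
		if len(args) > 2 {
			outFilename = args[2]
		}
	}
	if inFilename != "" && inFilename == outFilename {
		return "", "", errors.New("won't overwrite the infile")
	}
	return inFilename, outFilename, nil // explicit, not a bare return
}

func main() {
	inFilename, outFilename, err := filenamesFromArgs(os.Args)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Printf("in=%q out=%q\n", inFilename, outFilename)
}
```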
Go takes a consistent approach to reading and writing data that allows us to read and write to files, to buffers (e.g., to slices of bytes or to strings), and to the standard input, output, and error streams—or to our own custom types—so long as they provide the methods necessary to satisfy the reading and writing interfaces.
For a value to be readable it must satisfy the io.Reader interface. This interface specifies a single method with signature, Read([]byte) (int, error). The Read() method reads data from the value it is called on and puts the data read into the given byte slice. It returns the number of bytes read and an error value which will be nil if no error occurred, or io.EOF (“end of file”) if no error occurred and the end of the input was reached, or some other non-nil value if an error occurred. Similarly, for a value to be writable it must satisfy the io.Writer interface. This interface specifies a single method with signature, Write([]byte) (int, error). The Write() method writes data from the given byte slice into the value the method was called on, and returns the number of bytes written and an error value (which will be nil if no error occurred).
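To make the two interfaces concrete, here is a sketch of custom types that satisfy them. The repeatReader and countingWriter types are hypothetical and exist only for this example; any type with methods of these signatures would do.

```go
package main

import (
	"fmt"
	"io"
)

// repeatReader is a hypothetical io.Reader that yields the byte b
// exactly n times and then reports io.EOF.
type repeatReader struct {
	b byte
	n int
}

func (r *repeatReader) Read(p []byte) (int, error) {
	if r.n == 0 {
		return 0, io.EOF
	}
	count := 0
	for count < len(p) && r.n > 0 {
		p[count] = r.b
		count++
		r.n--
	}
	return count, nil
}

// countingWriter is a hypothetical io.Writer that discards the data it
// is given but keeps a running total of the bytes written.
type countingWriter struct{ total int }

func (w *countingWriter) Write(p []byte) (int, error) {
	w.total += len(p)
	return len(p), nil
}

// countRepeated copies n bytes from a repeatReader to a countingWriter.
func countRepeated(n int) (int, error) {
	writer := &countingWriter{}
	if _, err := io.Copy(writer, &repeatReader{b: 'x', n: n}); err != nil {
		return 0, err
	}
	return writer.total, nil
}

func main() {
	fmt.Println(countRepeated(1000))
}
```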
The io package provides readers and writers but these are unbuffered and operate in terms of raw bytes. The bufio package provides buffered input/output where the input will work on any value that satisfies the io.Reader interface (i.e., provides a suitable Read() method), and the output will work on any value that satisfies the io.Writer interface (i.e., provides a suitable Write() method). The bufio package’s readers and writers provide buffering and can work in terms of bytes or strings, and so are ideal for reading and writing UTF-8 encoded text files.
The americanise() function buffers the inFile reader and the outFile writer. Then it reads lines from the buffered reader and writes each line to the buffered writer, having replaced any British English words with their U.S. equivalents.
The function begins by creating a buffered reader and a buffered writer through which their contents can be accessed as bytes, or more conveniently in this case, as strings. The bufio.NewReader() construction function takes as argument any value that satisfies the io.Reader interface (i.e., any value that has a suitable Read() method) and returns a new buffered io.Reader that reads from the given reader. The bufio.NewWriter() function is analogous: it takes any value that satisfies the io.Writer interface and returns a new buffered writer that writes to it. Notice that the americanise() function doesn’t know or care what it is reading from or writing to: the reader and writer could be compressed files, network connections, byte slices ([]byte), or anything else that supports the io.Reader and io.Writer interfaces.
This way of working with interfaces is very flexible and makes it easy to compose functionality in Go.
Next we create an anonymous deferred function that will flush the writer’s buffer before the americanise() function returns control to its caller. The anonymous function will be called when americanise() returns normally—or abnormally due to a panic. If no error has occurred and the writer’s buffer contains unwritten bytes, the bytes will be written before americanise() returns. Since it is possible that the flush will fail we set the err return value to the result of the writer.Flush() call. A less defensive approach would be to have a much simpler defer statement of defer writer.Flush() to ensure that the writer is flushed before the function returns and ignoring any error that might have occurred before the flush—or that occurs during the flush.
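The deferred flush can be sketched in isolation as follows. The writeAll() function, the failingWriter type, and the render() helper are hypothetical names; only the pattern of assigning the Flush() result to the named return value is taken from the description above.

```go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"io"
)

// writeAll is a hypothetical function that writes each line through a
// buffered writer and flushes the buffer in a deferred anonymous function.
func writeAll(out io.Writer, lines []string) (err error) {
	writer := bufio.NewWriter(out)
	defer func() {
		if err == nil { // only flush if nothing has gone wrong so far
			err = writer.Flush()
		}
	}()
	for _, line := range lines {
		if _, err = writer.WriteString(line + "\n"); err != nil {
			return err
		}
	}
	return nil
}

// failingWriter is a hypothetical io.Writer whose Write always fails, so
// the error only surfaces when the buffer is flushed.
type failingWriter struct{}

func (failingWriter) Write(p []byte) (int, error) {
	return 0, errors.New("write failed")
}

// render writes the lines into an in-memory buffer and returns the text.
func render(lines []string) (string, error) {
	var buffer bytes.Buffer
	err := writeAll(&buffer, lines)
	return buffer.String(), err
}

func main() {
	fmt.Println(render([]string{"color", "gray"}))
	fmt.Println(writeAll(failingWriter{}, []string{"lost"}))
}
```

Because err is a named return value, the deferred function can still change what the caller receives after the return statement has executed.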
Go allows the use of named return values, and we have taken advantage of this facility here (err error), just as we did previously in the filenamesFromCommandLine() function. Be aware, however, that there is a subtle scoping issue we must consider when using named return values. For example, if we have a named return value of value, we can assign to it anywhere in the function using the assignment operator (=) as we’d expect. However, if we have a statement such as if value := ..., because the if statement starts a new block, the value in the if statement will be a new variable, so the if statement’s value variable will shadow the return value variable. In the americanise() function, err is a named return value, so we have made sure that we never assign to it using the short variable declaration operator (:=) to avoid the risk of accidentally creating a shadow variable. One consequence of this is that we must declare the other variables we want to assign to at the same time, such as the replacer function (35 ←, ➊) and the line we read in (35 ←, ➋). An alternative approach is to avoid named return values and return the required value or values explicitly, as we have done elsewhere.
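The shadowing pitfall is easy to reproduce in a few lines. In this sketch both function names are hypothetical; shadowed() shows the mistake and assigned() shows the fix of declaring the extra variable first and then using plain assignment.

```go
package main

import (
	"fmt"
	"strconv"
)

// shadowed declares a new value inside the if statement with :=, so the
// named return value is never assigned and keeps its zero value.
func shadowed(text string) (value int) {
	if value, err := strconv.Atoi(text); err == nil {
		_ = value // this is the inner variable, not the return value
	}
	return value
}

// assigned declares err up front and uses =, so the named return value
// itself receives the converted number.
func assigned(text string) (value int) {
	var err error
	if value, err = strconv.Atoi(text); err != nil {
		return 0
	}
	return value
}

func main() {
	fmt.Println(shadowed("42"), assigned("42"))
}
```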
One other small point to note is that we have used the blank identifier, _ (35 ←, ➌). The blank identifier serves as a placeholder for where a variable is expected in an assignment, and discards any value it is given. The blank identifier is not considered to be a new variable, so if used with :=, at least one other (new) variable must be assigned to.
The Go standard library contains a powerful regular expression package called regexp (§3.6.5, ← 118). This package can be used to create pointers to regexp.Regexp values (i.e., of type *regexp.Regexp). These values provide many methods for searching and replacing. Here we have chosen to use the regexp.Regexp.ReplaceAllStringFunc() method which given a string and a “replacer” function with signature func(string) string, calls the replacer function for every match, passing in the matched text, and replacing the matched text with the text the replacer function returns.
If we had a very small replacer function, say, one that simply uppercased the words it matched, we could have created it as an anonymous function when we called the replacement function. For example:
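The book’s listing is not reproduced here, so what follows is a reconstruction under stated assumptions: the wordRx variable name, its pattern, and the upperWords() wrapper are hypothetical, while ReplaceAllStringFunc() and strings.ToUpper() are standard library calls.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordRx matches runs of English alphabetic characters (an assumed pattern).
var wordRx = regexp.MustCompile("[A-Za-z]+")

// upperWords passes an anonymous replacer function directly to
// ReplaceAllStringFunc; it is called once for every match.
func upperWords(line string) string {
	return wordRx.ReplaceAllStringFunc(line,
		func(word string) string { return strings.ToUpper(word) })
}

func main() {
	fmt.Println(upperWords("colour 42 grey"))
}
```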
However, the americanise program’s replacer function, although only a few lines long, requires some preparation, so we have created another function, makeReplacerFunction(), that given the name of a file that contains lines of original and replacement words, returns a replacer function that will perform the appropriate replacements.
If the makeReplacerFunction() returns a non-nil error, we return and the caller is expected to check the returned error and respond appropriately (as it does).
Regular expressions can be compiled using the regexp.Compile() function which returns a *regexp.Regexp and nil, or nil and an error if the regular expression is invalid. This is ideal for when the regular expression is read from an external source such as a file or received from the user. Here, though, we have used the regexp.MustCompile() function; this simply returns a *regexp.Regexp, or panics if the regular expression, or “regexp”, is invalid. The regular expression used in the example matches the longest possible sequence of one or more English alphabetic characters.
With the replacer function and the regular expression in place we start an infinite loop that begins by reading a line from the reader. The bufio.Reader.ReadString() method reads (or, strictly speaking, decodes) the underlying reader’s raw bytes as UTF-8 encoded text (which also works for 7-bit ASCII) up to and including the specified byte (or up to the end of the file). The method conveniently returns the text as a string, along with an error (or nil).
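The difference between the two compile functions can be sketched like this. The firstMatch() helper and the builtinRx variable are hypothetical names, and the pattern is an assumption based on the description above.

```go
package main

import (
	"fmt"
	"regexp"
)

// builtinRx is known at compile time, so MustCompile is appropriate: an
// invalid pattern here would be a programming error.
var builtinRx = regexp.MustCompile("[A-Za-z]+")

// firstMatch is a hypothetical helper for patterns that come from outside
// the program; it reports an invalid pattern as an ordinary error.
func firstMatch(pattern, text string) (string, error) {
	rx, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	return rx.FindString(text), nil
}

func main() {
	fmt.Println(builtinRx.FindString("42 colours"))
	fmt.Println(firstMatch("[0-9]+", "42 colours"))
	_, err := firstMatch("(", "42 colours") // unbalanced parenthesis
	fmt.Println(err != nil)
}
```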
If the error returned by the call to the bufio.Reader.ReadString() method is not nil, either we have reached the end of the input or we have hit a problem. At the end of the input err will be io.EOF which is perfectly okay, so in this case we set err to nil (since there isn’t really an error), and set eof to true to ensure that the loop finishes at the next iteration, so we won’t attempt to read beyond the end of the file. We don’t return as soon as we get io.EOF, since it is possible that the file’s last line doesn’t end with a newline, in which case we will have received a line to be processed, in addition to the io.EOF error.
For each line we call the regexp.Regexp.ReplaceAllStringFunc() method, giving it the line and the replacer function. We then try to write the (possibly modified) line to the writer using the bufio.Writer.WriteString() method. This method accepts a string and writes it out as a sequence of UTF-8 encoded bytes, returning the number of bytes written and an error (which will be nil if no error occurred). We don’t care how many bytes are written so we assign the number to the blank identifier, _. If err is not nil we return immediately, and the caller will receive the error.
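The read, transform, and write loop can be sketched as below. This is a simplified stand-in for americanise(), not the book’s listing: processLines() and upperCopy() are hypothetical names, the transformation is passed in as a parameter, and the writer is flushed explicitly at the end instead of in a deferred function.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// processLines is a hypothetical function that reads lines until io.EOF,
// applies transform to each one, and writes the result.
func processLines(in io.Reader, out io.Writer,
	transform func(string) string) error {
	reader := bufio.NewReader(in)
	writer := bufio.NewWriter(out)
	eof := false
	for !eof {
		line, err := reader.ReadString('\n')
		if err == io.EOF {
			eof = true // still process a final line that lacks a newline
		} else if err != nil {
			return err
		}
		if _, err = writer.WriteString(transform(line)); err != nil {
			return err
		}
	}
	return writer.Flush()
}

// upperCopy runs processLines over in-memory text, uppercasing each line.
func upperCopy(text string) (string, error) {
	var builder strings.Builder
	err := processLines(strings.NewReader(text), &builder, strings.ToUpper)
	return builder.String(), err
}

func main() {
	fmt.Println(upperCopy("one\ntwo"))
}
```

Note that the last line of the sample input has no trailing newline, which is exactly the case the eof flag exists to handle.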
Using bufio’s reader and writer as we have done here means that we can work with convenient high level string values, completely insulated from the raw bytes which represent the text on disk. And, of course, thanks to our deferred anonymous function, we know that any buffered bytes are written to the writer when the americanise() function returns, providing that no error has occurred.
The makeReplacerFunction() function takes the name of a file containing original and replacement strings and returns a function that given an original string returns its replacement, along with an error value. It expects the file to be a UTF-8 encoded text file with one original word and its replacement per line, separated by whitespace.
In addition to the bufio package’s readers and writers, Go’s io/ioutil package provides some high level convenience functions including the ioutil.ReadFile() function used here. This function reads and returns the entire file’s contents as raw bytes (in a []byte) and an error. As usual, if the error is not nil we immediately return it to the caller—along with a nil replacer function. If we read the bytes okay, we convert them to a string using a Go conversion of form type(variable). Converting UTF-8 bytes to a string is very cheap since Go’s strings use the UTF-8 encoding internally. (Go’s string conversions are covered in Chapter 3.)
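Reading a whole file and converting it to a string can be sketched as follows. The readText() and demo() helpers and the sample file contents are assumptions; ioutil.ReadFile() is used to match the text, although newer Go releases also offer the equivalent os.ReadFile().

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// readText is a hypothetical helper that reads an entire file and
// converts its raw bytes to a string.
func readText(filename string) (string, error) {
	rawBytes, err := ioutil.ReadFile(filename)
	if err != nil {
		return "", err
	}
	return string(rawBytes), nil // conversion of the form type(variable)
}

// demo writes a small sample file into a temporary directory, reads it
// back with readText, and cleans up afterwards.
func demo() (string, error) {
	dir, err := ioutil.TempDir("", "americanise")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	filename := filepath.Join(dir, "words.txt")
	sample := []byte("colour color\ngrey gray\n")
	if err := ioutil.WriteFile(filename, sample, 0600); err != nil {
		return "", err
	}
	return readText(filename)
}

func main() {
	text, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(text)
}
```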
The replacer function we want to create must accept a string and return a corresponding string, so what we need is a function that uses some kind of lookup table. Go’s built-in map collection data type is ideal for this purpose (§4.3, → 162). A map holds key–value pairs with very fast lookup by key. So here we will store British words as keys and their U.S. counterparts as values.
Go’s map, slice, and channel types are created using the built-in make() function. This creates a value of the specified type and returns a reference to it. The reference can be passed around (e.g., to other functions) and any changes made to the referred-to value are visible to all the code that accesses it. Here we have created an empty map called usForBritish, with string keys and string values.
With the map in place we then split the file’s text (which is in the form of a single long string) into lines, using the strings.Split() function. This function takes a string to split and a separator string to split on and does as many splits as possible. (If we want to limit the number of splits we can use the strings.SplitN() function.)
The iteration over the lines uses a for loop syntax that we haven’t seen before, this time using a range clause. This form can be conveniently used to iterate over a map’s keys and values, over a communication channel’s elements, or—as here—over a slice’s (or array’s) elements. When used on a slice (or array), the slice index and the element at that index are returned on each iteration, starting at index 0 (if the slice is nonempty). In this example we use the loop to iterate over all the lines, but since we don’t care about the index of each line we assign it to the blank identifier (_) which discards it.
We need to split each line into two: the original string and the replacement string. We could use the strings.Split() function but that would require us to specify an exact separator string, say, " ", which might fail on a hand-edited file where sometimes users accidentally put in more than one space, or sometimes use tabs. Fortunately, Go provides the strings.Fields() function which splits the string it is given on whitespace and is therefore much more forgiving of human-edited text.
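A quick comparison shows why strings.Fields() is the more forgiving choice. The splitCounts() helper is a hypothetical name used only to contrast the two functions on the same input.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCounts reports how many pieces each function produces for a line.
func splitCounts(line string) (bySplit, byFields int) {
	return len(strings.Split(line, " ")), len(strings.Fields(line))
}

func main() {
	// Two spaces: Split yields an extra empty string between them.
	fmt.Println(splitCounts("colour  color"))
	// A tab: Split on " " finds no separator at all.
	fmt.Println(splitCounts("colour\tcolor"))
}
```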
If the fields variable (of type []string) has exactly two elements we insert the corresponding key–value pair into the map. Once the map is populated we are ready to create the replacer function that we will return to the caller.
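Putting the last few steps together, populating the map might be sketched like this. The parseReplacements() function is a hypothetical stand-in that works on text already in memory; the usForBritish name comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseReplacements is a hypothetical helper that builds the lookup map
// from text holding one "british american" pair per line.
func parseReplacements(text string) map[string]string {
	usForBritish := make(map[string]string)
	lines := strings.Split(text, "\n")
	for _, line := range lines { // the index is discarded with _
		fields := strings.Fields(line)
		if len(fields) == 2 { // silently skip blank or malformed lines
			usForBritish[fields[0]] = fields[1]
		}
	}
	return usForBritish
}

func main() {
	text := "colour color\n\ngrey \t gray\nnot a valid line\n"
	usForBritish := parseReplacements(text)
	fmt.Println(len(usForBritish), usForBritish["grey"])
}
```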
We create the replacer function as an anonymous function given as an argument to the return statement—along with a nil error value. (Of course, we could have been less succinct and assigned the anonymous function to a variable and returned the variable.) The function has the exact signature required by the regexp.Regexp.ReplaceAllStringFunc() method that it will be passed to.
Inside the anonymous replacer function all we do is look up the given word. If we access a map element with one variable on the left-hand side, that variable is set to the corresponding value—or to the value type’s zero value if the given key isn’t in the map. If the map value type’s zero value is a legitimate value, then how can we tell if a given key is in the map? Go provides a syntax for this case—and that is generally useful if we simply want to know whether a particular key is in the map—which is to put two variables on the left-hand side, the first to accept the value and the second to accept a bool indicating if the key was found. In this example we use this second form inside an if statement that has a simple statement (a short variable declaration), and a condition (the found Boolean). So we retrieve the usWord (which will be an empty string if the given word isn’t a key in the map), and a found flag of type bool. If the British word was found we return the U.S. equivalent; otherwise we simply return the original word unchanged.
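The two lookup forms are easiest to tell apart when the zero value is itself meaningful. In this sketch the stock map and the describe() function are hypothetical; a count of 0 is a legitimate value, so only the two-variable form can distinguish it from a missing key.

```go
package main

import "fmt"

// describe uses the two-variable lookup form inside an if statement with
// a short variable declaration and a condition.
func describe(stock map[string]int, item string) string {
	if count, found := stock[item]; found {
		return fmt.Sprintf("%s: %d", item, count)
	}
	return item + ": not stocked"
}

func main() {
	stock := map[string]int{"apples": 0, "pears": 12}
	// The one-variable form gives 0 for both of these lookups.
	fmt.Println(stock["apples"], stock["plums"])
	fmt.Println(describe(stock, "apples"))
	fmt.Println(describe(stock, "plums"))
}
```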
There is a subtlety in the makeReplacerFunction() function that may not be immediately apparent. In the anonymous function created inside it we access the usForBritish map, yet this map was created outside the anonymous function. This works because Go supports closures (§5.6.3, → 223). A closure is a function that “captures” some external state—for example, the state of the function it is created inside, or at least any part of that state that the closure accesses. So here, the anonymous function that is created inside the makeReplacerFunction() is a closure that has captured the usForBritish map.
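A closure of this kind can be sketched on its own. Here makeReplacerFromMap() is a hypothetical simplification that receives the map as a parameter instead of building it from a file; the returned anonymous function captures the map and keeps using it after the enclosing function has returned.

```go
package main

import "fmt"

// makeReplacerFromMap returns a closure that has captured usForBritish.
func makeReplacerFromMap(usForBritish map[string]string) func(string) string {
	return func(word string) string {
		if usWord, found := usForBritish[word]; found {
			return usWord
		}
		return word // not a known British spelling, so leave it unchanged
	}
}

func main() {
	replacer := makeReplacerFromMap(map[string]string{
		"colour": "color",
		"grey":   "gray",
	})
	// makeReplacerFromMap has returned, yet the map is still reachable
	// through the closure.
	fmt.Println(replacer("colour"), replacer("grey"), replacer("hello"))
}
```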
Another subtlety is that the usForBritish map is a local variable and yet we will be accessing it outside the function in which it is declared. It is perfectly fine to return local variables in Go. Even if they are references or pointers, Go won’t delete them while they are in use and will garbage-collect them when they are finished with (i.e., when every variable that holds, refers, or points to them has gone out of scope).
This section has shown some basic low-level and high-level file handling functionality using os.Open(), os.Create(), and ioutil.ReadFile(). In Chapter 8 there is much more file handling coverage, including the writing and reading of text, binary, JSON, and XML files. Go’s built-in collection types (slices and maps) largely obviate the need for custom collection types while providing extremely good performance and great convenience. Go’s collection types are covered in Chapter 4. Go’s treatment of functions as first-class values in their own right and its support for closures makes it possible to use some advanced and very useful programming idioms. And Go’s defer statement makes it straightforward to avoid resource leakage.