Java Input/Output (I/O) Streams
In Java, all input and output is handled through input/output streams. These streams let you deal, in a uniform manner, with communications among various sources of data, such as files, network connections, or memory blocks. This chapter includes detailed coverage of the reader and writer classes that make it easy to deal with Unicode. It shows you what goes on under the hood when you use the object serialization mechanism, which makes saving and loading objects easy and convenient, and then moves on to regular expressions and working with files and paths.
In this chapter, we cover the Java Application Programming Interfaces (APIs) for input and output. You will learn how to access files and directories and how to read and write data in binary and text format. This chapter also shows you the object serialization mechanism that lets you store objects as easily as you can store text or numeric data. Next, we will turn to working with files and directories. We finish the chapter with a discussion of regular expressions, even though they are not actually related to input and output. We couldn’t find a better place to handle that topic, and apparently neither could the Java team—the regular expression API specification was attached to a specification request for “new I/O” features.
2.1 Input/Output Streams
In the Java API, an object from which we can read a sequence of bytes is called an input stream. An object to which we can write a sequence of bytes is called an output stream. These sources and destinations of byte sequences can be—and often are—files, but they can also be network connections and even blocks of memory. The abstract classes InputStream and OutputStream are the basis for a hierarchy of input/output (I/O) classes.
Byte-oriented input/output streams are inconvenient for processing information stored in Unicode (recall that Unicode uses multiple bytes per character). Therefore, a separate hierarchy provides classes, inheriting from the abstract Reader and Writer classes, for processing Unicode characters. These classes have read and write operations that are based on two-byte char values (that is, UTF-16 code units) rather than byte values.
2.1.1 Reading and Writing Bytes
The InputStream class has an abstract method:
abstract int read()
This method reads one byte and returns the byte that was read, or -1 if it encounters the end of the input source. The designer of a concrete input stream class overrides this method to provide useful functionality. For example, in the FileInputStream class, this method reads one byte from a file. System.in is a predefined object of a subclass of InputStream that allows you to read information from “standard input,” that is, the console or a redirected file.
The InputStream class also has nonabstract methods to read an array of bytes or to skip a number of bytes. Since Java 9, there is a very useful method to read all bytes of a stream:
byte[] bytes = in.readAllBytes();
There are also methods to read a given number of bytes—see the API notes.
These methods call the abstract read method, so subclasses need to override only one method.
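The contract of the read method can be sketched as follows. This is a minimal illustration, not code from the chapter: the class name ReadBytesDemo and the helper sum are made up for the example, and an in-memory ByteArrayInputStream stands in for a file or network source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ReadBytesDemo {
    // Sums the byte values by calling read() until it returns -1.
    static int sum(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            int total = 0;
            int b;
            while ((b = in.read()) != -1) {
                total += b; // b is in the range 0..255, never negative
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, (byte) 0xFF};
        System.out.println(sum(data)); // 1 + 2 + 255 = 258
    }
}
```

Note that read returns an int so that the byte value 0xFF (returned as 255) can be told apart from the end-of-input marker -1.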
Similarly, the OutputStream class defines the abstract method
abstract void write(int b)
which writes one byte to an output location.
If you have an array of bytes, you can write them all at once:
byte[] values = . . .; out.write(values);
The transferTo method transfers all bytes from an input stream to an output stream:
in.transferTo(out);
Both the read and write methods block until the byte is actually read or written. This means that if the input stream cannot immediately be accessed (usually because of a busy network connection), the current thread blocks. This gives other threads the chance to do useful work while the method is waiting for the input stream to become available again.
The available method lets you check the number of bytes that are currently available for reading. This means a fragment like the following is unlikely to block:
int bytesAvailable = in.available();
if (bytesAvailable > 0)
{
   var data = new byte[bytesAvailable];
   in.read(data);
}
When you have finished reading or writing to an input/output stream, close it by calling the close method. This call frees up the operating system resources that are in limited supply. If an application opens too many input/output streams without closing them, system resources can become depleted. Closing an output stream also flushes the buffer used for the output stream: Any bytes that were temporarily placed in a buffer so that they could be delivered as a larger packet are sent off. In particular, if you do not close a file, the last packet of bytes might never be delivered. You can also manually flush the output with the flush method.
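The effect of buffering, flushing, and closing can be observed with a small experiment. The sketch below is illustrative only: FlushDemo and sinkSizes are hypothetical names, and a ByteArrayOutputStream plays the role of the output destination so that its size can be inspected.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FlushDemo {
    // Returns the number of bytes that reached the sink
    // before flush, after flush, and after close.
    static int[] sinkSizes() {
        var sink = new ByteArrayOutputStream();
        int[] sizes = new int[3];
        try (var out = new BufferedOutputStream(sink)) {
            out.write(new byte[] {1, 2, 3});
            sizes[0] = sink.size(); // bytes are still held in the buffer
            out.flush();
            sizes[1] = sink.size(); // flush pushed them to the sink
            out.write(4);           // buffered again until close
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        sizes[2] = sink.size();     // try-with-resources closed (and flushed) out
        return sizes;
    }

    public static void main(String[] args) {
        int[] sizes = sinkSizes();
        System.out.println(sizes[0] + " " + sizes[1] + " " + sizes[2]);
    }
}
```

Using try-with-resources, as here, is a convenient way to make sure close is called even when an exception occurs.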
Even if an input/output stream class provides concrete methods to work with the raw read and write functions, application programmers rarely use them. The data that you are interested in probably contain numbers, strings, and objects, not raw bytes.
Instead of working with bytes, you can use one of many input/output classes that build upon the basic InputStream and OutputStream classes.
2.1.2 The Complete Stream Zoo
Unlike C, which gets by just fine with a single type FILE*, Java has a whole zoo of more than 60 (!) different input/output stream types (see Figures 2.1 and 2.2).
FIGURE 2.1 Input and output stream hierarchy
FIGURE 2.2 Reader and writer hierarchy
Let’s divide the animals in the input/output stream zoo by how they are used. There are separate hierarchies for classes that process bytes and characters. As you saw, the InputStream and OutputStream classes let you read and write individual bytes and arrays of bytes. These classes form the basis of the hierarchy shown in Figure 2.1. To read and write strings and numbers, you need more capable subclasses. For example, DataInputStream and DataOutputStream let you read and write all the primitive Java types in binary format. Finally, there are input/output streams that do useful stuff; for example, the ZipInputStream and ZipOutputStream let you read and write files in the familiar ZIP compression format.
For Unicode text, on the other hand, you can use subclasses of the abstract classes Reader and Writer (see Figure 2.2). The basic methods of the Reader and Writer classes are similar to those of InputStream and OutputStream.
abstract int read()
abstract void write(int c)
The read method returns either a UTF-16 code unit (as an integer between 0 and 65535) or -1 when you have reached the end of the file. The write method is called with a Unicode code unit. (See Volume I, Chapter 3 for a discussion of Unicode code units.)
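To see what "code unit" means in practice, consider the following sketch. It assumes an in-memory StringReader as the source; the names CodeUnitDemo and codeUnits are invented for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CodeUnitDemo {
    // Collects every value returned by Reader.read() until it returns -1.
    static List<Integer> codeUnits(String s) {
        var result = new ArrayList<Integer>();
        try (Reader in = new StringReader(s)) {
            int c;
            while ((c = in.read()) != -1) {
                result.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'A' followed by U+1D546, which needs two UTF-16 code units
        for (int unit : codeUnits("A\uD835\uDD46")) {
            System.out.println(Integer.toHexString(unit));
        }
    }
}
```

A character outside the basic multilingual plane is delivered as two separate read results, one per code unit.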
FIGURE 2.3 The Closeable, Flushable, Readable, and Appendable interfaces
There are four additional interfaces: Closeable, Flushable, Readable, and Appendable (see Figure 2.3). The first two interfaces are very simple, with methods
void close() throws IOException
and
void flush()
respectively. The classes InputStream, OutputStream, Reader, and Writer all implement the Closeable interface.
OutputStream and Writer implement the Flushable interface.
The Readable interface has a single method
int read(CharBuffer cb)
The CharBuffer class has methods for sequential and random read/write access. It represents an in-memory buffer or a memory-mapped file. (See Section 2.5.2, “The Buffer Data Structure,” on p. 132 for details.)
The Appendable interface has two methods for appending single characters and character sequences:
Appendable append(char c)
Appendable append(CharSequence s)
The CharSequence interface describes basic properties of a sequence of char values. It is implemented by String, CharBuffer, StringBuilder, and StringBuffer.
Of the four abstract base classes (InputStream, OutputStream, Reader, and Writer), only Writer implements Appendable.
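Programming against these interfaces lets one method serve several kinds of targets. Here is a small sketch; AppendableDemo and appendGreeting are hypothetical names chosen for the illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class AppendableDemo {
    // Works with any Appendable: a StringBuilder, a Writer, a CharBuffer, ...
    static void appendGreeting(Appendable target, CharSequence name) {
        try {
            target.append("Hello, ").append(name).append('!');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        var builder = new StringBuilder();
        appendGreeting(builder, "Harry");

        var writer = new StringWriter();
        appendGreeting(writer, new StringBuilder("Carl")); // any CharSequence will do

        System.out.println(builder);
        System.out.println(writer);
    }
}
```

The append methods of Appendable are declared to throw IOException, which is why the helper has to handle it even when the target is a StringBuilder.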
2.1.3 Combining Input/Output Stream Filters
FileInputStream and FileOutputStream give you input and output streams attached to a disk file. You need to pass the file name or full path name of the file to the constructor. For example,
var fin = new FileInputStream("employee.dat");
looks in the user directory for a file named employee.dat.
Like the abstract InputStream and OutputStream classes, these classes only support reading and writing at the byte level. That is, we can only read bytes and byte arrays from the object fin.
byte b = (byte) fin.read();
As you will see in the next section, if we just had a DataInputStream, we could read numeric types:
DataInputStream din = . . .;
double x = din.readDouble();
But just as the FileInputStream has no methods to read numeric types, the DataInputStream has no method to get data from a file.
Java uses a clever mechanism to separate two kinds of responsibilities. Some input streams (such as the FileInputStream and the input stream returned by the openStream method of the URL class) can retrieve bytes from files and other more exotic locations. Other input streams (such as the DataInputStream) can assemble bytes into more useful data types. The Java programmer has to combine the two. For example, to be able to read numbers from a file, first create a FileInputStream and then pass it to the constructor of a DataInputStream.
var fin = new FileInputStream("employee.dat");
var din = new DataInputStream(fin);
double x = din.readDouble();
If you look at Figure 2.1 again, you can see the classes FilterInputStream and FilterOutputStream. The subclasses of these classes are used to add capabilities to input/output streams that process bytes.
You can add multiple capabilities by nesting the filters. For example, by default, input streams are not buffered. That is, every call to read asks the operating system to dole out yet another byte. It is more efficient to request blocks of data instead and store them in a buffer. If you want buffering and the data input methods for a file, use the following rather monstrous sequence of constructors:
var din = new DataInputStream(
   new BufferedInputStream(
      new FileInputStream("employee.dat")));
Notice that we put the DataInputStream last in the chain of constructors because we want to use the DataInputStream methods, and we want them to use the buffered read method.
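The same layering works for output. The following sketch writes two values through a data/buffered chain and reads one back. It is an illustration under assumptions: byte array streams replace the employee.dat file so the example runs anywhere, and DataStreamDemo, write, and readSalary are made-up names.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataStreamDemo {
    // Writes a double and an int in binary form through a buffered chain.
    static byte[] write(double salary, int id) {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(new BufferedOutputStream(bytes))) {
            out.writeDouble(salary);
            out.writeInt(id);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // closing the outermost stream flushed the buffer
    }

    // Reads the leading double back through the mirror-image chain.
    static double readSalary(byte[] data) {
        try (var in = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(data)))) {
            return in.readDouble();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(75000, 42);
        System.out.println(data.length + " bytes, salary " + readSalary(data));
    }
}
```

Closing the outermost stream closes the whole chain, so only one variable needs to be managed by the try-with-resources statement.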
Sometimes you’ll need to keep track of the intermediate input streams when chaining them together. For example, when reading input, you often need to peek at the next byte to see if it is the value that you expect. Java provides the PushbackInputStream for this purpose.
var pbin = new PushbackInputStream(
   new BufferedInputStream(
      new FileInputStream("employee.dat")));
Now you can speculatively read the next byte
int b = pbin.read();
and throw it back if it isn’t what you wanted.
if (b != '<') pbin.unread(b);
However, reading and unreading are the only methods that apply to a pushback input stream. If you want to look ahead and also read numbers, then you need both a pushback input stream and a data input stream reference.
var din = new DataInputStream(
   pbin = new PushbackInputStream(
      new BufferedInputStream(
         new FileInputStream("employee.dat"))));
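A typical use of the pushback stream is to peek at the first byte and then process the whole input, including that byte. The sketch below shows this under assumptions: the input comes from a byte array, and the names PushbackDemo and describe are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PushbackDemo {
    // Peeks at the first byte to classify the input, then reads everything,
    // including the byte that was pushed back.
    static String describe(byte[] data) {
        try (var pbin = new PushbackInputStream(new ByteArrayInputStream(data))) {
            int b = pbin.read();
            if (b == -1) {
                return "empty";
            }
            pbin.unread(b); // put the byte back so that it is read again below
            String kind = (b == '<') ? "markup" : "text";
            String all = new String(pbin.readAllBytes(), StandardCharsets.US_ASCII);
            return kind + ":" + all;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<a>".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(describe("hi".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

By default, a PushbackInputStream can hold a single pushed-back byte; a larger pushback buffer can be requested in the constructor.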
Of course, in the input/output libraries of other programming languages, niceties such as buffering and lookahead are automatically taken care of—so it is a bit of a hassle to resort, in Java, to combining stream filters. However, the ability to mix and match filter classes to construct useful sequences of input/output streams does give you an immense amount of flexibility. For example, you can read numbers from a compressed ZIP file by using the following sequence of input streams (see Figure 2.4):
FIGURE 2.4 A sequence of filtered input streams
var zin = new ZipInputStream(new FileInputStream("employee.zip"));
var din = new DataInputStream(zin);
(See Section 2.2.3, “ZIP Archives,” on p. 85 for more on Java’s handling of ZIP files.)
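As a rough sketch of such a chain in action, the following example writes a number into a ZIP archive and reads it back. Everything is held in memory, and the entry name salary.dat as well as the names ZipDataDemo, zipDouble, and unzipDouble are assumptions made for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipDataDemo {
    // Stores one double in a single-entry ZIP archive.
    static byte[] zipDouble(double value) {
        var bytes = new ByteArrayOutputStream();
        try (var zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("salary.dat")); // hypothetical entry name
            var dout = new DataOutputStream(zout);
            dout.writeDouble(value);
            dout.flush();
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the double back from the first entry of the archive.
    static double unzipDouble(byte[] zip) {
        try (var zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            zin.getNextEntry(); // position the stream at the first entry
            return new DataInputStream(zin).readDouble();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(unzipDouble(zipDouble(75000)));
    }
}
```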
2.1.4 Text Input and Output
When saving data, you have the choice between binary and text formats. For example, if the integer 1234 is saved in binary, it is written as the sequence of bytes 00 00 04 D2 (in hexadecimal notation). In text format, it is saved as the string "1234". Although binary I/O is fast and efficient, it is not easily readable by humans. We first discuss text I/O and cover binary I/O in Section 2.2, “Reading and Writing Binary Data,” on p. 78.
When saving text strings, you need to consider the character encoding. In the UTF-16 encoding that Java uses internally, the string "José" is encoded as 00 4A 00 6F 00 73 00 E9 (in hex). However, many programs expect that text files use a different encoding. In UTF-8, the encoding most commonly used on the Internet, the string would be written as 4A 6F 73 C3 A9, without the zero bytes for the first three letters and with two bytes for the é character.
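You can verify these byte sequences yourself. The sketch below is only an illustration; EncodingDemo and hex are made-up names, and the é is written as the escape \u00E9 so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Formats the encoded bytes of a string as two-digit hex values.
    static String hex(String s, Charset charset) {
        var sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "Jos\u00E9";
        System.out.println(hex(name, StandardCharsets.UTF_16BE)); // 00 4A 00 6F 00 73 00 E9
        System.out.println(hex(name, StandardCharsets.UTF_8));    // 4A 6F 73 C3 A9
    }
}
```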
The OutputStreamWriter class turns an output stream of Unicode code units into a stream of bytes, using a chosen character encoding. Conversely, the InputStreamReader class turns an input stream that contains bytes (specifying characters in some character encoding) into a reader that emits Unicode code units.
For example, here is how you make an input reader that reads keystrokes from the console and converts them to Unicode:
var in = new InputStreamReader(System.in);
This input stream reader assumes the default character encoding used by the host system. On desktop operating systems, that can be an archaic encoding such as Windows 1252 or MacRoman. You should always choose a specific encoding in the constructor for the InputStreamReader, for example:
var in = new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8);
See Section 2.1.8, “Character Encodings,” on p. 75 for more information on character encodings.
The Reader and Writer classes have only basic methods to read and write individual characters. As with streams, you use subclasses for processing strings and numbers.
2.1.5 How to Write Text Output
For text output, use a PrintWriter. That class has methods to print strings and numbers in text format. In order to print to a file, construct a PrintWriter from a file name and a character encoding:
var out = new PrintWriter("employee.txt", StandardCharsets.UTF_8);
To write to a print writer, use the same print, println, and printf methods that you used with System.out. You can use these methods to print numbers (int, short, long, float, double), characters, boolean values, strings, and objects.
For example, consider this code:
String name = "Harry Hacker";
double salary = 75000;
out.print(name);
out.print(' ');
out.println(salary);
This writes the characters
Harry Hacker 75000.0
to the writer out. The characters are then converted to bytes and end up in the file employee.txt.
The println method adds the correct end-of-line character for the target system ("\r\n" on Windows, "\n" on UNIX) to the line. This is the string obtained by the call System.getProperty("line.separator").
If the writer is set to autoflush mode, all characters in the buffer are sent to their destination whenever println is called. (Print writers are always buffered.) By default, autoflushing is not enabled. You can enable or disable autoflushing by using the PrintWriter(Writer writer, boolean autoFlush) constructor:
var out = new PrintWriter(
   new OutputStreamWriter(
      new FileOutputStream("employee.txt"), StandardCharsets.UTF_8),
   true); // autoflush
The print methods don’t throw exceptions. You can call the checkError method to see if something went wrong with the output stream.
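The following sketch puts the print methods and checkError together. It is an illustration under assumptions: the output goes to a StringWriter rather than a file so that the result can be inspected, and PrintWriterDemo and format are hypothetical names.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class PrintWriterDemo {
    // Prints a name and a salary on one line and returns the resulting text.
    static String format(String name, double salary) {
        var sink = new StringWriter();
        try (var out = new PrintWriter(sink)) {
            out.print(name);
            out.print(' ');
            out.println(salary);
            // The print methods swallow I/O errors; checkError flushes and reports them.
            if (out.checkError()) {
                throw new IllegalStateException("Output failed");
            }
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(format("Harry Hacker", 75000));
    }
}
```

The returned string ends with the line separator of the platform on which the program runs.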
2.1.6 How to Read Text Input
The easiest way to process arbitrary text is the Scanner class that we used extensively in Volume I. You can construct a Scanner from any input stream.
Alternatively, you can read a short text file into a string like this:
var content = new String(Files.readAllBytes(path), charset);
But if you want the file as a sequence of lines, call
List<String> lines = Files.readAllLines(path, charset);
If the file is large, process the lines lazily as a Stream<String>:
try (Stream<String> lines = Files.lines(path, charset)) { . . . }
You can also use a scanner to read tokens—strings that are separated by a delimiter. The default delimiter is white space. You can change the delimiter to any regular expression. For example,
Scanner in = . . .;
in.useDelimiter("\\PL+");
treats any sequence of characters that are not Unicode letters as a delimiter. The scanner then yields tokens consisting only of Unicode letters.
Calling the next method yields the next token:
while (in.hasNext())
{
   String word = in.next();
   . . .
}
Alternatively, you can obtain a stream of all tokens as
Stream<String> words = in.tokens();
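Here is a small, self-contained sketch of the delimiter technique. It scans a string instead of a file, and the names TokenDemo and words are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenDemo {
    // Splits text into words, treating every run of non-letters as a delimiter.
    static List<String> words(String text) {
        var result = new ArrayList<String>();
        try (var in = new Scanner(text)) {
            in.useDelimiter("\\PL+");
            while (in.hasNext()) {
                result.add(in.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42 times")); // [Hello, world, times]
    }
}
```

Digits and punctuation are not letters, so the token 42 disappears together with the surrounding spaces.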
In early versions of Java, the only game in town for processing text input was the BufferedReader class. Its readLine method yields a line of text, or null when no more input is available. A typical input loop looks like this:
InputStream inputStream = . . .;
try (var in = new BufferedReader(new InputStreamReader(inputStream, charset)))
{
   String line;
   while ((line = in.readLine()) != null)
   {
      // do something with line
   }
}
Nowadays, the BufferedReader class also has a lines method that yields a Stream<String>. However, unlike a Scanner, a BufferedReader has no methods for reading numbers.
2.1.7 Saving Objects in Text Format
In this section, we walk you through an example program that stores an array of Employee records in a text file. Each record is stored in a separate line. Instance fields are separated from each other by delimiters. We use a vertical bar (|) as our delimiter. (A colon (:) is another popular choice. Part of the fun is that everyone uses a different delimiter.) Naturally, we punt on the issue of what might happen if a | actually occurs in one of the strings we save.
Here is a sample set of records:
Harry Hacker|35500|1989-10-01
Carl Cracker|75000|1987-12-15
Tony Tester|38000|1990-03-15
Writing records is simple. Since we write to a text file, we use the PrintWriter class. We simply write all fields, followed by either a | or, for the last field, a newline character. This work is done in the following writeEmployee method:
public static void writeEmployee(PrintWriter out, Employee e)
{
   out.println(e.getName() + "|" + e.getSalary() + "|" + e.getHireDay());
}
To read records, we read in a line at a time and separate the fields. We use a scanner to read each line and then split the line into tokens with the String.split method.
public static Employee readEmployee(Scanner in)
{
   String line = in.nextLine();
   String[] tokens = line.split("\\|");
   String name = tokens[0];
   double salary = Double.parseDouble(tokens[1]);
   LocalDate hireDate = LocalDate.parse(tokens[2]);
   int year = hireDate.getYear();
   int month = hireDate.getMonthValue();
   int day = hireDate.getDayOfMonth();
   return new Employee(name, salary, year, month, day);
}
The parameter of the split method is a regular expression describing the separator. We discuss regular expressions in more detail at the end of this chapter. As it happens, the vertical bar character has a special meaning in regular expressions, so it needs to be escaped with a \ character. That character needs to be escaped by another \, yielding the "\\|" expression.
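The escaping can be tried out in isolation. In the sketch below, SplitDemo, fields, and fieldsQuoted are hypothetical names; the second helper uses Pattern.quote, which is an alternative to the manual escape and not something the example program relies on.

```java
import java.util.regex.Pattern;

class SplitDemo {
    // Splits a record at vertical bars, escaping the bar by hand.
    static String[] fields(String line) {
        return line.split("\\|");
    }

    // Same result, letting Pattern.quote produce a literal pattern.
    static String[] fieldsQuoted(String line) {
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String record = "Harry Hacker|35500|1989-10-01";
        for (String field : fields(record)) {
            System.out.println(field);
        }
        System.out.println(fieldsQuoted(record).length);
    }
}
```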
The complete program is in Listing 2.1. The static method
void writeData(Employee[] e, PrintWriter out)
first writes the length of the array, then writes each record. The static method
Employee[] readData(Scanner in)
first reads in the length of the array, then reads in each record. This turns out to be a bit tricky:
int n = in.nextInt();
in.nextLine(); // consume newline
var employees = new Employee[n];
for (int i = 0; i < n; i++)
{
   employees[i] = readEmployee(in);
}
The call to nextInt reads the array length but not the trailing newline character. We must consume the newline so that the readEmployee method can get the next input line when it calls the nextLine method.
Listing 2.1 textFile/TextFileTest.java
package textFile;

import java.io.*;
import java.nio.charset.*;
import java.time.*;
import java.util.*;

/**
 * @version 1.15 2018-03-17
 * @author Cay Horstmann
 */
public class TextFileTest
{
   public static void main(String[] args) throws IOException
   {
      var staff = new Employee[3];

      staff[0] = new Employee("Carl Cracker", 75000, 1987, 12, 15);
      staff[1] = new Employee("Harry Hacker", 50000, 1989, 10, 1);
      staff[2] = new Employee("Tony Tester", 40000, 1990, 3, 15);

      // save all employee records to the file employee.dat
      try (var out = new PrintWriter("employee.dat", StandardCharsets.UTF_8))
      {
         writeData(staff, out);
      }

      // retrieve all records into a new array
      try (var in = new Scanner(
            new FileInputStream("employee.dat"), "UTF-8"))
      {
         Employee[] newStaff = readData(in);

         // print the newly read employee records
         for (Employee e : newStaff)
            System.out.println(e);
      }
   }

   /**
    * Writes all employees in an array to a print writer
    * @param employees an array of employees
    * @param out a print writer
    */
   private static void writeData(Employee[] employees, PrintWriter out)
         throws IOException
   {
      // write number of employees
      out.println(employees.length);

      for (Employee e : employees)
         writeEmployee(out, e);
   }

   /**
    * Reads an array of employees from a scanner
    * @param in the scanner
    * @return the array of employees
    */
   private static Employee[] readData(Scanner in)
   {
      // retrieve the array size
      int n = in.nextInt();
      in.nextLine(); // consume newline

      var employees = new Employee[n];
      for (int i = 0; i < n; i++)
      {
         employees[i] = readEmployee(in);
      }
      return employees;
   }

   /**
    * Writes employee data to a print writer
    * @param out the print writer
    * @param e the employee to write
    */
   public static void writeEmployee(PrintWriter out, Employee e)
   {
      out.println(e.getName() + "|" + e.getSalary() + "|" + e.getHireDay());
   }

   /**
    * Reads employee data from a scanner
    * @param in the scanner
    * @return the employee that was read
    */
   public static Employee readEmployee(Scanner in)
   {
      String line = in.nextLine();
      String[] tokens = line.split("\\|");
      String name = tokens[0];
      double salary = Double.parseDouble(tokens[1]);
      LocalDate hireDate = LocalDate.parse(tokens[2]);
      int year = hireDate.getYear();
      int month = hireDate.getMonthValue();
      int day = hireDate.getDayOfMonth();
      return new Employee(name, salary, year, month, day);
   }
}
2.1.8 Character Encodings
Input and output streams are for sequences of bytes, but in many cases you will work with texts—that is, sequences of characters. It then matters how characters are encoded into bytes.
Java uses the Unicode standard for characters. Each character, or “code point,” has a 21-bit integer number. There are different character encodings—methods for packaging those 21-bit numbers into bytes.
The most common encoding is UTF-8, which encodes each Unicode code point into a sequence of one to four bytes (see Table 2.1). UTF-8 has the advantage that the characters of the traditional ASCII character set, which contains all characters used in English, only take up one byte each.
Table 2.1 UTF-8 Encoding
Character Range | Encoding
0...7F | 0a₆a₅a₄a₃a₂a₁a₀
80...7FF | 110a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
800...FFFF | 1110a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
10000...10FFFF | 11110a₂₀a₁₉a₁₈ 10a₁₇a₁₆a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
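The bit patterns of the UTF-8 table translate directly into shifts and masks. The following sketch is for illustration only: Utf8Demo and encode are made-up names, the method does no validation (it does not reject surrogates or values above 10FFFF), and in real programs you would simply call String.getBytes with StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Encodes one code point by hand, following the four rows of the table.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0b11000000 | (cp >> 6)),
                (byte) (0b10000000 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0b11100000 | (cp >> 12)),
                (byte) (0b10000000 | ((cp >> 6) & 0x3F)),
                (byte) (0b10000000 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0b11110000 | (cp >> 18)),
            (byte) (0b10000000 | ((cp >> 12) & 0x3F)),
            (byte) (0b10000000 | ((cp >> 6) & 0x3F)),
            (byte) (0b10000000 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1D546};
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches library: "
                + Arrays.equals(encode(cp), expected));
        }
    }
}
```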
Another common encoding is UTF-16, which encodes each Unicode code point into one or two 16-bit values (see Table 2.2). This is the encoding used in Java strings. Actually, there are two forms of UTF-16, called “big-endian” and “little-endian.” Consider the 16-bit value 0x2122. In the big-endian format, the more significant byte comes first: 0x21 followed by 0x22. In the little-endian format, it is the other way around: 0x22 0x21. To indicate which of the two is used, a file can start with the “byte order mark,” the 16-bit quantity 0xFEFF. A reader can use this value to determine the byte order and then discard it.
Table 2.2 UTF-16 Encoding
Character Range | Encoding
0...FFFF | a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀a₉a₈ a₇a₆a₅a₄a₃a₂a₁a₀
10000...10FFFF | 110110b₁₉b₁₈ b₁₇b₁₆a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀ 110111a₉a₈ a₇a₆a₅a₄a₃a₂a₁a₀ where b₁₉b₁₈b₁₇b₁₆ = a₂₀a₁₉a₁₈a₁₇a₁₆ - 1
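The second row of the UTF-16 table amounts to subtracting 10000 (hex) and splitting the remaining 20 bits into two halves. Here is a sketch of that computation; Utf16Demo and encode are hypothetical names, and the library method Character.toChars does the same job in production code.

```java
import java.util.Arrays;

class Utf16Demo {
    // Encodes one code point as one or two UTF-16 code units.
    static char[] encode(int cp) {
        if (cp < 0x10000) {
            return new char[] {(char) cp};
        }
        int v = cp - 0x10000;                       // 20 bits: b19..b16 a15..a0
        char high = (char) (0xD800 | (v >> 10));    // 110110 followed by the top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));   // 110111 followed by the low 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int cp = 0x1D546;
        char[] units = encode(cp);
        System.out.println(Integer.toHexString(units[0]) + " " + Integer.toHexString(units[1]));
        System.out.println(Arrays.equals(units, Character.toChars(cp)));
    }
}
```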
In addition to the UTF encodings, there are partial encodings that cover a character range suitable for a given user population. For example, ISO 8859-1 is a one-byte code that includes accented characters used in Western European languages. Shift-JIS is a variable-length code for Japanese characters. A large number of these encodings are still in widespread use.
There is no reliable way to automatically detect the character encoding from a stream of bytes. Some API methods let you use the “default charset”—the character encoding preferred by the operating system of the computer. Is that the same encoding that is used by your source of bytes? These bytes may well originate from a different part of the world. Therefore, you should always explicitly specify the encoding. For example, when reading a web page, check the Content-Type header.
The StandardCharsets class has static variables of type Charset for the character encodings that every Java virtual machine must support:
StandardCharsets.UTF_8
StandardCharsets.UTF_16
StandardCharsets.UTF_16BE
StandardCharsets.UTF_16LE
StandardCharsets.ISO_8859_1
StandardCharsets.US_ASCII
To obtain the Charset for another encoding, use the static forName method:
Charset shiftJIS = Charset.forName("Shift-JIS");
Use the Charset object when reading or writing text. For example, you can turn an array of bytes into a string as
var str = new String(bytes, StandardCharsets.UTF_8);
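To see why the choice of Charset matters, compare what happens when the same bytes are decoded with the right and the wrong encoding. This is a minimal sketch; CharsetDemo and decode are made-up names, and non-ASCII characters are written as Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Decodes bytes with an explicitly chosen character encoding.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = "Jos\u00E9".getBytes(StandardCharsets.UTF_8); // 5 bytes

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4: the two bytes C3 A9 become one character
        System.out.println(wrong.length()); // 5: each byte is taken as a separate character
    }
}
```

With the wrong charset, no exception occurs; the text is silently garbled, which is why such errors can go unnoticed for a long time.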