Processing Input and Output
- 9.1 Input/Output Streams, Readers, and Writers
- 9.2 Paths, Files, and Directories
- 9.3 HTTP Connections
- 9.4 Regular Expressions
- 9.5 Serialization
- Exercises
In this chapter, you will learn how to work with files, directories, and web pages, and how to read and write data in binary and text format. You will also find a discussion of regular expressions, which can be useful for processing input. (I couldn’t think of a better place to handle that topic, and apparently neither could the Java developers—when the regular expression API specification was proposed, it was attached to the specification request for “new I/O” features.) Finally, this chapter shows you the object serialization mechanism that lets you store objects as easily as you can store text or numeric data.
The key points of this chapter are:
- An InputStream is a source of bytes, and an OutputStream is a destination for bytes.
- A Reader reads characters, and a Writer writes them. Be sure to specify a character encoding.
- The Files class has convenience methods for reading all bytes or lines of a file.
- The DataInput and DataOutput interfaces have methods for reading and writing numbers in binary format.
- Use a RandomAccessFile or a memory-mapped file for random access.
- A Path is an absolute or relative sequence of path components in a file system. Paths can be combined (or “resolved”).
- Use the methods of the Files class to copy, move, or delete files and to recursively walk through a directory tree.
- To read or update a ZIP file, use a ZIP file system.
- You can read the contents of a web page with the URL class. To read metadata or write data, use the URLConnection class.
- With the Pattern and Matcher classes, you can find all matches of a regular expression in a string, as well as the captured groups for each match.
- The serialization mechanism can save and restore any object implementing the Serializable interface, provided its instance variables are also serializable.
9.1 Input/Output Streams, Readers, and Writers
In the Java API, a source from which one can read bytes is called an input stream. The bytes can come from a file, a network connection, or an array in memory. (These streams are unrelated to the streams of Chapter 8.) Similarly, a destination for bytes is an output stream. In contrast, readers and writers consume and produce sequences of characters. In the following sections, you will learn how to read and write bytes and characters.
9.1.1 Obtaining Streams
The easiest way to obtain a stream from a file is with the static methods
InputStream in = Files.newInputStream(path);
OutputStream out = Files.newOutputStream(path);
Here, path is an instance of the Path class that is covered in Section 9.2.1, “Paths.” It describes a path in a file system.
If you have a URL, you can read its contents from the input stream returned by the openStream method of the URL class:
var url = new URL("https://horstmann.com/index.html");
InputStream in = url.openStream();
Section 9.3, “HTTP Connections,” shows how to send data to a web server.
The ByteArrayInputStream class lets you read from an array of bytes.
byte[] bytes = ...;
var in = new ByteArrayInputStream(bytes);
// Read from in
Conversely, to send output to a byte array, use a ByteArrayOutputStream:
var out = new ByteArrayOutputStream();
// Write to out
byte[] bytes = out.toByteArray();
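As a minimal sketch of the two byte array streams working together, the following example reads from an in-memory source and writes to an in-memory sink. The class name ByteArrayStreamsDemo and the method toUpperAscii are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Java API
class ByteArrayStreamsDemo {
    /** Copies bytes from an in-memory source to an in-memory sink, upper-casing ASCII letters. */
    static byte[] toUpperAscii(byte[] input) {
        var in = new ByteArrayInputStream(input);
        var out = new ByteArrayOutputStream();
        int b;
        // The byte array stream versions of read and write declare no IOException
        while ((b = in.read()) != -1) {
            out.write(b >= 'a' && b <= 'z' ? b - ('a' - 'A') : b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] result = toUpperAscii("stream-1".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(result, StandardCharsets.US_ASCII)); // STREAM-1
    }
}
```

Because everything stays in memory, this pattern is handy for testing code that expects an InputStream or OutputStream.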
9.1.2 Reading Bytes
The InputStream class has a method to read a single byte:
InputStream in = ...;
int b = in.read();
This method either returns the byte as an integer between 0 and 255, or returns -1 if the end of input has been reached.
More commonly, you will want to read the bytes in bulk. The most convenient method is the readAllBytes method that simply reads all bytes from the stream into a byte array:
byte[] bytes = in.readAllBytes();
If you want to read some, but not all bytes, provide a byte array and call the readNBytes method:
var bytes = new byte[len];
int bytesRead = in.readNBytes(bytes, offset, n);
The method reads until either n bytes are read or no further input is available, and returns the actual number of bytes read. If no input was available at all, the method returns 0.
Finally, you can skip bytes:
long bytesToSkip = ...;
in.skipNBytes(bytesToSkip);
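The return values of these methods are easy to mix up, so here is a small sketch that exercises them on an in-memory stream. The class ReadBytesDemo and its probe method are hypothetical. The sketch assumes Java 12 or later, since it calls skipNBytes. Note that readNBytes reports end of input as 0, whereas read reports it as -1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the Java API
class ReadBytesDemo {
    /** Returns { first byte, count of a bulk read, count of a bulk read at end, read() at end }. */
    static int[] probe(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            int first = in.read();                               // 0..255, or -1 at end of input
            in.skipNBytes(1);                                    // skips exactly one byte
            var buffer = new byte[8];
            int count = in.readNBytes(buffer, 0, buffer.length); // may be fewer than requested
            int atEnd = in.readNBytes(buffer, 0, buffer.length); // 0 once the input is exhausted
            int eof = in.read();                                 // -1 once the input is exhausted
            return new int[] { first, count, atEnd, eof };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probe(new byte[] { (byte) 0xFF, 1, 2, 3, 4 })));
        // [255, 3, 0, -1]
    }
}
```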
9.1.3 Writing Bytes
The write methods of an OutputStream can write individual bytes and byte arrays.
OutputStream out = ...;
int b = ...;
out.write(b);
byte[] bytes = ...;
out.write(bytes);
out.write(bytes, start, length);
When you are done writing to a stream, you must close it in order to commit any buffered output. This is best done with a try-with-resources statement:
try (OutputStream out = ...) {
    out.write(bytes);
}
If you need to copy an input stream to an output stream, use the InputStream.transferTo method:
try (InputStream in = ...; OutputStream out = ...) {
    in.transferTo(out);
}
Both streams need to be closed after the call to transferTo. It is best to use a try-with-resources statement, as in the code example.
To write a file to an OutputStream, call
Files.copy(path, out);
Conversely, to save an InputStream to a file, call
Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
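The following sketch combines the two directions: it saves an input stream to a file with Files.copy and then copies the file back into memory with transferTo. The class CopyDemo, the method roundTrip, and the temporary file prefix are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical demo class, not part of the Java API
class CopyDemo {
    /** Saves the data to a scratch file, then reads the file back through a stream. */
    static byte[] roundTrip(byte[] data) {
        try {
            Path path = Files.createTempFile("copy-demo", ".bin"); // scratch file for the demo
            try {
                // The temp file already exists, so REPLACE_EXISTING is required
                try (InputStream in = new ByteArrayInputStream(data)) {
                    Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
                }
                var out = new ByteArrayOutputStream();
                try (InputStream in = Files.newInputStream(path)) {
                    in.transferTo(out); // copies all remaining bytes
                }
                return out.toByteArray();
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new byte[] { 1, 2, 3 }))); // [1, 2, 3]
    }
}
```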
9.1.4 Character Encodings
Input and output streams are for sequences of bytes, but in many cases you will work with text, that is, sequences of characters. It then matters how characters are encoded into bytes.
Java uses the Unicode standard for characters. Each character or “code point” has a 21-bit integer number. There are different character encodings—methods for packaging those 21-bit numbers into bytes.
The most common encoding is UTF-8, which encodes each Unicode code point into a sequence of one to four bytes (see Table 9-1). UTF-8 has the advantage that the characters of the traditional ASCII character set, which contains all characters used in English, only take up one byte each.
Table 9-1 UTF-8 Encoding
| Character range | Encoding |
|---|---|
| 0...7F | 0a₆a₅a₄a₃a₂a₁a₀ |
| 80...7FF | 110a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀ |
| 800...FFFF | 1110a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀ |
| 10000...10FFFF | 11110a₂₀a₁₉a₁₈ 10a₁₇a₁₆a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀ |
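To make the UTF-8 table concrete, here is a brief sketch that asks the Java library how many bytes a code point needs and hand-encodes a code point from the second row (80...7FF). The class Utf8Demo and its methods are hypothetical names, and the hand encoder deliberately covers only that one row.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Java API
class Utf8Demo {
    /** Returns the number of bytes that UTF-8 uses for a single code point. */
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Hand-encodes a code point in the range 80...7FF (second table row only). */
    static byte[] encodeTwoBytes(int codePoint) {
        return new byte[] {
            (byte) (0b1100_0000 | (codePoint >> 6)),   // 110 followed by bits a10..a6
            (byte) (0b1000_0000 | (codePoint & 0x3F))  // 10 followed by bits a5..a0
        };
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 1 (LATIN CAPITAL LETTER A)
        System.out.println(utf8Length(0xE9));    // 2 (e with acute accent)
        System.out.println(utf8Length(0x20AC));  // 3 (euro sign)
        System.out.println(utf8Length(0x1F600)); // 4 (an emoji outside the 16-bit range)
        byte[] encoded = encodeTwoBytes(0xE9);
        System.out.printf("%02X %02X%n", encoded[0] & 0xFF, encoded[1] & 0xFF); // C3 A9
    }
}
```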
Another common encoding is UTF-16, which encodes each Unicode code point into one or two 16-bit values (see Table 9-2). This is the encoding used in Java strings. Actually, there are two forms of UTF-16, called “big-endian” and “little-endian.” Consider the 16-bit value 0x2122. In big-endian format, the more significant byte comes first: 0x21 followed by 0x22. In little-endian format, it is the other way around: 0x22 0x21. To indicate which of the two is used, a file can start with the “byte order mark,” the 16-bit quantity 0xFEFF. A reader can use this value to determine the byte order and discard it.
Table 9-2 UTF-16 Encoding
| Character range | Encoding |
|---|---|
| 0...FFFF | a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀a₉a₈a₇a₆a₅a₄a₃a₂a₁a₀ |
| 10000...10FFFF | 110110b₁₉b₁₈b₁₇b₁₆a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀ 110111a₉a₈a₇a₆a₅a₄a₃a₂a₁a₀ where b₁₉b₁₈b₁₇b₁₆ = a₂₀a₁₉a₁₈a₁₇a₁₆ - 1 |
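The second row of the UTF-16 table can be sketched in a few lines. Subtracting 0x10000 from the code point is what turns the five bits a₂₀...a₁₆ into the four bits b₁₉...b₁₆. The class Utf16Demo and the method toSurrogates are hypothetical; in practice, Character.toChars does the same job.

```java
// Hypothetical demo class, not part of the Java API
class Utf16Demo {
    /** Splits a code point in the range 10000...10FFFF into its two UTF-16 code units. */
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;              // leaves a 20-bit value
        char high = (char) (0xD800 | (offset >> 10));  // 110110 followed by the top 10 bits
        char low = (char) (0xDC00 | (offset & 0x3FF)); // 110111 followed by bits a9..a0
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = toSurrogates(0x1F600);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00
    }
}
```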
In addition to the UTF encodings, there are partial encodings that cover a character range suitable for a given user population. For example, ISO 8859-1 is a one-byte code that includes accented characters used in Western European languages. Shift_JIS is a variable-length code for Japanese characters. A large number of these encodings are still in widespread use.
There is no reliable way to automatically detect the character encoding from a stream of bytes. Some API methods let you use the “default charset”—the character encoding that is preferred by the operating system of the computer. Is that the same encoding that is used by your source of bytes? These bytes may well originate from a different part of the world. Therefore, you should always explicitly specify the encoding. For example, when reading a web page, check the Content-Type header.
The StandardCharsets class has static variables of type Charset for the character encodings that every Java virtual machine must support:
StandardCharsets.UTF_8
StandardCharsets.UTF_16
StandardCharsets.UTF_16BE
StandardCharsets.UTF_16LE
StandardCharsets.ISO_8859_1
StandardCharsets.US_ASCII
To obtain the Charset for another encoding, use the static forName method:
Charset shiftJIS = Charset.forName("Shift_JIS");
Use the Charset object when reading or writing text. For example, you can turn an array of bytes into a string as
var contents = new String(bytes, StandardCharsets.UTF_8);
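To see why the choice of Charset matters, consider this small sketch, which encodes a string as UTF-8 and then decodes the bytes with the wrong encoding. The class CharsetMismatchDemo and its method are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Java API
class CharsetMismatchDemo {
    /** Encodes the text as UTF-8, then (incorrectly) decodes the bytes as ISO 8859-1. */
    static String decodeUtf8AsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // e with acute accent, two bytes in UTF-8
        String garbled = decodeUtf8AsLatin1(original);
        System.out.println(garbled.length());         // 2: each byte became its own character
        System.out.println(garbled.equals(original)); // false
    }
}
```

ASCII text survives this mistake unchanged, which is why such bugs often go unnoticed until the first accented character shows up.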
9.1.5 Text Input
To read text input, use a Reader. You can obtain a Reader from any input stream with the InputStreamReader adapter:
InputStream inStream = ...;
var in = new InputStreamReader(inStream, charset);
If you want to process the input one UTF-16 code unit at a time, you can call the read method:
int ch = in.read();
The method returns a code unit between 0 and 65535, or -1 at the end of input.
That is not very convenient. Here are several alternatives.
With a short text file, you can read it into a string like this:
String content = Files.readString(path, charset);
But if you want the file as a sequence of lines, call
List<String> lines = Files.readAllLines(path, charset);
If the file is large, process the lines lazily as a Stream<String>:
try (Stream<String> lines = Files.lines(path, charset)) { ... }
To read numbers or words from a file, use a Scanner, as you have seen in Chapter 1. For example,
var in = new Scanner(path, StandardCharsets.UTF_8);
while (in.hasNextDouble()) {
    double value = in.nextDouble();
    ...
}
If your input does not come from a file, wrap the InputStream into a BufferedReader:
try (var reader = new BufferedReader(new InputStreamReader(url.openStream(), charset))) {
    Stream<String> lines = reader.lines();
    ...
}
A BufferedReader reads input in chunks for efficiency. (Oddly, this is not an option for basic readers.) It has methods readLine to read a single line and lines to yield a stream of lines.
If a method asks for a Reader and you want it to read from a file, call Files.newBufferedReader(path, charset).
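As a sketch of the BufferedReader approach for input that does not come from a file, the following example decodes an arbitrary InputStream as UTF-8 and collects its non-blank lines. The class LineReaderDemo and the method nonBlankLines are hypothetical, and the choice of UTF-8 is an assumption; use whatever encoding your source actually has.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the Java API
class LineReaderDemo {
    /** Reads all non-blank lines from a byte stream, decoding it as UTF-8. */
    static List<String> nonBlankLines(InputStream inStream) {
        try (var reader = new BufferedReader(
                new InputStreamReader(inStream, StandardCharsets.UTF_8))) {
            return reader.lines()
                .filter(line -> !line.isBlank())
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "first\n\nsecond\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(nonBlankLines(new ByteArrayInputStream(input))); // [first, second]
    }
}
```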
9.1.6 Text Output
To write text, use a Writer. With the write method, you can write strings. You can turn any output stream into a Writer:
OutputStream outStream = ...;
var out = new OutputStreamWriter(outStream, charset);
out.write(str);
To get a writer for a file, use
Writer out = Files.newBufferedWriter(path, charset);
It is more convenient to use a PrintWriter, which has the print, println, and printf methods that you have always used with System.out. Using those methods, you can print numbers and use formatted output.
If you write to a file, construct a PrintWriter like this:
var out = new PrintWriter(Files.newBufferedWriter(path, charset));
If you write to another stream, use
var out = new PrintWriter(new OutputStreamWriter(outStream, charset));
If you already have the text to write in a string, call
String content = ...; Files.write(path, content.getBytes(charset));
or
Files.write(path, lines, charset);
Here, lines can be a Collection<String>, or even more generally, an Iterable<? extends CharSequence>.
To append to a file, use
Files.write(path, content.getBytes(charset), StandardOpenOption.APPEND);
Files.write(path, lines, charset, StandardOpenOption.APPEND);
Sometimes, a library method wants a Writer to write output. If you want to capture that output in a string, hand it a StringWriter. Or, if it wants a PrintWriter, wrap the StringWriter like this:
var writer = new StringWriter();
throwable.printStackTrace(new PrintWriter(writer));
String stackTrace = writer.toString();
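Here is a short sketch of two of the techniques from this section: formatted output captured in a string, and writing followed by appending to a file. The class TextOutputDemo, its method names, and the temporary file are assumptions for illustration. Locale.ROOT is passed to printf so that the decimal separator does not depend on the machine's locale.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the Java API
class TextOutputDemo {
    /** Captures formatted PrintWriter output in a string. */
    static String formatLine(String name, double amount) {
        var writer = new StringWriter();
        try (var out = new PrintWriter(writer)) {
            out.printf(Locale.ROOT, "%s owes %.2f", name, amount);
        }
        return writer.toString();
    }

    /** Writes one line to a scratch file, appends another, and returns the file's lines. */
    static List<String> writeThenAppend(String first, String second) {
        try {
            Path path = Files.createTempFile("text-demo", ".txt"); // scratch file for the demo
            try {
                Files.write(path, List.of(first), StandardCharsets.UTF_8);
                Files.write(path, List.of(second), StandardCharsets.UTF_8,
                    StandardOpenOption.APPEND);
                return Files.readAllLines(path, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formatLine("Fred", 12.5));       // Fred owes 12.50
        System.out.println(writeThenAppend("one", "two"));  // [one, two]
    }
}
```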
9.1.7 Reading and Writing Binary Data
The DataInput interface declares the following methods for reading a number, a character, a boolean value, or a string in binary format:
byte readByte()
int readUnsignedByte()
char readChar()
short readShort()
int readUnsignedShort()
int readInt()
long readLong()
float readFloat()
double readDouble()
void readFully(byte[] b)
The DataOutput interface declares corresponding write methods.
The advantage of binary I/O is that it is fixed width and efficient. For example, writeInt always writes an integer as a big-endian 4-byte binary quantity regardless of the number of digits. The space needed is the same for each value of a given type, which speeds up random access. Also, reading binary data is faster than parsing text. The main drawback is that the resulting files cannot be easily inspected in a text editor.
You can use the DataInputStream and DataOutputStream adapters with any stream. For example,
DataInput in = new DataInputStream(Files.newInputStream(path));
DataOutput out = new DataOutputStream(Files.newOutputStream(path));
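The fixed-width property is easy to check with in-memory streams. In this sketch, an int and a double always occupy 4 + 8 = 12 bytes, and the int is stored in big-endian order. The class BinaryDataDemo and its methods are hypothetical names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the Java API
class BinaryDataDemo {
    /** Encodes an int and a double in binary format; the result always has 12 bytes. */
    static byte[] encode(int count, double price) {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            out.writeInt(count);    // 4 bytes, most significant byte first
            out.writeDouble(price); // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the values back in the same order and returns the double. */
    static double decodePrice(byte[] data) {
        try (var in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readInt(); // must consume the int before the double
            return in.readDouble();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode(1, 2.5);
        System.out.println(data.length);                                 // 12
        System.out.println(Arrays.toString(Arrays.copyOf(data, 4)));     // [0, 0, 0, 1]
        System.out.println(decodePrice(data));                           // 2.5
    }
}
```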
9.1.8 Random-Access Files
The RandomAccessFile class lets you read or write data anywhere in a file. You can open a random-access file either for reading only or for both reading and writing; specify the option by using the string "r" (for read access) or "rw" (for read/write access) as the second argument in the constructor. For example,
var file = new RandomAccessFile(path.toString(), "rw");
A random-access file has a file pointer that indicates the position of the next byte to be read or written. The seek method sets the file pointer to an arbitrary byte position within the file. The argument to seek is a long integer between zero and the length of the file (which you can obtain with the length method). The getFilePointer method returns the current position of the file pointer.
The RandomAccessFile class implements both the DataInput and DataOutput interfaces. To read and write numbers from a random-access file, use methods such as readInt/writeInt that you saw in the preceding section. For example,
int value = file.readInt();
file.seek(file.getFilePointer() - 4);
file.writeInt(value + 1);
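Because each int occupies exactly four bytes, the position of the i-th value can be computed directly. The following sketch updates one value in place; the class RandomAccessDemo, the method incrementAt, and the temporary file are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the Java API
class RandomAccessDemo {
    /** Writes the values to a scratch file, increments the one at index, and rereads it. */
    static int incrementAt(int[] values, int index) {
        try {
            Path path = Files.createTempFile("raf-demo", ".bin"); // scratch file for the demo
            try (var file = new RandomAccessFile(path.toString(), "rw")) {
                for (int v : values) {
                    file.writeInt(v);
                }
                long position = (long) index * Integer.BYTES; // each int occupies 4 bytes
                file.seek(position);
                int value = file.readInt();  // advances the file pointer by 4
                file.seek(position);         // move back before overwriting
                file.writeInt(value + 1);
                file.seek(position);
                return file.readInt();
            } finally {
                Files.deleteIfExists(path);  // runs after the file has been closed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(incrementAt(new int[] { 10, 20, 30 }, 1)); // 21
    }
}
```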
9.1.9 Memory-Mapped Files
Memory-mapped files provide another, very efficient approach for random access that works well for very large files. However, the API for data access is completely different from that of input/output streams. First, get a channel to the file:
FileChannel channel = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE);
Then, map an area of the file (or, if it is not too large, the entire file) into memory:
ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
Use methods get, getInt, getDouble, and so on to read values, and the equivalent put methods to write values.
int offset = ...;
int value = buffer.getInt(offset);
buffer.putInt(offset, value + 1);
At some point, and certainly when the channel is closed, these changes are written back to the file.
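The steps above can be put together as follows. This is a sketch under stated assumptions: the class MappedFileDemo and its method are hypothetical, the file is a small temporary file, and force is called to request that the change be written back before the file is reread. The temporary file is removed with deleteOnExit because some platforms refuse to delete a file while a mapping is still live.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of the Java API
class MappedFileDemo {
    /** Stores an int in a scratch file, increments it through a mapping, and rereads the file. */
    static int incrementMapped(int initial) {
        try {
            Path path = Files.createTempFile("mmap-demo", ".bin"); // scratch file for the demo
            path.toFile().deleteOnExit();
            Files.write(path, ByteBuffer.allocate(Integer.BYTES).putInt(initial).array());
            try (FileChannel channel = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
                int value = buffer.getInt(0);  // absolute read at offset 0
                buffer.putInt(0, value + 1);   // absolute write at offset 0
                buffer.force();                // ask for the change to be written to the file
            }
            return ByteBuffer.wrap(Files.readAllBytes(path)).getInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(incrementMapped(41)); // 42
    }
}
```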
9.1.10 File Locking
When multiple simultaneously executing programs modify the same file, they need to communicate in some way, or the file can easily become damaged. File locks can solve this problem.
Suppose your application saves a configuration file with user preferences. If a user invokes two instances of the application, it could happen that both of them want to write the configuration file at the same time. In that situation, the first instance should lock the file. When the second instance finds the file locked, it can decide to wait until the file is unlocked or simply skip the writing process. To lock a file, call either the lock or tryLock methods of the FileChannel class.
FileChannel channel = FileChannel.open(path, StandardOpenOption.WRITE);
FileLock lock = channel.lock();
or
FileLock lock = channel.tryLock();
The first call blocks until the lock becomes available. The second call returns immediately, either with the lock or with null if the lock is not available. The file remains locked until the lock or the channel is closed. It is best to use a try-with-resources statement:
try (FileLock lock = channel.lock()) {
    ...
}
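A nonblocking variant might look like the sketch below, which uses tryLock inside a try-with-resources statement. The class FileLockDemo, the method lockScratchFile, and the temporary file are hypothetical. The sketch assumes that the underlying file system supports locking, which is not guaranteed everywhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of the Java API
class FileLockDemo {
    /** Tries to lock a scratch file and reports whether a valid lock was obtained. */
    static boolean lockScratchFile() {
        try {
            Path path = Files.createTempFile("lock-demo", ".cfg"); // scratch file for the demo
            try (FileChannel channel = FileChannel.open(path, StandardOpenOption.WRITE);
                 FileLock lock = channel.tryLock()) {
                // tryLock returns null if another program holds the lock;
                // try-with-resources simply skips closing a null resource
                if (lock == null) {
                    return false; // the caller could wait, or skip the write
                }
                // ... write the configuration data here while holding the lock ...
                return lock.isValid();
            } finally {
                Files.deleteIfExists(path); // runs after the lock and channel are closed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lockScratchFile());
    }
}
```

Resources are closed in reverse order, so the lock is released before the channel is closed.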