Processing Input and Output
- 9.1. Input/Output Streams, Readers, and Writers
- 9.2. Paths, Files, and Directories
- 9.3. HTTP Connections
- 9.4. Regular Expressions
- 9.5. Serialization
- Exercises
In this chapter, you will learn how to work with files, directories, and web pages, and how to read and write data in binary and text format. You will also find a discussion of regular expressions, which can be useful for processing input. (I couldn’t think of a better place to handle that topic, and apparently neither could the Java developers—when the regular expression API specification was proposed, it was attached to the specification request for “new I/O” features.) Finally, this chapter shows you the object serialization mechanism that lets you store objects as easily as you can store text or numeric data.
The key points of this chapter are:
- An InputStream is a source of bytes, and an OutputStream is a destination for bytes.
- A Reader reads characters, and a Writer writes them. Be sure to specify a character encoding.
- The Files class has convenience methods for reading all bytes or lines of a file.
- The DataInput and DataOutput interfaces have methods for writing numbers in binary format.
- Use a RandomAccessFile or a memory-mapped file for random access.
- A Path is an absolute or relative sequence of path components in a file system. Paths can be combined (or “resolved”).
- Use the methods of the Files class to copy, move, or delete files and to recursively walk through a directory tree.
- To read or update a ZIP file, use a ZIP file system.
- You can read the contents of a web page with the URL class. To read metadata or write data, use the URLConnection class.
- With the Pattern and Matcher classes, you can find all matches of a regular expression in a string, as well as the captured groups for each match.
- The serialization mechanism can save and restore any object implementing the Serializable interface, provided its instance variables are also serializable.
9.1. Input/Output Streams, Readers, and Writers
In the Java API, a source from which one can read bytes is called an input stream. The bytes can come from a file, a network connection, or an array in memory. (These streams are unrelated to the streams of Chapter 8.) Similarly, a destination for bytes is an output stream. In contrast, readers and writers consume and produce sequences of characters. In the following sections, you will learn how to read and write bytes and characters.
9.1.1. Obtaining Streams
The easiest way to obtain a stream from a file is with the static methods
InputStream in = Files.newInputStream(path);
OutputStream out = Files.newOutputStream(path);
Here, path
is an instance of the Path
class that is covered in Section 9.2.1. It describes a path in a file system.
If you have a URL object, you can read its contents from the input stream returned by the openStream method. (The URL constructors are deprecated, and you should create a URL instance as shown here.)
var url = URI.create("https://horstmann.com/index.html").toURL();
InputStream in = url.openStream();
Section 9.3 shows how to send data to a web server.
The ByteArrayInputStream
class lets you read from an array of bytes.
byte[] bytes = ...;
var in = new ByteArrayInputStream(bytes);
// Read from in
Conversely, to send output to a byte array, use a ByteArrayOutputStream
:
var out = new ByteArrayOutputStream();
// Write to out
byte[] bytes = out.toByteArray();
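To see the two byte array streams working together, here is a minimal sketch. The class name ByteArrayStreamDemo and the doubled method are made up for this illustration. The method reads every byte from an in-memory input stream and writes twice its value to an in-memory output stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the Java API
class ByteArrayStreamDemo {
    // Reads each byte of the input array and collects twice its value
    static byte[] doubled(byte[] input) throws IOException {
        try (var in = new ByteArrayInputStream(input);
             var out = new ByteArrayOutputStream()) {
            int b;
            while ((b = in.read()) != -1) { // -1 signals the end of input
                out.write(b * 2);           // write(int) stores only the low-order 8 bits
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Arrays.toString(doubled(new byte[] { 1, 2, 3 }))); // prints [2, 4, 6]
    }
}
```

Because both streams live in memory, this pattern is handy for testing code that expects an InputStream or OutputStream without touching the file system.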
9.1.2. Reading Bytes
The InputStream
class has a method to read a single byte:
InputStream in = ...;
int b = in.read();
This method either returns the byte as an integer between 0
and 255
, or returns -1
if the end of input has been reached.
More commonly, you will want to read the bytes in bulk. The most convenient method is the readAllBytes
method that simply reads all bytes from the stream into a byte array:
byte[] bytes = in.readAllBytes();
If you want to read some, but not all bytes, provide a byte array and call the readNBytes
method:
var bytes = new byte[len];
int bytesRead = in.readNBytes(bytes, offset, n);
The method reads until either n bytes are read or no further input is available, and returns the actual number of bytes read. If no input was available at all, readNBytes returns 0. (The older read(bytes, offset, n) method returns -1 in that case.)
Finally, you can skip bytes:
long bytesToSkip = ...;
in.skipNBytes(bytesToSkip);
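The following sketch combines read, skipNBytes, and readNBytes. The record layout (a one-byte tag, two filler bytes, then a payload) and the names ReadBytesDemo and readPayload are assumptions made up for the example. Note that skipNBytes requires Java 12 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class; the record layout is invented for illustration
class ReadBytesDemo {
    // Reads a 1-byte tag, skips 2 filler bytes, then fills the payload array as far as possible
    static int readPayload(InputStream in, byte[] payload) throws IOException {
        int tag = in.read();           // 0..255, or -1 at the end of input
        if (tag == -1) return 0;
        in.skipNBytes(2);              // throws EOFException if fewer than 2 bytes remain
        return in.readNBytes(payload, 0, payload.length); // may return fewer, even 0
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { 7, 0, 0, 10, 20, 30 };
        var payload = new byte[4];
        int n = readPayload(new ByteArrayInputStream(data), payload);
        System.out.println(n + " bytes read"); // prints 3 bytes read
    }
}
```

Only three payload bytes are available, so readNBytes returns 3 and leaves the last array element untouched.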
9.1.3. Writing Bytes
The write
methods of an OutputStream
can write individual bytes and byte arrays.
OutputStream out = ...;
int b = ...;
out.write(b);
byte[] bytes = ...;
out.write(bytes);
out.write(bytes, start, length);
When you are done writing to a stream, you must close it in order to commit any buffered output. This is best done with a try-with-resources statement:
try (OutputStream out = ...) {
    out.write(bytes);
}
If you need to copy an input stream to an output stream, use the InputStream.transferTo
method:
try (InputStream in = ...; OutputStream out = ...) {
    in.transferTo(out);
}
Both streams need to be closed after the call to transferTo
. It is best to use a try-with-resources statement, as in the code example.
To write a file to an OutputStream
, call
Files.copy(path, out);
Conversely, to save an InputStream
to a file, call
Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
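Here is a small sketch that exercises transferTo and both Files.copy variants. It works on a temporary file so that it can be run anywhere. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical demo class, not part of the Java API
class CopyStreamsDemo {
    // Copies an input stream to an output stream in one call
    static byte[] viaTransferTo(byte[] data) throws IOException {
        try (InputStream in = new ByteArrayInputStream(data);
             var out = new ByteArrayOutputStream()) {
            in.transferTo(out);
            return out.toByteArray();
        }
    }

    // Saves a stream to a temporary file, then writes the file to an output stream
    static byte[] viaTempFile(byte[] data) throws IOException {
        Path path = Files.createTempFile("copy-demo", ".bin");
        try {
            try (InputStream in = new ByteArrayInputStream(data)) {
                // The temporary file already exists, so REPLACE_EXISTING is required
                Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
            }
            var out = new ByteArrayOutputStream();
            Files.copy(path, out);
            return out.toByteArray();
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { 1, 2, 3 };
        System.out.println(viaTransferTo(data).length + " " + viaTempFile(data).length); // prints 3 3
    }
}
```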
9.1.4. Character Encodings
Input and output streams are for sequences of bytes, but in many cases you will work with text, that is, sequences of characters. It then matters how characters are encoded into bytes.
Java uses the Unicode standard for characters. Each character or “code point” has a 21-bit integer number. There are different character encodings—methods for packaging those 21-bit numbers into bytes.
The most common encoding is UTF-8, which encodes each Unicode code point into a sequence of one to four bytes (see Table 9.1). UTF-8 has the advantage that the characters of the traditional ASCII character set, which contains all characters used in English, only take up one byte each.
Table 9.1: UTF-8 Encoding
Character range     Encoding
0...7F              0a₆a₅a₄a₃a₂a₁a₀
80...7FF            110a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
800...FFFF          1110a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
10000...10FFFF      11110a₂₀a₁₉a₁₈ 10a₁₇a₁₆a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
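The bit patterns of Table 9.1 translate directly into shifts and masks. The following sketch is for illustration only, since in practice you would let the library do the encoding. The class Utf8Sketch and its encode method are hypothetical, and the method assumes that it receives a valid code point that is not a surrogate.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class that follows the rows of the UTF-8 table
class Utf8Sketch {
    // Encodes one code point (assumed valid and not a surrogate) into UTF-8 bytes
    static byte[] encode(int cp) {
        var out = new ByteArrayOutputStream();
        if (cp <= 0x7F) {                       // 1 byte: 0aaaaaaa
            out.write(cp);
        } else if (cp <= 0x7FF) {               // 2 bytes: 110aaaaa 10aaaaaa
            out.write(0b110_00000 | (cp >> 6));
            out.write(0b10_000000 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {              // 3 bytes: 1110aaaa 10aaaaaa 10aaaaaa
            out.write(0b1110_0000 | (cp >> 12));
            out.write(0b10_000000 | ((cp >> 6) & 0x3F));
            out.write(0b10_000000 | (cp & 0x3F));
        } else {                                // 4 bytes: 11110aaa followed by three 10aaaaaa
            out.write(0b11110_000 | (cp >> 18));
            out.write(0b10_000000 | ((cp >> 12) & 0x3F));
            out.write(0b10_000000 | ((cp >> 6) & 0x3F));
            out.write(0b10_000000 | (cp & 0x3F));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int euro = 0x20AC; // the euro sign
        byte[] expected = new String(Character.toChars(euro)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(encode(euro), expected)); // prints true
    }
}
```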
A less common encoding is UTF-16, which encodes each Unicode code point into one or two 16-bit values (see Table 9.2). This is the encoding used in Java strings. Actually, there are two forms of UTF-16, called “big-endian” and “little-endian.” Consider the 16-bit value 0x2122
. In big-endian format, the more significant byte comes first: 0x21
followed by 0x22
. In little-endian format, it is the other way around: 0x22 0x21
. To indicate which of the two is used, a file can start with the “byte order mark,” the 16-bit quantity 0xFEFF
. A reader can use this value to determine the byte order and discard it.
Table 9.2: UTF-16 Encoding
Character range     Encoding
0...FFFF            a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀a₉a₈a₇a₆a₅a₄a₃a₂a₁a₀
10000...10FFFF      110110b₁₉b₁₈b₁₇b₁₆a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀ 110111a₉a₈a₇a₆a₅a₄a₃a₂a₁a₀
where b₁₉b₁₈b₁₇b₁₆ = a₂₀a₁₉a₁₈a₁₇a₁₆ - 1
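The second row of Table 9.2 can be sketched in a few lines. Subtracting 0x10000 from the code point is the same as subtracting 1 from the bits above position 15, which yields the b bits. The class Utf16Sketch is hypothetical. The library method Character.toChars does the same job and is used here as a check.

```java
import java.util.Arrays;

// Hypothetical demo class that follows the second row of the UTF-16 table
class Utf16Sketch {
    // Splits a code point in the range 0x10000..0x10FFFF into two UTF-16 code units
    static char[] surrogates(int cp) {
        int offset = cp - 0x10000;                      // 20 bits: b19..b16 a15..a0
        char high = (char) (0xD800 | (offset >> 10));   // 110110 followed by the top 10 bits
        char low = (char) (0xDC00 | (offset & 0x3FF));  // 110111 followed by the low 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println(Arrays.equals(surrogates(cp), Character.toChars(cp))); // prints true
    }
}
```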
In addition to the UTF encodings, there are partial encodings that cover a character range suitable for a given user population. For example, ISO 8859-1 is a one-byte code that includes accented characters used in Western European languages. Shift_JIS is a variable-length code for Japanese characters. A large number of these encodings are still in widespread use.
Because UTF-8 is so common, it has become the default encoding since Java 18. Previously, the default encoding was the native encoding—the character encoding that is preferred by the operating system of the computer running your program. On Windows, that is generally not UTF-8. If you are using an older version of Java, or if you are working with text in an encoding other than UTF-8, you need to explicitly specify the encoding.
The StandardCharsets
class has static variables of type Charset
for the character encodings that every Java virtual machine must support:
StandardCharsets.UTF_8
StandardCharsets.UTF_16
StandardCharsets.UTF_16BE
StandardCharsets.UTF_16LE
StandardCharsets.ISO_8859_1
StandardCharsets.US_ASCII
To obtain the Charset
for another encoding, use the static forName
method:
Charset shiftJIS = Charset.forName("Shift_JIS");
You use the Charset
object to specify a character encoding. For example, you can turn an array of bytes into a string as
var contents = new String(bytes, StandardCharsets.ISO_8859_1);
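Choosing the wrong Charset does not cause an error. It silently produces wrong characters. The following sketch, with a made-up class name, encodes a string as UTF-8 and then decodes the bytes both correctly and incorrectly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Java API
class CharsetDemo {
    // Turns bytes into a string, using the given character encoding
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00E9" is the word cafe with an accented e; UTF-8 needs two bytes for that letter
        byte[] bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);                                        // prints 5
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length());      // prints 4
        // ISO 8859-1 treats each byte as one character, so the accented letter turns into two
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length()); // prints 5
    }
}
```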
9.1.5. Text Input
To read text input, use a Reader
. You can obtain a Reader
from any input stream with the InputStreamReader
adapter:
InputStream inStream = ...;
var in = new InputStreamReader(inStream, charset);
If you want to process the input one UTF-16 code unit at a time, you can call the read
method:
int ch = in.read();
The method returns a code unit between 0 and 65535, or -1 at the end of input.
That is not very convenient. Here are several alternatives.
With a short text file, you can read it into a string like this:
String content = Files.readString(path, charset);
But if you want the file as a sequence of lines, call
List<String> lines = Files.readAllLines(path, charset);
If the file is large, process the lines lazily as a Stream<String>:
try (Stream<String> lines = Files.lines(path, charset)) { ... }
To read numbers or words from a file, use a Scanner, as you have seen in Chapter 1. For example,
var in = new Scanner(path);
while (in.hasNextDouble()) {
    double value = in.nextDouble();
    ...
}
If your input does not come from a file, wrap the InputStream
into a BufferedReader
:
try (var reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
    Stream<String> lines = reader.lines();
    ...
}
A BufferedReader
reads input in chunks for efficiency. (Oddly, this is not an option for basic readers.) It has methods readLine
to read a single line and lines
to yield a stream of lines.
If a method asks for a Reader
and you want it to read from a file, call Files.newBufferedReader(path, charset)
.
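As a sketch of how these alternatives fit together, the following program writes a small temporary file and then reads it back in several ways. The class TextInputDemo, its methods, and the file contents are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical demo class, not part of the Java API
class TextInputDemo {
    // Streams the lines lazily; the stream must be closed to release the file
    static long countNonBlank(Path path) throws IOException {
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.filter(line -> !line.isBlank()).count();
        }
    }

    // Works with any Reader, not just one that reads from a file
    static String firstLine(Reader source) throws IOException {
        try (var reader = new BufferedReader(source)) {
            return reader.readLine();
        }
    }

    // Writes a three-line sample file and returns the number of nonblank lines
    static long demo() throws IOException {
        Path path = Files.createTempFile("text-input", ".txt");
        try {
            Files.writeString(path, "alpha\n\nbeta\n", StandardCharsets.UTF_8);
            List<String> all = Files.readAllLines(path, StandardCharsets.UTF_8);
            System.out.println(all.size() + " lines in total"); // prints 3 lines in total
            return countNonBlank(path);
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());                                // prints 2
        System.out.println(firstLine(new StringReader("one\ntwo"))); // prints one
    }
}
```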
9.1.6. Text Output
To write text, use a Writer
. With the write
method, you can write strings. You can turn any output stream into a Writer
:
OutputStream outStream = ...;
var out = new OutputStreamWriter(outStream, charset);
out.write(str);
To get a writer for a file, use
Writer out = Files.newBufferedWriter(path, charset);
It is more convenient to use a PrintWriter, which has the print, println, and printf methods that you have always used with System.out. Using those methods, you can print numbers and use formatted output.
If you write to a file, construct a PrintWriter
like this:
var out = new PrintWriter(Files.newBufferedWriter(path, charset));
If you write to another stream, use
var out = new PrintWriter(new OutputStreamWriter(outStream, charset));
If you already have the text to write in a string, call
String content = ...;
Files.writeString(path, content, charset);
or
Files.write(path, lines, charset);
Here, lines
can be a Collection<String>
, or even more generally, an Iterable<? extends CharSequence>
.
To append to a file, use
Files.writeString(path, content, charset, StandardOpenOption.APPEND);
Files.write(path, lines, charset, StandardOpenOption.APPEND);
Sometimes, a library method wants a Writer
to write output. If you want to capture that output in a string, hand it a StringWriter
. Or, if it wants a PrintWriter
, wrap the StringWriter
like this:
var writer = new StringWriter();
throwable.printStackTrace(new PrintWriter(writer));
String stackTrace = writer.toString();
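The following sketch puts several of these techniques into one runnable program: a PrintWriter on a file, an append with Files.writeString, and a StringWriter that captures a stack trace. The class name, method names, and file contents are hypothetical. The printf call passes Locale.ROOT so that the output does not depend on the locale of the machine.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the Java API
class TextOutputDemo {
    // Writes two lines with a PrintWriter, appends a third, and returns the lines of the file
    static List<String> writeReport(Path path) throws IOException {
        try (var out = new PrintWriter(Files.newBufferedWriter(path, StandardCharsets.UTF_8))) {
            out.println("Items");
            out.printf(Locale.ROOT, "%s=%d%n", "count", 42);
        } // closing the writer commits the buffered output
        Files.writeString(path, "done\n", StandardCharsets.UTF_8, StandardOpenOption.APPEND);
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    }

    // Captures the output of a method that insists on a PrintWriter
    static String stackTraceOf(Throwable throwable) {
        var writer = new StringWriter();
        throwable.printStackTrace(new PrintWriter(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("report", ".txt");
        System.out.println(writeReport(path)); // prints [Items, count=42, done]
        Files.deleteIfExists(path);
        System.out.println(stackTraceOf(new IllegalStateException("oops")).lines().findFirst().orElse(""));
    }
}
```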
9.1.7. Reading Character Input
If you read a file with a structured format such as JSON or XML, you will use a parser written by someone who understands the fiddly details of that format. Such a parser typically reads a character at a time.
In the uncommon case that you need to write such a parser, use a BufferedReader
for efficiency. Keep calling its read
method, which yields a char
value or -1 at the end of input. The reader converts the encoding of the input stream into UTF-16.
If you want to process Unicode code points, you need to handle the UTF-16 encoding. Here is how to read one code point:
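int ch = reader.read();
if (ch != -1) {
    int codePoint;
    if (Character.isHighSurrogate((char) ch)) {
        int ch2 = reader.read();
        // toCodePoint takes two char arguments, so both values need a cast
        if (Character.isLowSurrogate((char) ch2))
            codePoint = Character.toCodePoint((char) ch, (char) ch2);
        else
            // The constructor requires the length of the malformed input
            throw new MalformedInputException(1);
    }
    else codePoint = ch;
}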
The Character
class contains methods to tell whether a particular code point has a given property. For example,
Character.isLetter(codePoint)
returns true
if codePoint
is a letter in some language. Here are some other classification methods:
isUpperCase isLowerCase isDigit isSpaceChar isEmoji
These methods use the rules of the Unicode standard. Others refer to the rules of the Java language:
isJavaIdentifierStart isJavaIdentifierPart isWhitespace
After analyzing the code points, you often need to store them in strings, converting them back to UTF-16. The appendCodePoint
method of the StringBuilder
class turns a code point into one or two char
values which are appended to the builder.
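Here is a minimal sketch of the whole round trip: read code points from a Reader, classify them, and store the interesting ones with appendCodePoint. The class CodePointDemo and the lettersOnly method are made up for this example, and throwing a plain IOException for an unpaired surrogate is just one possible choice.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class, not part of the Java API
class CodePointDemo {
    // Keeps only the code points that are letters, including those outside the 16-bit range
    static String lettersOnly(Reader reader) throws IOException {
        var result = new StringBuilder();
        int ch;
        while ((ch = reader.read()) != -1) {
            int codePoint = ch;
            if (Character.isHighSurrogate((char) ch)) {
                int ch2 = reader.read();
                if (ch2 == -1 || !Character.isLowSurrogate((char) ch2))
                    throw new IOException("Unpaired surrogate");
                codePoint = Character.toCodePoint((char) ch, (char) ch2);
            }
            if (Character.isLetter(codePoint))
                result.appendCodePoint(codePoint); // appends one or two char values
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // \uD835\uDC00 is U+1D400, a mathematical bold capital A, which counts as a letter
        String input = "a1\uD835\uDC00!b";
        System.out.println(lettersOnly(new StringReader(input)).length()); // prints 4
    }
}
```

The result has four char values but only three code points, since the bold A occupies two code units.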
9.1.8. Reading and Writing Binary Data
The DataInput
interface declares the following methods for reading a number, a character, a boolean
value, or a string in binary format:
byte readByte()
int readUnsignedByte()
char readChar()
short readShort()
int readUnsignedShort()
int readInt()
long readLong()
float readFloat()
double readDouble()
boolean readBoolean()
String readUTF()
void readFully(byte[] b)
The DataOutput
interface declares corresponding write
methods.
The advantage of binary I/O is that it is fixed width and efficient. For example, writeInt
always writes an integer as a big-endian 4-byte binary quantity regardless of the number of digits. The space needed is the same for each value of a given type, which speeds up random access. Also, reading binary data is faster than parsing text. The main drawback is that the resulting files cannot be easily inspected in a text editor.
You can use the DataInputStream
and DataOutputStream
adapters with any stream. For example,
DataInput in = new DataInputStream(Files.newInputStream(path));
DataOutput out = new DataOutputStream(Files.newOutputStream(path));
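To illustrate the fixed-width format, the following sketch writes an int and a double to a byte array and reads them back. The class BinaryDataDemo and the record contents are assumptions made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class, not part of the Java API
class BinaryDataDemo {
    // Writes a 4-byte int followed by an 8-byte double
    static byte[] encode(int id, double price) throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            out.writeInt(id);
            out.writeDouble(price);
        }
        return bytes.toByteArray();
    }

    // Reads the values back in the same order in which they were written
    static String decode(byte[] data) throws IOException {
        try (var in = new DataInputStream(new ByteArrayInputStream(data))) {
            int id = in.readInt();
            double price = in.readDouble();
            return id + ":" + price;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = encode(1, 2.5);
        System.out.println(data.length);  // prints 12
        System.out.println(decode(data)); // prints 1:2.5
    }
}
```

The record always takes 12 bytes, no matter how many digits the numbers have.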
9.1.9. Random-Access Files
The RandomAccessFile
class lets you read or write data anywhere in a file. You can open a random-access file either for reading only or for both reading and writing; specify the option by using the string "r"
(for read access) or "rw"
(for read/write access) as the second argument in the constructor. For example,
var file = new RandomAccessFile(path.toString(), "rw");
A random-access file has a file pointer that indicates the position of the next byte to be read or written. The seek
method sets the file pointer to an arbitrary byte position within the file. The argument to seek
is a long integer between zero and the length of the file (which you can obtain with the length
method). The getFilePointer
method returns the current position of the file pointer.
The RandomAccessFile
class implements both the DataInput
and DataOutput
interfaces. To read and write numbers from a random-access file, use methods such as readInt
/writeInt
that you saw in the preceding section. For example,
int value = file.readInt();
file.seek(file.getFilePointer() - 4);
file.writeInt(value + 1);
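Because every int occupies four bytes, the position of the value with a given index is easy to compute. The following sketch assumes a file that consists of nothing but int values. The class RandomAccessDemo and its methods are hypothetical.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical demo class, not part of the Java API
class RandomAccessDemo {
    // Adds one to the int with the given index, without touching the rest of the file
    static void increment(Path path, int index) throws IOException {
        try (var file = new RandomAccessFile(path.toString(), "rw")) {
            long position = (long) index * Integer.BYTES;
            file.seek(position);
            int value = file.readInt(); // advances the file pointer by 4
            file.seek(position);
            file.writeInt(value + 1);
        }
    }

    // Writes 10, 20, 30 to a temporary file, increments the middle value, and reads all back
    static int[] demo() throws IOException {
        Path path = Files.createTempFile("records", ".bin");
        try {
            try (var out = new DataOutputStream(Files.newOutputStream(path))) {
                for (int v : new int[] { 10, 20, 30 }) out.writeInt(v);
            }
            increment(path, 1);
            var result = new int[3];
            try (var file = new RandomAccessFile(path.toString(), "r")) {
                for (int i = 0; i < result.length; i++) result[i] = file.readInt();
            }
            return result;
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Arrays.toString(demo())); // prints [10, 21, 30]
    }
}
```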
9.1.10. Memory-Mapped Files
Memory-mapped files provide another, very efficient approach for random access that works well for very large files. However, the API for data access is completely different from that of input/output streams. First, get a channel to the file:
FileChannel channel = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE);
Then, map an area of the file (or, if it is not too large, the entire file) into memory:
ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
Use methods get
, getInt
, getDouble
, and so on to read values, and the equivalent put
methods to write values.
int offset = ...;
int value = buffer.getInt(offset);
buffer.putInt(offset, value + 1);
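At some point, the operating system writes these changes back to the file. The buffer returned by map is a MappedByteBuffer. If you declare it with that type, you can call its force method to request that the changes be written out right away.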
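The following sketch maps a small temporary file and updates the int at its start. The class MappedFileDemo and the file contents are made up for the example. The buffer is declared as a MappedByteBuffer so that the force method is available.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of the Java API
class MappedFileDemo {
    // Maps the entire file and adds one to the big-endian int at offset 0
    static int incrementFirstInt(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            int value = buffer.getInt(0);
            buffer.putInt(0, value + 1);
            buffer.force(); // asks the operating system to write the change to the file
            return buffer.getInt(0);
        }
    }

    // Creates a 4-byte file holding the value 41 and increments it
    static int demo() throws IOException {
        Path path = Files.createTempFile("mapped", ".bin");
        path.toFile().deleteOnExit(); // some systems cannot delete a file while it is mapped
        Files.write(path, new byte[] { 0, 0, 0, 41 });
        return incrementFirstInt(path);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints 42
    }
}
```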
9.1.11. File Locking
When multiple simultaneously executing programs modify the same file, they need to communicate in some way, or the file can easily become damaged. File locks can solve this problem.
Suppose your application saves a configuration file with user preferences. If a user invokes two instances of the application, it could happen that both of them want to write the configuration file at the same time. In that situation, the first instance should lock the file. When the second instance finds the file locked, it can decide to wait until the file is unlocked or simply skip the writing process. To lock a file, call either the lock or the tryLock method of the FileChannel class.
FileChannel channel = FileChannel.open(path, StandardOpenOption.WRITE);
FileLock lock = channel.lock();
or
FileLock lock = channel.tryLock();
The first call blocks until the lock becomes available. The second call returns immediately, either with the lock or with null
if the lock is not available. The file remains locked until the lock or the channel is closed. It is best to use a try-with-resources statement:
try (FileLock lock = channel.lock()) { ... }
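Here is one way to sketch the "skip the writing process" strategy for the configuration file scenario. The class FileLockDemo, the method saveIfUnlocked, and the sample contents are hypothetical. Keep in mind that file locks coordinate separate processes. If the same virtual machine already holds an overlapping lock, tryLock throws an OverlappingFileLockException instead of returning null.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of the Java API
class FileLockDemo {
    // Replaces the file contents unless another process holds the lock; returns true on success
    static boolean saveIfUnlocked(Path path, String contents) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.tryLock()) { // a null resource is simply not closed
            if (lock == null) return false;       // locked elsewhere, so skip the write
            channel.truncate(0);                  // discard old contents only after locking
            ByteBuffer buffer = ByteBuffer.wrap(contents.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) channel.write(buffer);
            return true;
        } // the lock is released first, then the channel is closed
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("config", ".properties");
        System.out.println(saveIfUnlocked(path, "theme=dark\n")); // prints true
        Files.deleteIfExists(path);
    }
}
```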