Processing Input and Output

By Cay S. Horstmann
Oct 25, 2022


9.1 Input/Output Streams, Readers, and Writers
9.2 Paths, Files, and Directories
9.3 HTTP Connections
9.4 Regular Expressions
9.5 Serialization
Exercises



This chapter is from the book 

Core Java for the Impatient, 3rd Edition

Learn More Buy

In this chapter, you will learn how to work with files, directories, and web pages, and how to read and write data in binary and text format. You will also find a discussion of regular expressions, which can be useful for processing input. (I couldn’t think of a better place to handle that topic, and apparently neither could the Java developers—when the regular expression API specification was proposed, it was attached to the specification request for “new I/O” features.) Finally, this chapter shows you the object serialization mechanism that lets you store objects as easily as you can store text or numeric data.

The key points of this chapter are:

An InputStream is a source of bytes, and an OutputStream is a destination for bytes.
A Reader reads characters, and a Writer writes them. Be sure to specify a character encoding.
The Files class has convenience methods for reading all bytes or lines of a file.
The DataInput and DataOutput interfaces have methods for writing numbers in binary format.
Use a RandomAccessFile or a memory-mapped file for random access.
A Path is an absolute or relative sequence of path components in a file system. Paths can be combined (or “resolved”).
Use the methods of the Files class to copy, move, or delete files and to recursively walk through a directory tree.
To read or update a ZIP file, use a ZIP file system.
You can read the contents of a web page with the URL class. To read metadata or write data, use the URLConnection class.
With the Pattern and Matcher classes, you can find all matches of a regular expression in a string, as well as the captured groups for each match.
The serialization mechanism can save and restore any object implementing the Serializable interface, provided its instance variables are also serializable.

9.1 Input/Output Streams, Readers, and Writers

In the Java API, a source from which one can read bytes is called an input stream. The bytes can come from a file, a network connection, or an array in memory. (These streams are unrelated to the streams of Chapter 8.) Similarly, a destination for bytes is an output stream. In contrast, readers and writers consume and produce sequences of characters. In the following sections, you will learn how to read and write bytes and characters.

9.1.1 Obtaining Streams

The easiest way to obtain a stream from a file is with the static methods

InputStream in = Files.newInputStream(path);
OutputStream out = Files.newOutputStream(path);

Here, path is an instance of the Path class that is covered in Section 9.2.1, “Paths” (page 312). It describes a path in a file system.

If you have a URL, you can read its contents from the input stream returned by the openStream method of the URL class:

var url = new URL("https://horstmann.com/index.html");
InputStream in = url.openStream();

Section 9.3, “HTTP Connections” (page 320) shows how to send data to a web server.

The ByteArrayInputStream class lets you read from an array of bytes.

byte[] bytes = ...;
var in = new ByteArrayInputStream(bytes);
// Read from in

Conversely, to send output to a byte array, use a ByteArrayOutputStream:

var out = new ByteArrayOutputStream();
// Write to out
byte[] bytes = out.toByteArray();
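
As a minimal sketch of how the two byte array streams fit together, the following writes an array to a ByteArrayOutputStream, recovers the bytes, and reads them back one at a time. The class and method names (ByteArrayStreamDemo, roundTripSum) are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class ByteArrayStreamDemo {
    // Writes the data to an in-memory output stream, then reads the bytes
    // back one by one and returns their sum (each byte counted as 0...255).
    static int roundTripSum(byte[] data) throws IOException {
        var out = new ByteArrayOutputStream();
        out.write(data);
        byte[] copy = out.toByteArray();

        int sum = 0;
        try (var in = new ByteArrayInputStream(copy)) {
            int b;
            while ((b = in.read()) != -1) { // -1 signals the end of input
                sum += b;
            }
        }
        return sum;
    }
}
```

Note that read returns each byte as a nonnegative int, so the byte 0xFF contributes 255 to the sum, not -1.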

9.1.2 Reading Bytes

The InputStream class has a method to read a single byte:

InputStream in = ...;
int b = in.read();

This method either returns the byte as an integer between 0 and 255, or returns -1 if the end of input has been reached.

More commonly, you will want to read the bytes in bulk. The most convenient method is the readAllBytes method that simply reads all bytes from the stream into a byte array:

byte[] bytes = in.readAllBytes();

TIP

If you want to read all bytes from a file, call the convenience method

byte[] bytes = Files.readAllBytes(path);

If you want to read some, but not all bytes, provide a byte array and call the readNBytes method:

var bytes = new byte[len];
int bytesRead = in.readNBytes(bytes, offset, n);

The method reads until either n bytes are read or no further input is available, and returns the actual number of bytes read. If no input was available at all, readNBytes returns 0. (The older read(bytes, offset, n) method, by contrast, returns -1 at the end of input.)

Finally, you can skip bytes:

long bytesToSkip = ...;
in.skipNBytes(bytesToSkip);
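
Here is a sketch that combines skipNBytes and readNBytes: it skips a header and then consumes the remaining input in fixed-size chunks. The names ReadNBytesDemo, countAfterHeader, headerSize, and chunkSize are hypothetical. The loop relies on readNBytes returning 0 once the end of input has been reached.

```java
import java.io.IOException;
import java.io.InputStream;

class ReadNBytesDemo {
    // Skips headerSize bytes, then reads the rest in chunks of at most
    // chunkSize bytes. Returns the number of bytes read after the header.
    static int countAfterHeader(InputStream in, long headerSize, int chunkSize)
            throws IOException {
        in.skipNBytes(headerSize); // throws EOFException if the input is too short
        var buffer = new byte[chunkSize];
        int total = 0;
        int n;
        while ((n = in.readNBytes(buffer, 0, chunkSize)) > 0) {
            total += n; // a real program would process buffer[0] ... buffer[n - 1] here
        }
        return total;
    }
}
```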

9.1.3 Writing Bytes

The write methods of an OutputStream can write individual bytes and byte arrays.

OutputStream out = ...;
int b = ...;
out.write(b);
byte[] bytes = ...;
out.write(bytes);
out.write(bytes, start, length);

When you are done writing to a stream, you must close it in order to commit any buffered output. This is best done with a try-with-resources statement:

try (OutputStream out = ...) {
    out.write(bytes);
}

If you need to copy an input stream to an output stream, use the InputStream.transferTo method:

try (InputStream in = ...; OutputStream out = ...) {
    in.transferTo(out);
}

Both streams need to be closed after the call to transferTo. It is best to use a try-with-resources statement, as in the code example.

To write a file to an OutputStream, call

Files.copy(path, out);

Conversely, to save an InputStream to a file, call

Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
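
The two Files.copy calls can be tried out together. This sketch saves an in-memory stream to a temporary file and then copies the file back into a byte array. The class name CopyDemo, the method name viaTempFile, and the temporary file prefix are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CopyDemo {
    // Saves the data to a temporary file, copies the file back into
    // memory, and returns the bytes that were read back.
    static byte[] viaTempFile(byte[] data) throws IOException {
        Path path = Files.createTempFile("copy-demo", ".bin");
        try {
            try (InputStream in = new ByteArrayInputStream(data)) {
                // createTempFile has already created the file, so the copy
                // would fail without REPLACE_EXISTING
                Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
            }
            var out = new ByteArrayOutputStream();
            Files.copy(path, out);
            return out.toByteArray();
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```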

9.1.4 Character Encodings

Input and output streams are for sequences of bytes, but in many cases you will work with text, that is, sequences of characters. It then matters how characters are encoded into bytes.

Java uses the Unicode standard for characters. Each character or “code point” has a 21-bit integer number. There are different character encodings—methods for packaging those 21-bit numbers into bytes.

The most common encoding is UTF-8, which encodes each Unicode code point into a sequence of one to four bytes (see Table 9-1). UTF-8 has the advantage that the characters of the traditional ASCII character set, which contains all characters used in English, only take up one byte each.

Table 9-1 UTF-8 Encoding

Character range	Encoding
`0...7F`	`0a₆a₅a₄a₃a₂a₁a₀`
`80...7FF`	`110a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀`
`800...FFFF`	`1110a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀`
`10000...10FFFF`	`11110a₂₀a₁₉a₁₈ 10a₁₇a₁₆a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀`
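
To see how the rows of Table 9-1 translate into bit operations, here is a hand-written encoder that is compared against the library's UTF-8 encoder. This is only an illustration (in practice, call getBytes with StandardCharsets.UTF_8). The names Utf8Demo, encode, and matchesLibrary are made up, and the code assumes a valid code point that is not a surrogate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Encodes one code point, following the bit layout of Table 9-1.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0b1100_0000 | (cp >> 6)),
                (byte) (0b1000_0000 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0b1110_0000 | (cp >> 12)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)),
                (byte) (0b1000_0000 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0b1111_0000 | (cp >> 18)),
                (byte) (0b1000_0000 | ((cp >> 12) & 0x3F)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)),
                (byte) (0b1000_0000 | (cp & 0x3F)) };
        }
    }

    // Checks the hand-written encoding against the library's encoder.
    static boolean matchesLibrary(int cp) {
        byte[] expected = new String(Character.toChars(cp))
            .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(encode(cp), expected);
    }
}
```

For example, the trademark symbol U+2122 falls into the third row and is encoded as the three bytes E2 84 A2.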

Another common encoding is UTF-16, which encodes each Unicode code point into one or two 16-bit values (see Table 9-2). This is the encoding used in Java strings. Actually, there are two forms of UTF-16, called “big-endian” and “little-endian.” Consider the 16-bit value 0x2122. In big-endian format, the more significant byte comes first: 0x21 followed by 0x22. In little-endian format, it is the other way around: 0x22 0x21. To indicate which of the two is used, a file can start with the “byte order mark,” the 16-bit quantity 0xFEFF. A reader can use this value to determine the byte order and discard it.

Table 9-2 UTF-16 Encoding

Character range	Encoding
`0...FFFF`	`a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀a₉a₈a₇a₆a₅a₄a₃a₂a₁a₀`
`10000...10FFFF`	`110110b₁₉b₁₈b₁₇b₁₆a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀ 110111a₉a₈a₇a₆a₅a₄a₃a₂a₁a₀` where `b₁₉b₁₈b₁₇b₁₆` = `a₂₀a₁₉a₁₈a₁₇a₁₆` – 1
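
The second row of Table 9-2 can be sketched in a few lines. Subtracting 0x10000 from the code point has the same effect as subtracting 1 from the top five bits, which leaves a 20-bit value that is split into two 10-bit halves. The names Utf16Demo and surrogates are hypothetical. The library equivalent is Character.toChars.

```java
class Utf16Demo {
    // Splits a supplementary code point (0x10000...0x10FFFF) into two
    // 16-bit code units, following the layout of Table 9-2.
    static char[] surrogates(int cp) {
        int v = cp - 0x10000;                       // 20 bits remain
        char high = (char) (0xD800 | (v >> 10));    // 110110 followed by the top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));   // 110111 followed by the bottom 10 bits
        return new char[] { high, low };
    }
}
```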


CAUTION

Some programs, including Microsoft Notepad, add a byte order mark at the beginning of UTF-8 encoded files. Clearly, this is unnecessary since there are no byte ordering issues in UTF-8. But the Unicode standard allows it, and even suggests that it’s a pretty good idea since it leaves little doubt about the encoding. It is supposed to be removed when reading a UTF-8 encoded file. Sadly, Java does not do that, and bug reports against this issue are closed as “will not fix.” Your best bet is to strip out any leading \uFEFF that you find in your input.
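
Stripping the byte order mark takes only one extra line. The following is a minimal sketch with made-up names (BomDemo, decodeUtf8) that decodes a byte array and drops a leading \uFEFF if there is one.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Decodes UTF-8 bytes and removes a leading byte order mark, if present.
    static String decodeUtf8(byte[] bytes) {
        var text = new String(bytes, StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

In UTF-8, the byte order mark shows up as the three bytes EF BB BF at the start of the file.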

In addition to the UTF encodings, there are partial encodings that cover a character range suitable for a given user population. For example, ISO 8859-1 is a one-byte code that includes accented characters used in Western European languages. Shift_JIS is a variable-length code for Japanese characters. A large number of these encodings are still in widespread use.

There is no reliable way to automatically detect the character encoding from a stream of bytes. Some API methods let you use the “default charset”—the character encoding that is preferred by the operating system of the computer. Is that the same encoding that is used by your source of bytes? These bytes may well originate from a different part of the world. Therefore, you should always explicitly specify the encoding. For example, when reading a web page, check the Content-Type header.

The StandardCharsets class has static variables of type Charset for the character encodings that every Java virtual machine must support:

StandardCharsets.UTF_8
StandardCharsets.UTF_16
StandardCharsets.UTF_16BE
StandardCharsets.UTF_16LE
StandardCharsets.ISO_8859_1
StandardCharsets.US_ASCII

To obtain the Charset for another encoding, use the static forName method:

Charset shiftJIS = Charset.forName("Shift_JIS");

Use the Charset object when reading or writing text. For example, you can turn an array of bytes into a string as

var contents = new String(bytes, StandardCharsets.UTF_8);
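
To see why the choice of Charset matters, consider what happens when bytes are decoded with the wrong one. This sketch (the names MojibakeDemo and misread are invented for the example) encodes a string as UTF-8 and then decodes the bytes as ISO 8859-1.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes the bytes with the wrong charset.
    static String misread(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

The character é (U+00E9) takes two bytes in UTF-8, C3 and A9. Decoded as ISO 8859-1, each byte becomes a character of its own, and misread("\u00E9") yields the two-character string "\u00C3\u00A9". Pure ASCII text survives unharmed, which is why such bugs often go unnoticed for a long time.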

9.1.5 Text Input

To read text input, use a Reader. You can obtain a Reader from any input stream with the InputStreamReader adapter:

InputStream inStream = ...;
var in = new InputStreamReader(inStream, charset);

If you want to process the input one UTF-16 code unit at a time, you can call the read method:

int ch = in.read();

The method returns a code unit between 0 and 65535, or -1 at the end of input.

That is not very convenient. Here are several alternatives.

With a short text file, you can read it into a string like this:

String content = Files.readString(path, charset);

But if you want the file as a sequence of lines, call

List<String> lines = Files.readAllLines(path, charset);

If the file is large, process the lines lazily as a Stream<String>:

try (Stream<String> lines = Files.lines(path, charset)) {
    ...
}

To read numbers or words from a file, use a Scanner, as you have seen in Chapter 1. For example,

var in = new Scanner(path, StandardCharsets.UTF_8);
while (in.hasNextDouble()) {
    double value = in.nextDouble();
    ...
}

TIP

To read alphabetic words, set the scanner’s delimiter to a regular expression that is the complement of what you want to accept as a token. For example, after calling

in.useDelimiter("\\PL+");

the scanner reads words consisting of letters, since any sequence of nonletters is a delimiter. See Section 9.4.1, “The Regular Expression Syntax” (page 324) for the regular expression syntax.

You can then obtain a stream of all words as

Stream<String> words = in.tokens();
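
Here is a self-contained sketch of the delimiter technique. To keep it runnable without a file, it scans a string instead of a path. The names WordsDemo and words are made up. The call to toList requires Java 16 or later.

```java
import java.util.List;
import java.util.Scanner;

class WordsDemo {
    // Returns the words of the text, where a word is a maximal run of letters.
    static List<String> words(String text) {
        try (var in = new Scanner(text)) {
            in.useDelimiter("\\PL+"); // any run of nonletters separates tokens
            return in.tokens().toList();
        }
    }
}
```

With the input "Hello, world! 42 times." the digits and punctuation are all part of the delimiters, so only the three words remain.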

If your input does not come from a file, wrap the InputStream into a BufferedReader:

try (var reader = new BufferedReader(new InputStreamReader(url.openStream(), charset))) {
    Stream<String> lines = reader.lines();
    ...
}

A BufferedReader reads input in chunks for efficiency. (Oddly, this is not an option for basic readers.) It has methods readLine to read a single line and lines to yield a stream of lines.

If a method asks for a Reader and you want it to read from a file, call Files.newBufferedReader(path, charset).
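
As a small sketch of the BufferedReader approach, the following counts the nonblank lines of an arbitrary input stream. The names LinesDemo and countNonBlankLines are hypothetical, and the example assumes that the input is UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class LinesDemo {
    // Counts the nonblank lines of a stream that is assumed to be UTF-8 encoded.
    static long countNonBlankLines(InputStream inStream) throws IOException {
        try (var reader = new BufferedReader(
                new InputStreamReader(inStream, StandardCharsets.UTF_8))) {
            return reader.lines().filter(line -> !line.isBlank()).count();
        }
    }
}
```

Closing the BufferedReader also closes the underlying InputStreamReader and input stream.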

9.1.6 Text Output

To write text, use a Writer. With the write method, you can write strings. You can turn any output stream into a Writer:

OutputStream outStream = ...;
var out = new OutputStreamWriter(outStream, charset);
out.write(str);

To get a writer for a file, use

Writer out = Files.newBufferedWriter(path, charset);

It is more convenient to use a PrintWriter, which has the print, println, and printf methods that you have always used with System.out. Using those methods, you can print numbers and use formatted output.

If you write to a file, construct a PrintWriter like this:

var out = new PrintWriter(Files.newBufferedWriter(path, charset));

If you write to another stream, use

var out = new PrintWriter(new OutputStreamWriter(outStream, charset));

If you already have the text to write in a string, call

String content = ...;
Files.write(path, content.getBytes(charset));

or

Files.write(path, lines, charset);

Here, lines can be a Collection<String>, or even more generally, an Iterable<? extends CharSequence>.

To append to a file, use

Files.write(path, content.getBytes(charset), StandardOpenOption.APPEND);
Files.write(path, lines, charset, StandardOpenOption.APPEND);

Sometimes, a library method wants a Writer to write output. If you want to capture that output in a string, hand it a StringWriter. Or, if it wants a PrintWriter, wrap the StringWriter like this:

var writer = new StringWriter();
throwable.printStackTrace(new PrintWriter(writer));
String stackTrace = writer.toString();
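
The same wrapping works for formatted output in general. In this sketch, a PrintWriter collects the result of printf in a StringWriter. The names ReportDemo and report are made up. Passing Locale.ROOT is a choice made for this example so that the decimal separator does not depend on the default locale.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Locale;

class ReportDemo {
    // Formats a name and a price into a string by printing to a StringWriter.
    static String report(String name, double price) {
        var writer = new StringWriter();
        try (var out = new PrintWriter(writer)) {
            out.printf(Locale.ROOT, "%s: %.2f", name, price);
        }
        return writer.toString();
    }
}
```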

9.1.7 Reading and Writing Binary Data

The DataInput interface declares the following methods for reading a number, a character, a boolean value, or a string in binary format:

boolean readBoolean()
byte readByte()
int readUnsignedByte()
char readChar()
short readShort()
int readUnsignedShort()
int readInt()
long readLong()
float readFloat()
double readDouble()
String readUTF()
void readFully(byte[] b)

The DataOutput interface declares corresponding write methods.

The advantage of binary I/O is that it is fixed width and efficient. For example, writeInt always writes an integer as a big-endian 4-byte binary quantity regardless of the number of digits. The space needed is the same for each value of a given type, which speeds up random access. Also, reading binary data is faster than parsing text. The main drawback is that the resulting files cannot be easily inspected in a text editor.

You can use the DataInputStream and DataOutputStream adapters with any stream. For example,

DataInput in = new DataInputStream(Files.newInputStream(path));
DataOutput out = new DataOutputStream(Files.newOutputStream(path));
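
To illustrate the fixed-width layout, this sketch writes a record consisting of an int and a double to a byte array and reads the double back. The record layout and the names BinaryDemo, write, and readPrice are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class BinaryDemo {
    // Writes a record as a 4-byte int followed by an 8-byte double.
    static byte[] write(int id, double price) throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            out.writeInt(id);
            out.writeDouble(price);
        }
        return bytes.toByteArray();
    }

    // Reads the price from a record that was produced by write.
    static double readPrice(byte[] record) throws IOException {
        try (var in = new DataInputStream(new ByteArrayInputStream(record))) {
            in.readInt(); // read past the id
            return in.readDouble();
        }
    }
}
```

Every record takes up exactly 12 bytes, no matter how many digits the values have.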

9.1.8 Random-Access Files

The RandomAccessFile class lets you read or write data anywhere in a file. You can open a random-access file either for reading only or for both reading and writing; specify the option by using the string "r" (for read access) or "rw" (for read/write access) as the second argument in the constructor. For example,

var file = new RandomAccessFile(path.toString(), "rw");

A random-access file has a file pointer that indicates the position of the next byte to be read or written. The seek method sets the file pointer to an arbitrary byte position within the file. The argument to seek is a long integer between zero and the length of the file (which you can obtain with the length method). The getFilePointer method returns the current position of the file pointer.

The RandomAccessFile class implements both the DataInput and DataOutput interfaces. To read and write numbers from a random-access file, use methods such as readInt/writeInt that you saw in the preceding section. For example,

int value = file.readInt();
file.seek(file.getFilePointer() - 4);
file.writeInt(value + 1);
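
Because each int occupies four bytes, you can jump directly to the entry with a given index. Here is a sketch with made-up names (CounterFileDemo, increment, index). It assumes that the file already holds at least index + 1 integers.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

class CounterFileDemo {
    // Increments the int stored at the given index (4 bytes per entry)
    // and returns the new value.
    static int increment(Path path, int index) throws IOException {
        try (var file = new RandomAccessFile(path.toString(), "rw")) {
            file.seek(4L * index);                // jump to the entry
            int value = file.readInt();           // moves the file pointer forward by 4
            file.seek(file.getFilePointer() - 4); // move back to the start of the entry
            file.writeInt(value + 1);
            return value + 1;
        }
    }
}
```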

9.1.9 Memory-Mapped Files

Memory-mapped files provide another, very efficient approach for random access that works well for very large files. However, the API for data access is completely different from that of input/output streams. First, get a channel to the file:

FileChannel channel = FileChannel.open(path,
    StandardOpenOption.READ, StandardOpenOption.WRITE);

Then, map an area of the file (or, if it is not too large, the entire file) into memory:

ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE,
    0, channel.size());

Use methods get, getInt, getDouble, and so on to read values, and the equivalent put methods to write values.

int offset = ...;
int value = buffer.getInt(offset);
buffer.putInt(offset, value + 1);

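At some point, these changes are written back to the file. Closing the channel does not affect the mapping, so to make sure that pending changes are written, call the force method of the MappedByteBuffer that the map method returns.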

NOTE

By default, the methods for reading and writing numbers use big-endian byte order. You can change the byte order with the command

buffer.order(ByteOrder.LITTLE_ENDIAN);
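
Putting the steps together, the following sketch increments an integer in a memory-mapped file. The names MappedCounterDemo and increment are hypothetical. The code assumes that the file exists and holds at least offset + 4 bytes, and it uses the default big-endian byte order.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedCounterDemo {
    // Increments the int at the given byte offset and returns the new value.
    static int increment(Path path, int offset) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer = channel.map(
                FileChannel.MapMode.READ_WRITE, 0, channel.size());
            int value = buffer.getInt(offset);
            buffer.putInt(offset, value + 1);
            buffer.force(); // request that the change be written to the file now
            return value + 1;
        }
    }
}
```

Mapping the entire file is fine for a small example. With a very large file, you would map only the region that you need.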

9.1.10 File Locking

When multiple simultaneously executing programs modify the same file, they need to communicate in some way, or the file can easily become damaged. File locks can solve this problem.

Suppose your application saves a configuration file with user preferences. If a user invokes two instances of the application, it could happen that both of them want to write the configuration file at the same time. In that situation, the first instance should lock the file. When the second instance finds the file locked, it can decide to wait until the file is unlocked or simply skip the writing process. To lock a file, call either the lock or the tryLock method of the FileChannel class.

FileChannel channel = FileChannel.open(path, StandardOpenOption.WRITE);
FileLock lock = channel.lock();

or

FileLock lock = channel.tryLock();

The first call blocks until the lock becomes available. The second call returns immediately, either with the lock or with null if the lock is not available. The file remains locked until the lock or the channel is closed. It is best to use a try-with-resources statement:

try (FileLock lock = channel.lock()) {
    ...
}
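
For the configuration file scenario, the "skip the writing process" strategy might look like the following sketch. The names LockDemo and saveIfUnlocked are made up. The code relies on the fact that a try-with-resources statement skips the close call for a resource that is null.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LockDemo {
    // Replaces the file contents if the lock can be acquired right away.
    // Returns false if another program holds the lock.
    static boolean saveIfUnlocked(Path path, String contents) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.tryLock()) {
            if (lock == null) return false;
            channel.truncate(0); // discard the old contents only after locking
            var buffer = ByteBuffer.wrap(contents.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            return true;
        }
    }
}
```

File locks are held on behalf of the entire Java virtual machine. They coordinate separate programs, not the threads within one program.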