Implementation Patterns: Collections
I must say that I didn’t expect this chapter to amount to much. When I started writing it, I thought I would end up with an API document—types and operations. The basic idea is simple: a collection distinguishes between objects in the collection and those not in the collection. What more was there to say?
What I discovered is that collections are a far richer topic than I ever suspected, both in their structure and the possibilities they offer for communicating intent. The concept of collections blends several different metaphors. The metaphor you emphasize changes how you use collections. Each of the collection interfaces communicates a different variation on the theme of a sack of objects. Each of the implementations also communicates variations, mostly with regard to performance. The result is that mastering collections is a big part of learning to communicate well with code.
Collection-like behavior used to be implemented by providing links in the data structure itself: each page in a document would have links to the previous and next pages. More recently, the fashion has swung to using a separate object for the collection which relates elements. This allows the flexibility to put the same object in several different collections without modifying the object.
Collections are important because they are a way of expressing one of the most fundamental kinds of variation in programming, the variation of number. Variation in logic is expressed with conditionals or polymorphic messages. Variation in the cardinality of data is expressed by putting the data into a collection. The precise details of that collection reveal much about the intention of the original programmer to a reader.
There is an old (by computer terms) saying that the only interesting numbers are 0, 1 and many (this saying was not written by a numerical analyst). If the absence of a field expresses “zero” and the presence of a field expresses “one”, then a field holding a collection is a way of expressing “many”.
Collections hover in a strange world halfway between a programming language construct and a library. Collections are so universally useful, and their use is so well understood, that it almost seems time to have a mainstream language that allows statements like plural unique Book books; instead of the current Collection<Book> books= new HashSet<Book>();. Until collections are first-class language elements, it is important to know how to use the current collection library to express common ideas in straightforward ways.
The remainder of the chapter is divided into six parts: the metaphors behind collections, the issues to be expressed through the use of collections, the collection interfaces and what they mean to the reader, the collection implementations and what they say, an overview of functions available in the Collections class, and finally a discussion of extending collections through inheritance.
Metaphors
As suggested above, collections blend different metaphors. The first is that of a multi-valued variable. There is a sense in which a variable that refers to a collection is really a variable referring to several objects at the same time. Looked at this way, the collection disappears as a separate object. The collection’s identity is not interesting, only the objects to which it refers. As with all variables, you can assign to a multi-valued variable (add and remove elements), retrieve its value, and send the variable messages (with the for loop).
The multi-valued variable metaphor breaks down in Java because collections are separate objects with identity. The second metaphor mixed into collections is that of objects—a collection is an object. You can retrieve a collection, pass it around, test it for equality, and send it messages. Collections can be shared between objects, although this creates the possibility of aliasing problems. Because collections are a set of related interfaces and implementations, they are open to extension, both with expanded interfaces and new implementations. So, just as collections “are” multi-valued variables, they also “are” objects.
The combination of the two metaphors makes for some strange effects. Because a collection is implemented as an object that can be passed around, you get the equivalent of call-by-reference, where instead of passing a variable’s contents to a routine, you pass the variable itself. Changes to the variable’s value are reflected in the calling routine. Call-by-reference went out of fashion in language design a couple of decades ago because of the possibility for unintended consequences. It was hard to debug programs when you couldn’t be certain of all the places where a variable could be modified. Some of the conventions for programming with collections exist to avoid situations where it is hard to read the code and predict where a collection could be modified.
A third metaphor useful for thinking about collections is that of mathematical sets. A collection is a sack of objects just like a mathematical set is a sack of elements. A set divides the world into things in the set and things not in the set. A collection divides the world of objects into objects that are in the collection and objects that are not. Two basic operations on mathematical sets are finding their cardinality (the size() method of collections) and testing for inclusion (represented by the contains() method).
The mathematical metaphor is only approximate for collections. The other basic operations on sets—union, intersection, difference, and symmetric difference—are not directly represented by collections. Whether this is because these operations are intrinsically less useful or because they aren’t used because they aren’t available makes for an interesting debate.