2.2 Libraries and Clients
EACH PROGRAM THAT YOU HAVE WRITTEN so far consists of Java code that resides in a single .java file. For large programs, keeping all the code in a single file in this way is restrictive and unnecessary. Fortunately, it is very easy in Java to refer to a method in one file that is defined in another. This ability has two important consequences on our style of programming.
First, it enables code reuse. One program can make use of code that is already written and debugged, not by copying the code, but just by referring to it. This ability to define code that can be reused is an essential part of modern programming. It amounts to extending Java—you can define and use your own operations on data.
Second, it enables modular programming. You can not only divide a program up into static methods, as just described in SECTION 2.1, but also keep those methods in different files, grouped together according to the needs of the application. Modular programming is important because it allows us to independently develop, compile, and debug parts of big programs one piece at a time, leaving each finished piece in its own file for later use without having to worry about its details again. We develop libraries of static methods for use by any other program, keeping each library in its own file and using its methods in any other program. Java’s Math library and our Std* libraries for input/output are examples that you have already used. More importantly, you will soon see that it is very easy to define libraries of your own. The ability to define libraries and then to use them in multiple programs is a critical aspect of our ability to build programs to address complex tasks.
Having just moved in SECTION 2.1 from thinking of a Java program as a sequence of statements to thinking of a Java program as a class comprising a set of static methods (one of which is main()), you will be ready after this section to think of a Java program as a set of classes, each of which is an independent module consisting of a set of methods. Since each method can call a method in another class, all of your code can interact as a network of methods that call one another, grouped together in classes. With this capability, you can start to think about managing complexity when programming by breaking up programming tasks into classes that can be implemented and tested independently.
Using static methods in other programs
To refer to a static method in one class that is defined in another, we use the same mechanism that we have been using to invoke methods such as Math.sqrt() and StdOut.println():
Make both classes accessible to Java (for example, by putting them both in the same directory in your computer).
To call a method, prepend its class name and a period separator.
For example, we might wish to write a simple client SAT.java that takes an SAT score z from the command line and prints the percentage of students scoring less than z in a given year (in which the mean score was 1,019 and its standard deviation was 209). To get the job done, SAT.java needs to compute Φ((z–1,019) / 209), a task perfectly suited for the cdf() method in Gaussian.java (PROGRAM 2.1.2). All that we need to do is to keep Gaussian.java in the same directory as SAT.java and prepend the class name when calling cdf(). Moreover, any other class in that directory can make use of the static methods defined in Gaussian, by calling Gaussian.pdf() or Gaussian.cdf(). The Math library is always accessible in Java, so any class can call Math.sqrt() and Math.exp(), as usual. The files Gaussian.java, SAT.java, and Math.java implement Java classes that interact with one another: SAT calls a method in Gaussian, which calls another method in Gaussian, which then calls two methods in Math.
The potential effect of programming by defining multiple files, each an independent class with multiple methods, is another profound change in our programming style. Generally, we refer to this approach as modular programming. We independently develop and debug methods for an application and then utilize them at any later time. In this section, we will consider numerous illustrative examples to help you get used to the idea. However, there are several details about the process that we need to discuss before considering more examples.
The public keyword
We have been identifying every static method as public since HelloWorld. This modifier identifies the method as available for use by any other program with access to the file. You can also identify methods as private (and there are a few other categories), but you have no reason to do so at this point. We will discuss various options in SECTION 3.3.
Each module is a class
We use the term module to refer to all the code that we keep in a single file. In Java, by convention, each module is a Java class that is kept in a file with the same name of the class but has a .java extension. In this chapter, each class is merely a set of static methods (one of which is main()). You will learn much more about the general structure of the Java class in CHAPTER 3.
The .class file
When you compile the program (by typing javac followed by the class name), the Java compiler makes a file with the class name followed by a .class extension that has the code of your program in a language more suited to your computer. If you have a .class file, you can use the module’s methods in another program even without having the source code in the corresponding .java file (but you are on your own if you discover a bug!).
Compile when necessary
When you compile a program, Java typically compiles everything that needs to be compiled in order to run that program. If you call Gaussian.cdf() in SAT, then, when you type javac SAT.java, the compiler will also check whether you modified Gaussian.java since the last time it was compiled (by checking the time it was last changed against the time Gaussian.class was created). If so, it will also compile Gaussian.java! If you think about this approach, you will agree that it is actually quite helpful. After all, if you find a bug in Gaussian.java (and fix it), you want all the classes that call methods in Gaussian to use the new version.
Multiple main() methods
Another subtle point is to note that more than one class might have a main() method. In our example, both SAT and Gaussian have their own main() method. If you recall the rule for executing a program, you will see that there is no confusion: when you type java followed by a class name, Java transfers control to the machine code corresponding to the main() method defined in that class. Typically, we include a main() method in every class, to test and debug its methods. When we want to run SAT, we type java SAT; when we want to debug Gaussian, we type java Gaussian (with appropriate command-line arguments).
IF YOU THINK OF EACH PROGRAM that you write as something that you might want to make use of later, you will soon find yourself with all sorts of useful tools. Modular programming allows us to view every solution to a computational problem that we may develop as adding value to our computational environment.
For example, suppose that you need to evaluate Φ for some future application. Why not just cut and paste the code that implements cdf() from Gaussian? That would work, but would leave you with two copies of the code, making it more difficult to maintain. If you later want to fix or improve this code, you would need to do so in both copies. Instead, you can just call Gaussian.cdf(). Our implementations and uses of our methods are soon going to proliferate, so having just one copy of each is a worthy goal.
From this point forward, you should write every program by identifying a reasonable way to divide the computation into separate parts of a manageable size and implementing each part as if someone will want to use it later. Most frequently, that someone will be you, and you will have yourself to thank for saving the effort of rewriting and re-debugging code.
Libraries
We refer to a module whose methods are primarily intended for use by many other programs as a library. One of the most important characteristics of programming in Java is that thousands of libraries have been predefined for your use. We reveal information about those that might be of interest to you throughout the book, but we will postpone a detailed discussion of the scope of Java libraries, because many of them are designed for use by experienced programmers. Instead, we focus in this chapter on the even more important idea that we can build user-defined libraries, which are nothing more than classes that contain a set of related methods for use by other programs. No Java library can contain all the methods that we might need for a given computation, so this ability to create our own library of methods is a crucial step in addressing complex programming applications.
Clients
We use the term client to refer to a program that calls a given library method. When a class contains a method that is a client of a method in another class, we say that the first class is a client of the second class. In our example, Gaussian is a client of SAT. A given class might have multiple clients. For example, all of the programs that you have written that call Math.sqrt() or Math.random() are clients of Math.
APIs
Programmers normally think in terms of a contract between the client and the implementation that is a clear specification of what the method is to do. When you are writing both clients and implementations, you are making contracts with yourself, which by itself is helpful because it provides extra help in debugging. More important, this approach enables code reuse. You have been able to write programs that are clients of Std* and Math and other built-in Java classes because of an informal contract (an English-language description of what they are supposed to do) along with a precise specification of the signatures of the methods that are available for use. Collectively, this information is known as an application programming interface (API). This same mechanism is effective for user-defined libraries. The API allows any client to use the library without having to examine the code in the implementation, as you have been doing for Math and Std*. The guiding principle in API design is to provide to clients the methods they need and no others. An API with a huge number of methods may be a burden to implement; an API that is lacking important methods may be unnecessarily inconvenient for clients.
Implementations
We use the term implementation to describe the Java code that implements the methods in an API, kept by convention in a file with the library name and a .java extension. Every Java program is an implementation of some API, and no API is of any use without some implementation. Our goal when developing an implementation is to honor the terms of the contract. Often, there are many ways to do so, and separating client code from implementation code gives us the freedom to substitute new and improved implementations.
FOR EXAMPLE, CONSIDER THE GAUSSIAN DISTRIBUTION functions. These do not appear in Java’s Math library but are important in applications, so it is worthwhile for us to put them in a library where they can be accessed by future client programs and to articulate this API:
API for our library of static methods for Gaussian distribution functions
public class Gaussian | |
double pdf(double x) |
ϕ(x) |
double pdf(double x, double mu, double sigma) |
ϕ(x,μ,σ) |
double cdf(double z) |
ϕ(z) |
double cdf(double z, double mu, double sigma) |
ϕ(z,μ,σ) |
The API includes not only the one-argument Gaussian distribution functions that we have previously considered (see PROGRAM 2.1.2) but also three-argument versions (in which the client specifies the mean and standard deviation of the distribution) that arise in many statistical applications. Implementing the three-argument Gaussian distribution functions is straightforward (see EXERCISE 2.2.1).
How much information should an API contain? This is a gray area and a hotly debated issue among programmers and computer-science educators. We might try to put as much information as possible in the API, but (as with any contract!) there are limits to the amount of information that we can productively include. In this book, we stick to a principle that parallels our guiding design principle: provide to client programmers the information they need and no more. Doing so gives us vastly more flexibility than the alternative of providing detailed information about implementations. Indeed, any extra information amounts to implicitly extending the contract, which is undesirable. Many programmers fall into the bad habit of checking implementation code to try to understand what it does. Doing so might lead to client code that depends on behavior not specified in the API, which would not work with a new implementation. Implementations change more often than you might think. For example, each new release of Java contains many new implementations of library functions.
Often, the implementation comes first. You might have a working module that you later decide would be useful for some task, and you can just start using its methods in other programs. In such a situation, it is wise to carefully articulate the API at some point. The methods may not have been designed for reuse, so it is worthwhile to use an API to do such a design (as we did for Gaussian).
The remainder of this section is devoted to several examples of libraries and clients. Our purpose in considering these libraries is twofold. First, they provide a richer programming environment for your use as you develop increasingly sophisticated client programs of your own. Second, they serve as examples for you to study as you begin to develop libraries for your own use.
Random numbers
We have written several programs that use Math.random(), but our code often uses particular idioms that convert the random double values between 0 and 1 that Math.random() provides to the type of random numbers that we want to use (random boolean values or random int values in a specified range, for example). To effectively reuse our code that implements these idioms, we will, from now on, use the StdRandom library in PROGRAM 2.2.1. StdRandom uses overloading to generate random numbers from various distributions. You can use any of them in the same way that you use our standard I/O libraries (see the first Q&A at the end of SECTION 2.1). As usual, we summarize the methods in our StdRandom library with an API:
API for our library of static methods for random numbers
public class StdRandom | |
void setSeed(long seed) |
set the seed for reproducible results |
int uniform(int n) |
integer between 0 and n-1 |
double uniform(double lo, double hi) |
floating-point number between lo and hi |
boolean bernoulli(double p) |
true with probability p, false otherwise |
double gaussian() |
Gaussian, mean 0, standard deviation 1 |
double gaussian(double mu, double sigma) |
Gaussian, mean mu, standard deviation sigma |
int discrete(double[] p) |
i with probability p[i] |
void shuffle(double[] a) |
randomly shuffle the array a[] |
These methods are sufficiently familiar that the short descriptions in the API suffice to specify what they do. By collecting all of these methods that use Math.random() to generate random numbers of various types in one file (StdRandom.java), we concentrate our attention on generating random numbers to this one file (and reuse the code in that file) instead of spreading them through every program that uses these methods. Moreover, each program that uses one of these methods is clearer than code that calls Math.random() directly, because its purpose for using Math.random() is clearly articulated by the choice of method from StdRandom.
API design
We make certain assumptions about the values passed to each method in StdRandom. For example, we assume that clients will call uniform(n) only for positive integers n, bernoulli(p) only for p between 0 and 1, and discrete() only for an array whose elements are between 0 and 1 and sum to 1. All of these assumptions are part of the contract between the client and the implementation. We strive to design libraries such that the contract is clear and unambiguous and to avoid getting bogged down with details. As with many tasks in programming, a good API design is often the result of several iterations of trying and living with various possibilities. We always take special care in designing APIs, because when we change an API we might have to change all clients and all implementations. Our goal is to articulate what clients can expect separate from the code in the API. This practice frees us to change the code, and perhaps to use an implementation that achieves the desired effect more efficiently or with more accuracy.
Program 2.2.1 Random number library
public class StdRandom
{
public static int uniform(int n)
{ return (int) (Math.random() * n); }
public static double uniform(double lo, double hi)
{ return lo + Math.random() * (hi - lo); }
public static boolean bernoulli(double p)
{ return Math.random() < p; }
public static double gaussian()
{ /* See Exercise 2.2.17. */ }
public static double gaussian(double mu, double sigma)
{ return mu + sigma * gaussian(); }
public static int discrete(double[] probabilities)
{ /* See Program 1.6.2. */ }
public static void shuffle(double[] a)
{ /* See Exercise 2.2.4. */ }
public static void main(String[] args)
{ /* See text. */ }
}
The methods in this library compute various types of random numbers: random nonnegative integer less than a given value, uniformly distributed in a given range, random bit (Bernoulli), standard Gaussian, Gaussian with given mean and standard deviation, and distributed according to a given discrete distribution.
% java StdRandom 5
90 26.36076 false 8.79269 0
13 18.02210 false 9.03992 1
58 56.41176 true 8.80501 0
29 16.68454 false 8.90827 0
85 86.24712 true 8.95228 0
Unit testing
Even though we implement StdRandom without reference to any particular client, it is good programming practice to include a test client main() that, although not used when a client class uses the library, is helpful when debugging and testing the methods in the library. Whenever you create a library, you should include a main() method for unit testing and debugging. Proper unit testing can be a significant programming challenge in itself (for example, the best way of testing whether the methods in StdRandom produce numbers that have the same characteristics as truly random numbers is still debated by experts). At a minimum, you should always include a main() method that
Exercises all the code
Provides some assurance that the code is working
Takes an argument from the command line to allow more testing
Then, you should refine that main() method to do more exhaustive testing as you use the library more extensively. For example, we might start with the following code for StdRandom (leaving the testing of shuffle() for an exercise):
public static void main(String[] args)
{
int n = Integer.parseInt(args[0]);
double[] probabilities = { 0.5, 0.3, 0.1, 0.1 };
for (int i = 0; i < n; i++)
{
StdOut.printf(" %2d " , uniform(100));
StdOut.printf("%8.5f ", uniform(10.0, 99.0));
StdOut.printf("%5b " , bernoulli(0.5));
StdOut.printf("%7.5f ", gaussian(9.0, 0.2));
StdOut.printf("%2d " , discrete(probabilities));
StdOut.println();
}
}
When we include this code in StdRandom.java and invoke this method as illustrated in PROGRAM 2.2.1, the output includes no surprises: the integers in the first column might be equally likely to be any value from 0 to 99; the numbers in the second column might be uniformly spread between 10.0 and 99.0; about half of the values in the third column are true; the numbers in the fourth column seem to average about 9.0, and seem unlikely to be too far from 9.0; and the last column seems to be not far from 50% 0s, 30% 1s, 10% 2s, and 10% 3s. If something seems amiss in one of the columns, we can type java StdRandom 10 or 100 to see many more results. In this particular case, we can (and should) do far more extensive testing in a separate client to check that the numbers have many of the same properties as truly random numbers drawn from the cited distributions (see EXERCISE 2.2.3). One effective approach is to write test clients that use StdDraw, as data visualization can be a quick indication that a program is behaving as intended. For example, a plot of a large number of points whose x- and y-coordinates are both drawn from various distributions often produces a pattern that gives direct insight into the important properties of the distribution. More important, a bug in the random number generation code is likely to show up immediately in such a plot.
Stress testing
An extensively used library such as StdRandom should also be subjected to stress testing, where we make sure that it does not crash when the client does not follow the contract or makes some assumption that is not explicitly covered. Java libraries have already been subjected to such stress testing, which requires carefully examining each line of code and questioning whether some condition might cause a problem. What should discrete() do if the array elements do not sum to exactly 1? What if the argument is an array of length 0? What should the two-argument uniform() do if one or both of its arguments is NaN? Infinity? Any question that you can think of is fair game. Such cases are sometimes referred to as corner cases. You are certain to encounter a teacher or a supervisor who is a stickler about corner cases. With experience, most programmers learn to address them early, to avoid an unpleasant bout of debugging later. Again, a reasonable approach is to implement a stress test as a separate client.
Input and output for arrays
We have seen—and will continue to see—many examples where we wish to keep data in arrays for processing. Accordingly, it is useful to build a library that complements StdIn and StdOut by providing static methods for reading arrays of primitive types from standard input and printing them to standard output. The following API provides these methods:
API for our library of static methods for array input and output
public class StdArrayIO | |
double[] readDouble1D() |
read a one-dimensional array of double values |
double[][] readDouble2D() |
read a two-dimensional array of double values |
void print(double[] a) |
print a one-dimensional array of double values |
void print(double[][] a) |
print a two-dimensional array of double values |
Note 1. 1D format is an integer n followed by n values. | |
Note 2. 2D format is two integers m and n followed by m × n values in row-major order. | |
Note 3. Methods for int and boolean are also included. |
The first two notes at the bottom of the table reflect the idea that we need to settle on a file format. For simplicity and harmony, we adopt the convention that all values appearing in standard input include the dimension(s) and appear in the order indicated. The read*() methods expect input in this format; the print() methods produce output in this format. The third note at the bottom of the table indicates that StdArrayIO actually contains 12 methods—four each for int, double, and boolean. The print() methods are overloaded (they all have the same name print() but different types of arguments), but the read*() methods need different names, formed by adding the type name (capitalized, as in StdIn) followed by 1D or 2D.
Implementing these methods is straightforward from the array-processing code that we have considered in SECTION 1.4 and in SECTION 2.1, as shown in StdArrayIO (PROGRAM 2.2.2). Packaging up all of these static methods into one file—StdArrayIO.java—allows us to easily reuse the code and saves us from having to worry about the details of reading and printing arrays when writing client programs later on.
Program 2.2.2 Array I/O library
public class StdArrayIO
{
public static double[] readDouble1D()
{ /* See Exercise 2.2.11. */ }
public static double[][] readDouble2D()
{
int m = StdIn.readInt();
int n = StdIn.readInt();
double[][] a = new double[m][n];
for (int i = 0; i < m; i++)
for (int j = 0; j < n; j++)
a[i][j] = StdIn.readDouble();
return a;
public static void print(double[] a)
{ /* See Exercise 2.2.11. */ }
public static void print(double[][] a)
{
int m = a.length;
int n = a[0].length;
System.out.println(m + " " + n);
for (int i = 0; i < m; i++)
{
for (int j = 0; j < n; j++)
StdOut.prinf("%9.5f ", a[i][j]);
StdOut.println();
}
StdOut.println();
}
// Methods for other types are similar (see booksite).
public static void main(String[] args)
{ print(readDouble2D()); }
}
% more tiny2D.txt
4 3
.000 .270 .000
.246 .224 -.036
.222 .176 .0893
-.032 .739 .270
% java StdArrayIO < tiny.txt
4 3
0.00000 0.27000 0.00000
0.24600 0.22400 -0.03600
0.22200 0.17600 0.08930
-0.03200 0.73900 0.27000
This library of static methods facilitates reading one-dimensional and two-dimensional arrays from standard input and printing them to standard output. The file format includes the dimensions (see accompanying text). Numbers in the output in the example are truncated.
Iterated function systems
Scientists have discovered that complex visual images can arise unexpectedly from simple computational processes. With StdRandom, StdDraw, and StdArrayIO, we can study the behavior of such systems.
Sierpinski triangle
As a first example, consider the following simple process: Start by plotting a point at one of the vertices of a given equilateral triangle. Then pick one of the three vertices at random and plot a new point halfway between the point just plotted and that vertex. Continue performing this same operation. Each time, we are pick a random vertex from the triangle to establish the line whose midpoint will be the next point plotted. Since we make random choices, the set of points should have some of the characteristics of random points, and that does seem to be the case after the first few iterations:
We can study the process for a large number of iterations by writing a program to plot trials points according to the rules:
double[] cx = { 0.000, 1.000, 0.500 };
double[] cy = { 0.000, 0.000, 0.866 };
double x = 0.0, y = 0.0;
for (int t = 0; t < trials; t++)
{
int r = StdRandom.uniform(3);
x = (x + cx[r]) / 2.0;
y = (y + cy[r]) / 2.0;
StdDraw.point(x, y);
}
We keep the x- and y-coordinates of the triangle vertices in the arrays cx[] and cy[], respectively. We use StdRandom.uniform() to choose a random index r into these arrays—the coordinates of the chosen vertex are (cx[r], cy[r]). The x-coordinate of the midpoint of the line from (x, y) to that vertex is given by the expression (x + cx[r])/2.0, and a similar calculation gives the y-coordinate. Adding a call to StdDraw.point() and putting this code in a loop completes the implementation. Remarkably, despite the randomness, the same figure always emerges after a large number of iterations! This figure is known as the Sierpinski triangle (see EXERCISE 2.3.27). Understanding why such a regular figure should arise from such a random process is a fascinating question.
Barnsley fern
To add to the mystery, we can produce pictures of remarkable diversity by playing the same game with different rules. One striking example is known as the Barnsley fern. To generate it, we use the same process, but this time driven by the following table of formulas. At each step, we choose the formulas to use to update x and y with the indicated probability (1% of the time we use the first pair of formulas, 85% of the time we use the second pair of formulas, and so forth).
probability |
x-update |
y-update |
1% |
x = 0.500 |
y = 0.16y |
85% |
x = 0.85x + 0.04y + 0.075 |
y = −0.04x + 0.85y + 0.180 |
7% |
x = 0.20x − 0.26y + 0.400 |
y = 0.23x + 0.22y + 0.045 |
7% |
x = −0.15x + 0.28y + 0.575 |
y = 0.26x + 0.24y − 0.086 |
Program 2.2.3 Iterated function systems
public class IFS
{
public static void main(String[] args)
{ // Plot trials iterations of IFS on StdIn.
int trials = Integer.parseInt(args[0]);
double[] dist = StdArrayIO.readDouble1D();
double[][] cx = StdArrayIO.readDouble2D();
double[][] cy = StdArrayIO.readDouble2D();
double x = 0.0, y = 0.0;
for (int t = 0; t < trials; t++)
{ // Plot 1 iteration.
int r = StdRandom.discrete(dist);
double x0 = cx[r][0]*x + cx[r][1]*y + cx[r][2];
double y0 = cy[r][0]*x + cy[r][1]*y + cy[r][2];
x = x0;
y = y0;
StdDraw.point(x, y);
}
}
}
trials | iterations
dist[] | probabilities
cx[][] | x coefficients
cy[][] | y coefficients
x, y | current point
This data-driven client of StdArrayIO, StdRandom, and StdDraw iterates the function system defined by a 1-by-m vector (probabilities) and two m-by-3 matrices (coefficients for updating x and y, respectively) on standard input, plotting the result as a set of points on standard drawing. Curiously, this code does not need to know the value of m, as it uses separate methods to create and process the matrices.
% more sierpinski.txt
3
.33 .33 .34
3 3
.50 .00 .00
.50 .00 .50
.50 .00 .25
3 3
.00 .50 .00
.00 .50 .00
.00 .50 .433
We could write code just like the code we just wrote for the Sierpinski triangle to iterate these rules, but matrix processing provides a uniform way to generalize that code to handle any set of rules. We have m different transformations, chosen from a 1-by-m vector with StdRandom.discrete(). For each transformation, we have an equation for updating x and an equation for updating y, so we use two m-by-3 matrices for the equation coefficients, one for x and one for y. IFS (PROGRAM 2.2.3) implements this data-driven version of the computation. This program enables limitless exploration: it performs the iteration for any input containing a vector that defines the probability distribution and the two matrices that define the coefficients, one for updating x and the other for updating y. For the coefficients just given, again, even though we choose a random equation at each step, the same figure emerges every time that we do this computation: an image that looks remarkably similar to a fern that you might see in the woods, not something generated by a random process on a computer.
That the same short program that takes a few numbers from standard input and plots points on standard drawing can (given different data) produce both the Sierpinski triangle and the Barnsley fern (and many, many other images) is truly remarkable. Because of its simplicity and the appeal of the results, this sort of calculation is useful in making synthetic images that have a realistic appearance in computer-generated movies and games.
Perhaps more significantly, the ability to produce such realistic diagrams so easily suggests intriguing scientific questions: What does computation tell us about nature? What does nature tell us about computation?
Statistics
Next, we consider a library for a set of mathematical calculations and basic visualization tools that arise in all sorts of applications in science and engineering and are not all implemented in standard Java libraries. These calculations relate to the task of understanding the statistical properties of a set of numbers. Such a library is useful, for example, when we perform a series of scientific experiments that yield measurements of a quantity. One of the most important challenges facing modern scientists is proper analysis of such data, and computation is playing an increasingly important role in such analysis. These basic data analysis methods that we will consider are summarized in the following API:
API for our library of static methods for data analysis
public class StdStats | |
double max(double[] a) |
largest value |
double min(double[] a) |
smallest value |
double mean(double[] a) |
average |
double var(double[] a) |
sample variance |
double stddev(double[] a) |
sample standard deviation |
double median(double[] a) |
median |
void plotPoints(double[] a) |
plot points at (i, a[i]) |
void plotLines(double[] a) |
plot lines connecting points at (i, a[i]) |
void plotBars(double[] a) |
plot bars to points at (i, a[i]) |
Note: Overloaded implementations are included for other numeric types. |
Basic statistics
Suppose that we have n measurements x0, x1, ..., xn–1. The average value of those measurements, otherwise known as the mean, is given by the formula μ = (x0 + x1 + ... + xn–1) / n and is an estimate of the value of the quantity. The minimum and maximum values are also of interest, as is the median (the value that is smaller than and larger than half the values). Also of interest is the sample variance, which is given by the formula
σ2 = ( ( x0– μ)2 + (x1 – μ)2 + ... + (xn–1 – μ)2) / (n–1)
Program 2.2.4 Data analysis library
public class StdStats
{
public static double max(double[] a)
{ // Compute maximum value in a[].
double max = Double.NEGATIVE_INFINITY;
for (int i = 0; i < a.length; i++)
if (a[i] > max) max = a[i];
return max;
}
public static double mean(double[] a)
{ // Compute the average of the values in a[].
double sum = 0.0;
for (int i = 0; i < a.length; i++)
sum = sum + a[i];
return sum / a.length;
}
public static double var(double[] a)
{ // Compute the sample variance of the values in a[].
double avg = mean(a);
double sum = 0.0;
for (int i = 0; i < a.length; i++)
sum += (a[i] - avg) * (a[i] - avg);
return sum / (a.length - 1);
}
public static double stddev(double[] a)
{ return Math.sqrt(var(a)); }
// See Program 2.2.5 for plotting methods.
public static void main(String[] args)
{ /* See text. */ }
}
This code implements methods to compute the maximum, mean, variance, and standard deviation of numbers in a client array. The method for computing the minimum is omitted; plotting methods are in PROGRAM 2.2.5; see EXERCISE 4.2.20 for median().
% more tiny1D.txt
5
3.0 1.0 2.0 5.0 4.0
% java StdStats < tiny1D.txt
min <1.000
mean 3.000
max 5.000
std dev 1.581
and the sample standard deviation, the square root of the sample variance. StdStats (PROGRAM 2.2.4) shows implementations of static methods for computing these basic statistics (the median is more difficult to compute than the others—we will consider the implementation of median() in SECTION 4.2). The main() test client for StdStats reads numbers from standard input into an array and calls each of the methods to print the minimum, mean, maximum, and standard deviation, as follows:
public static void main(String[] args)
{
double[] a = StdArrayIO.readDouble1D();
StdOut.printf(" min %7.3f\n", min(a));
StdOut.printf(" mean %7.3f\n", mean(a));
StdOut.printf(" max %7.3f\n", max(a));
StdOut.printf(" std dev %7.3f\n", stddev(a));
}
As with StdRandom, a more extensive test of the calculations is called for (see EXERCISE 2.2.3). Typically, as we debug or test new methods in the library, we adjust the unit testing code accordingly, testing the methods one at a time. A mature and widely used library like StdStats also deserves a stress-testing client for extensively testing everything after any change. If you are interested in seeing what such a client might look like, you can find one for StdStats on the booksite. Most experienced programmers will advise you that any time spent doing unit testing and stress testing will more than pay for itself later.
Plotting
One important use of StdDraw is to help us visualize data rather than relying on tables of numbers. In a typical situation, we perform experiments, save the experimental data in an array, and then compare the results against a model, perhaps a mathematical function that describes the data. To expedite this process for the typical case where values of one variable are equally spaced, our StdStats library contains static methods that you can use for plotting data in an array. PROGRAM 2.2.5 is an implementation of the plotPoints(), plotLines(), and plotBars() methods for StdStats. These methods display the values in the argument array at evenly spaced intervals in the drawing window, either connected together by line segments (lines), filled circles at each value (points), or bars from the x-axis to the value (bars). They all plot the points with x-coordinate i and y-coordinate a[i] using filled circles, lines through the points, and bars, respectively. In addition, they all rescale x to fill the drawing window (so that the points are evenly spaced along the x-coordinate) and leave to the client scaling of the y-coordinates.
Program 2.2.5 Plotting data values in an array
public static void plotPoints(double[] a)
{ // Plot points at (i, a[i]).
int n = a.length;
StdDraw.setXscale(-1, n);
StdDraw.setPenRadius(1/(3.0*n));
for (int i = 0; i < n; i++)
StdDraw.point(i, a[i]);
}
public static void plotLines(double[] a)
{ // Plot lines through points at (i, a[i]).
int n = a.length;
StdDraw.setXscale(-1, n);
StdDraw.setPenRadius();
for (int i = 1; i < n; i++)
StdDraw.line(i-1, a[i-1], i, a[i]);
}
public static void plotBars(double[] a)
{ // Plot bars from (0, a[i]) to (i, a[i]).
int n = a.length;
StdDraw.setXscale(-1, n);
for (int i = 0; i < n; i++)
StdDraw.filledRectangle(i, a[i]/2, 0.25, a[i]/2);
}
This code implements three methods in StdStats (PROGRAM 2.2.4) for plotting data. They plot the points (i, a[i]) with filled circles, connecting line segments, and bars, respectively.
These methods are not intended to be a general-purpose plotting package, but you can certainly think of all sorts of things that you might want to add: different types of spots, labeled axes, color, and many other artifacts are commonly found in modern systems that can plot data. Some situations might call for more complicated methods than these.
Our intent with StdStats is to introduce you to data analysis while showing you how easy it is to define a library to take care of useful tasks. Indeed, this library has already proved useful—we use these plotting methods to produce the figures in this book that depict function graphs, sound waves, and experimental results. Next, we consider several examples of their use.
Plotting function graphs
You can use the StdStats.plot*() methods to draw a plot of the function graph for any function at all: choose an x-interval where you want to plot the function, compute function values evenly spaced through that interval and store them in an array, determine and set the y-scale, and then call StdStats.plotLines() or another plot*() method. For example, to plot a sine function, rescale the y-axis to cover values between –1 and +1. Scaling the x-axis is automatically handled by the StdStats methods. If you do not know the range, you can handle the situation by calling:
StdDraw.setYscale(StdStats.min(a), StdStats.max(a));
The smoothness of the curve is determined by properties of the function and by the number of points plotted. As we discussed when first considering StdDraw, you have to be careful to sample enough points to catch fluctuations in the function. We will consider another approach to plotting functions based on sampling values that are not equally spaced in SECTION 2.4.
Plotting sound waves
Both the StdAudio library and the StdStats plot methods work with arrays that contain sampled values at regular intervals. The diagrams of sound waves in SECTION 1.5 and at the beginning of this section were all produced by first scaling the y-axis with StdDraw.setYscale(-1, 1), then plotting the points with StdStats.plotPoints(). As you have seen, such plots give direct insight into processing audio. You can also produce interesting effects by plotting sound waves as you play them with StdAudio, although this task is a bit challenging because of the huge amount of data involved (see EXERCISE 1.5.23).
Plotting experimental results
You can put multiple plots on the same drawing. One typical reason to do so is to compare experimental results with a theoretical model. For example, Bernoulli (PROGRAM 2.2.6) counts the number of heads found when a fair coin is flipped n times and compares the result with the predicted Gaussian probability density function. A famous result from probability theory is that the distribution of this quantity is the binomial distribution, which is extremely well approximated by the Gaussian distribution with mean n/2 and standard deviation . The more trials we perform, the more accurate the approximation. The drawing produced by Bernoulli is a succinct summary of the results of the experiment and a convincing validation of the theory. This example is prototypical of a scientific approach to applications programming that we use often throughout this book and that you should use whenever you run an experiment. If a theoretical model that can explain your results is available, a visual plot comparing the experiment to the theory can validate both.
THESE FEW EXAMPLES ARE INTENDED TO suggest what is possible with a well-designed library of static methods for data analysis. Several extensions and other ideas are explored in the exercises. You will find StdStats to be useful for basic plots, and you are encouraged to experiment with these implementations and to modify them or to add methods to make your own library that can draw plots of your own design. As you continue to address an ever-widening circle of programming tasks, you will naturally be drawn to the idea of developing tools like these for your own use.
Program 2.2.6 Bernoulli trials
public class Bernoulli
{
public static int binomial(int n)
{ // Simulate flipping a coin n times; return # heads.
int heads = 0;
for (int i = 0; i < n; i++)
if (StdRandom.bernoulli(0.5)) heads++;
return heads;
}
public static void main(String[] args)
{ // Perform Bernoulli trials, plot results and model.
int n = Integer.parseInt(args[0]);
int trials = Integer.parseInt(args[1]);
int[] freq = new int[n+1];
for (int t = 0; t < trials; t++)
freq[binomial(n)]++;
double[] norm = new double[n+1];
for (int i = 0; i <= n; i++)
norm[i] = (double) freq[i] / trials;
StdStats.plotBars(norm);
double mean = n / 2.0;
double stddev = Math.sqrt(n) / 2.0;
double[] phi = new double[n+1];
for (int i = 0; i <= n; i++)
phi[i] = Gaussian.pdf(i, mean, stddev);
StdStats.plotLines(phi);
}
}
n | number of flips per trial
trials | number of trials
freq[] | experimental results
norm[] | normalized results
phi[] | Gaussian model
This StdStats, StdRandom, and Gaussian client provides visual evidence that the number of heads observed when a fair coin is flipped n times obeys a Gaussian distribution.
Modular programming
The library implementations that we have developed illustrate a programming style known as modular programming. Instead of writing a new program that is self-contained within its own file to address a new problem, we break up each task into smaller, more manageable subtasks, then implement and independently debug code that addresses each subtask. Good libraries facilitate modular programming by allowing us to define and provide solutions for important subtasks for future clients. Whenever you can clearly separate tasks within a program, you should do so. Java supports such separation by allowing us to independently debug and later use classes in separate files. Traditionally, programmers use the term module to refer to code that can be compiled and run independently; in Java, each class is a module.
IFS (PROGRAM 2.2.3) exemplifies modular programming. This relatively sophisticated computation is implemented with several relatively small modules, developed independently. It uses StdRandom and StdArrayIO, as well as the methods from Integer and StdDraw that we are accustomed to using. If we were to put all of the code required for IFS in a single file, we would have a large amount of code on our hands to maintain and debug; with modular programming, we can study iterated function systems with some confidence that the arrays are read properly and that the random number generator will produce properly distributed values, because we already implemented and tested the code for these tasks in separate modules.
Summary of classes in this section
API |
description |
|
Gaussian distribution functions |
|
random numbers |
|
input and output for arrays |
|
client for iterated function systems |
|
functions for data analysis |
|
client for Bernoulli trials |
Similarly, Bernoulli (PROGRAM 2.2.6) exemplifies modular programming. It is a client of Gaussian, Integer, Math, StdRandom, and StdStats. Again, we can have some confidence that the methods in these modules produce the expected results because they are system libraries or libraries that we have tested, debugged, and used before.
To describe the relationships among modules in a modular program, we often draw a dependency graph, where we connect two class names with an arrow labeled with the name of a method if the first class contains a method call and the second class contains the definition of the method. Such diagrams play an important role because understanding the relationships among modules is necessary for proper development and maintenance.
We emphasize modular programming throughout this book because it has many important advantages that have come to be accepted as essential in modern programming, including the following:
We can have programs of a reasonable size, even in large systems.
Debugging is restricted to small pieces of code.
We can reuse code without having to re-implement it.
Maintaining (and improving) code is much simpler.
The importance of these advantages is difficult to overstate, so we will expand upon each of them.
Programs of a reasonable size
No large task is so complex that it cannot be divided into smaller subtasks. If you find yourself with a program that stretches to more than a few pages of code, you must ask yourself the following questions: Are there subtasks that could be implemented separately? Could some of these subtasks be logically grouped together in a separate library? Could other clients use this code in the future? At the other end of the range, if you find yourself with a huge number of tiny modules, you must ask yourself questions such as these: Is there some group of subtasks that logically belong in the same module? Is each module likely to be used by multiple clients? There is no hard-and-fast rule on module size: one implementation of a critically important abstraction might properly be a few lines of code, whereas another library with a large number of overloaded methods might properly stretch to hundreds of lines of code.
Debugging
Tracing a program rapidly becomes more difficult as the number of statements and interacting variables increases. Tracing a program with hundreds of variables requires keeping track of hundreds of values, as any statement might affect or be affected by any variable. To do so for hundreds or thousands of statements or more is untenable. With modular programming and our guiding principle of keeping the scope of variables local to the extent possible, we severely restrict the number of possibilities that we have to consider when debugging. Equally important is the idea of a contract between client and implementation. Once we are satisfied that an implementation is meeting its end of the bargain, we can debug all its clients under that assumption.
Code reuse
Once we have implemented libraries such as StdStats and StdRandom, we do not have to worry about writing code to compute averages or standard deviations or to generate random numbers again—we can simply reuse the code that we have written. Moreover, we do not need to make copies of the code: any module can just refer to any public method in any other module.
Maintenance
Like a good piece of writing, a good program can always be improved, and modular programming facilitates the process of continually improving your Java programs because improving a module improves all of its clients. For example, it is normally the case that there are several different approaches to solving a particular problem. With modular programming, you can implement more than one and try them independently. More importantly, suppose that while developing a new client, you find a bug in some module. With modular programming, fixing that bug essentially fixes bugs in all of the module’s clients.
IF YOU ENCOUNTER AN OLD PROGRAM (or a new program written by an old programmer!), you are likely to find one huge module—a long sequence of statements, stretching to several pages or more, where any statement can refer to any variable in the program. Old programs of this kind are found in critical parts of our computational infrastructure (for example, some nuclear power plants and some banks) precisely because the programmers charged with maintaining them cannot even understand them well enough to rewrite them in a modern language! With support for modular programming, modern languages like Java help us avoid such situations by separately developing libraries of methods in independent classes.
The ability to share static methods among different files fundamentally extends our programming model in two different ways. First, it allows us to reuse code without having to maintain multiple copies of it. Second, by allowing us to organize a program into files of manageable size that can be independently debugged and compiled, it strongly supports our basic message: whenever you can clearly separate tasks within a program, you should do so.
In this section, we have supplemented the Std* libraries of SECTION 1.5 with several other libraries that you can use: Gaussian, StdArrayIO, StdRandom, and StdStats. Furthermore, we have illustrated their use with several client programs. These tools are centered on basic mathematical concepts that arise in any scientific project or engineering task. Our intent is not just to provide tools, but also to illustrate that it is easy to create your own tools. The first question that most modern programmers ask when addressing a complex task is “Which tools do I need?” When the needed tools are not conveniently available, the second question is “How difficult would it be to implement them?” To be a good programmer, you need to have the confidence to build a software tool when you need it and the wisdom to know when it might be better to seek a solution in a library.
After libraries and modular programming, you have one more step to learn a complete modern programming model: object-oriented programming, the topic of CHAPTER 3. With object-oriented programming, you can build libraries of functions that use side effects (in a tightly controlled manner) to vastly extend the Java programming model. Before moving to object-oriented programming, we consider in this chapter the profound ramifications of the idea that any method can call itself (in SECTION 2.3) and a more extensive case study (in SECTION 2.4) of modular programming than the small clients in this section.