Case Study: Calculating Simple Statistics
The case studies throughout this book show C++ being used for practical purposes. In Chapter 2 you saw that the library functions provided are biased toward scientific and engineering applications, but it's not difficult to write your own functions for statistical analysis. The same kind of data analysis that you would use on meteorological records can be used to look at stock prices, because they both involve values that vary over time. In the following two case studies I particularly want to show you how the standard algorithms can make processing data easier. Appendix B, "A Short Library Reference," will give you more information about the algorithms available.
In this first case study, we will download historical data that gives the performance of the S&P 500 over the past year, using this format: the date, the name of the stock or ticker symbol, the opening price, the highest price in the day, the lowest price in the day, the final price, and the trading volume. The date is in the same order as the ISO standard, but without the hyphens. Everything is separated by commas, which is convenient for spreadsheets, although not for program input/output. All of the stock data is bundled together in one big 5MB file, so you will need to extract the data for a particular ticker symbol. I have put some sample data on the accompanying CD-ROM, but you can get up-to-date historical data for the last year from http://biz.swcp.com/stocks/ (click the "Get Full Set" button). The C++ source code for this case study is in chap3\stats.cpp, and the full-year set of data is called SP500HST.TXT. A typical line of the file looks like this:
20000710,ABBA,67.75,70.25,67.6875,68.25,38349
The first task is to extract the stock of interest. Along the way, you will replace all commas with spaces. Although this is not a difficult operation to write, it has already been done for you to save some time. The replace() standard algorithm goes through any sequence, performing the required replacement. As you saw earlier in the chapter, strings are array-like, and although strings are not full containers, the basic algorithms will still work on them. Here is replace() in action:
;> int a[] = {1,2,0,23,0};
;> show_arr(a,5);
1 2 0 23 0
;> replace(a,a+5,0,-1);
;> show_arr(a,5);
1 2 -1 23 -1
;> string s = "a line from a song";
;> replace(s.begin(),s.end(),' ','-');
;> s;
(string) s = 'a-line-from-a-song'
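If you prefer to try this in a compiled program rather than at the interactive prompt, the same call can be wrapped in a small function. This is only a sketch; the name commas_to_spaces() is my own and does not appear in stats.cpp.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not part of stats.cpp): returns a copy of `line`
// with every comma replaced by a space, so that the fields can later be
// read with the ordinary >> extraction operator.
std::string commas_to_spaces(std::string line) {
    std::replace(line.begin(), line.end(), ',', ' ');
    return line;
}
```

Passing the sample line shown earlier to commas_to_spaces() gives back the same seven fields separated by spaces. The argument is taken by value, so the caller's string is left unchanged.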
Now that you have seen replace() in action, I can show you the function extract_stock(), which is given the stock ticker symbol. The first part reads each line until the line contains the symbol, or it runs out of data. If the symbol was found, then the second part reads each line, replaces commas with spaces, and writes the line out to another file. This continues as long as the line contains the ticker symbol.
bool extract_stock(string ofile,string stock)
{
  ifstream in(SFILE.c_str());
  string line;
  do {
    getline(in,line);
  } while (! in.eof() && line.find(stock) == string::npos);
  if (! in.eof()) { // found our stock!
    ofstream out(ofile.c_str());
    do {
      replace(line.begin(),line.end(),',',' ');
      out << line << endl;
      getline(in,line);
    } while (! in.eof() && line.find(stock) != string::npos);
    return true;
  } else return false;
}

;> extract_stock("yum.txt","YUM");
(bool) true
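The same two-phase idea can be tried out without any files by working on streams. The sketch below is a variation of my own, not the code in stats.cpp: the name extract_block() is hypothetical, it uses the result of getline() as the loop condition instead of testing eof(), and it searches for the symbol with a comma on each side so that a short symbol cannot match part of a longer one. It assumes, as extract_stock() does, that all the lines for one symbol are grouped together.

```cpp
#include <algorithm>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stream-based variant of extract_stock(). Copies the first
// contiguous run of lines for `symbol` from `in` to `out`, replacing
// commas with spaces. Returns the number of lines copied.
int extract_block(std::istream& in, std::ostream& out, const std::string& symbol) {
    const std::string field = "," + symbol + ",";   // match the whole ticker field
    std::string line;
    int copied = 0;
    while (std::getline(in, line)) {
        if (line.find(field) == std::string::npos) {
            if (copied > 0) break;   // the run for this symbol has ended
            continue;                // still looking for the first match
        }
        std::replace(line.begin(), line.end(), ',', ' ');
        out << line << '\n';
        ++copied;
    }
    return copied;
}
```

Because the function takes std::istream and std::ostream references, you can pass it an ifstream and an ofstream for real data, or an istringstream and an ostringstream when experimenting.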
After a few seconds, there is a file YUM.TXT containing the S&P 500 data for the symbol YUM, without commas. Next, you can easily read the values into some vectors. You don't know precisely how many trading days there were in the last 12-month period, so push_back() is useful because it lets the vectors grow as the data is read:
typedef vector<double> V;
typedef V::iterator IV;

bool read_any_stock(string data_file, V& oprices, V& lprices,
                    V& hprices, V& fprices, V& volumes)
{
  ifstream in(data_file.c_str());
  if (! in) return false;   // could not open the file
  double oprice,lprice,hprice,fprice,vol;
  string date,stock;
  // fields follow the file order: open, high, low, final, volume
  while (in >> date >> stock >> oprice >> hprice >> lprice >> fprice >> vol) {
    oprices.push_back(oprice);
    lprices.push_back(lprice);
    hprices.push_back(hprice);
    fprices.push_back(fprice);
    volumes.push_back(vol);
  }
  return true;
}

V open_price, low_price, high_price, final_price, volume;

bool read_stock(string dfile)
{
  open_price.clear();
  low_price.clear();
  high_price.clear();
  final_price.clear();
  volume.clear();
  return read_any_stock(dfile, open_price,low_price,high_price,final_price,volume);
}
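An alternative to five parallel vectors is to keep each trading day together in one record. The following is only a sketch of that idea, not the approach used in stats.cpp; the struct Quote and the function read_quotes() are names I have made up for illustration. It assumes the commas have already been replaced with spaces and that the fields appear in the order described earlier (open, high, low, final, volume).

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: one trading day for one ticker symbol.
struct Quote {
    std::string date, symbol;
    double open, high, low, close, volume;
};

// Reads whitespace-separated records until an extraction fails.
// Returns the number of records appended to `quotes`.
std::size_t read_quotes(std::istream& in, std::vector<Quote>& quotes) {
    std::size_t count = 0;
    Quote q;
    while (in >> q.date >> q.symbol >> q.open >> q.high
              >> q.low >> q.close >> q.volume) {
        quotes.push_back(q);   // only complete records are stored
        ++count;
    }
    return count;
}
```

The parallel vectors used in this case study are more convenient for the standard algorithms that follow, because each algorithm wants a plain sequence of double values. A vector of records is easier to pass around, but you would have to copy out one field before calling accumulate() or min_element() in their simplest forms.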
read_any_stock() is awkward to call, so you can define a helper function read_stock() that reads the values into global variables. Now the full year's prices are available for YUM; the first question is what the minimum and maximum prices have been.
The standard algorithms max_element() and min_element() save you the trouble of writing yet another loop to find the minimum and maximum values. These are not difficult operations to code, but they've already been done for your convenience. Note that these algorithms return an iterator that refers to the value and that the dereference operator (*) is needed to get the actual value.
;> read_stock("YUM.txt");
(bool) true
;> IV i1 = low_price.begin(), i2 = low_price.end();
;> *min_element(i1,i2);
(double&) 23.875
;> *max_element(i1,i2);
(double&) 47.64
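One detail is worth spelling out: if the sequence is empty, both algorithms return the end iterator, which must not be dereferenced. A small wrapper can make that assumption explicit. This is a sketch only; price_range() is a hypothetical name, not a function from stats.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: returns (minimum, maximum) of a non-empty vector.
std::pair<double, double> price_range(const std::vector<double>& prices) {
    assert(!prices.empty());   // dereferencing end() would be undefined
    std::vector<double>::const_iterator lo =
        std::min_element(prices.begin(), prices.end());
    std::vector<double>::const_iterator hi =
        std::max_element(prices.begin(), prices.end());
    return std::make_pair(*lo, *hi);
}
```

The result's first member holds the minimum and its second member holds the maximum, so price_range(low_price).first would give the lowest low of the year.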
The most basic statistic is a plain average value, and it is made easy by the accumulate() algorithm, which gives you the sum of the elements of a sequence. To get the average value, you simply need to divide this sum by the number of values:
;> double val = 0;
;> accumulate(i1,i2,val)/low_price.size();
(double) 56.4577
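The type of the third argument matters, because accumulate() does its arithmetic in that type. The session above passes a double variable, which is why the sum keeps its fractional part. Here is a minimal sketch of the same calculation as a function; the name mean() is my own, and returning 0.0 for an empty vector is an arbitrary choice made for this example.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: arithmetic mean, or 0.0 for an empty vector.
double mean(const std::vector<double>& values) {
    if (values.empty()) return 0.0;   // avoid dividing by zero
    // The initial value 0.0 makes accumulate() sum in double;
    // an int literal 0 would truncate every partial sum to an integer.
    double sum = std::accumulate(values.begin(), values.end(), 0.0);
    return sum / values.size();
}
```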
For time-series data like this, a plain average isn't very useful. Analysts are fond of moving averages, which smooth out the spikes and make the trends clearer. The moving average at any point is the average of the values of the neighboring points. The function moving_average() is passed an input vector<double>, a smoothing width, and an output vector<double>. It works as follows: Define an interval with this width (sometimes called a smoothing window) and get the average value. Now move the interval along by one element and repeat. The series of average values generated by this moving interval is the moving average. Note how the interval is specified for accumulate(): like the other standard algorithms, it takes a start iterator and an end iterator that points just past the last element to be summed. The only thing to be careful about is how to handle the interval at both ends of the vector. Here I have used max() and min() to force the interval bounds to lie between 0 and n-1.
void moving_average(const V& vin, int width, V& vout)
{
  int w2 = width/2, n = vin.size();
  V::const_iterator vstart = vin.begin();  // vin is const, so a plain IV won't do
  vout.resize(n);
  for(int i = 0; i < n; i++) {
    int lower = max(0, i-w2);
    int upper = min(n-1,i+w2);
    double val = 0;
    // sums the elements from lower up to, but not including, upper
    val = accumulate(vstart+lower,vstart+upper,val);
    vout[i] = val/(upper - lower);
  }
}

;> V v;
;> moving_average(low_price,10,v);
;> vplot(win,low_price.begin(),low_price.end(),true);
;> vplot(win,v.begin(),v.end(),false);
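In the listing above, the range passed to accumulate() stops just before the element at upper, so a width of 10 gives a window of exactly 10 values in the middle of the vector. If you would rather have a window that is centered on each point and includes both of its end points, one possible variation is sketched below. The name centered_average() is hypothetical, and the sketch assumes that width is not negative.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical variant of moving_average(): the window is centered on
// element i, includes both end points, and is clamped at the ends of
// the vector. Assumes width >= 0.
std::vector<double> centered_average(const std::vector<double>& vin, int width) {
    const int n = static_cast<int>(vin.size());
    const int w2 = width / 2;
    std::vector<double> vout(n);
    for (int i = 0; i < n; ++i) {
        int lower = std::max(0, i - w2);
        int upper = std::min(n - 1, i + w2);   // index of the last element included
        // accumulate() takes a half-open range, hence the + 1
        double sum = std::accumulate(vin.begin() + lower,
                                     vin.begin() + upper + 1, 0.0);
        vout[i] = sum / (upper - lower + 1);   // number of elements summed
    }
    return vout;
}
```

For the input 1 2 3 4 5 and a width of 3, this gives 1.5 2 3 4 4.5: the interior points are averaged with one neighbor on each side, and the two end points are averaged with their single neighbor. Because the window always contains at least the point itself, the divisor can never be zero.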
This example includes some plot-generating calls in stats.cpp because sometimes a picture is worth a thousand words (or is that 4KB?). The averaged vector is indeed much smoother than the raw data (see Figure 3.1). The source code for vplot() is in vplot.cpp. The first call is passed a Boolean argument of true to force vplot() to scale to the data; any subsequent calls pass false so that it reuses the scaling.
Figure 3.1 The moving average is much smoother than the raw data.