Using Principal Components in Excel
Armed with raw data and some freeware, you can use Excel to produce some very interesting results, both numeric and graphic. This article describes a test that I ran, one that looks at the way states in different regions of the United States exhibit similar crime patterns.
The data set I used comes from the 1972 edition of the Statistical Abstract of the United States, which reports crime rates (number of crimes per 100,000 population) in each of the 50 states for 1970 and 1971. The crimes are classified according to seven categories: murder, rape, robbery, assault, burglary, larceny, and auto theft. The report also classifies the states into several regions: New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, and Pacific.
Figure 1 shows a portion of the data, just to give you an idea of what it looks like.
Figure 1 Number of crimes per 100,000 population, 1970
I started by putting the crime rates through a custom Excel routine that calculates principal components. Principal components analysis (PCA) looks for components (also termed factors in factor analysis) that underlie the patterns of correlations among variables such as rates for different types of crimes. It can be more straightforward to examine 2 or 3 components instead of 7 to 10 original variables. Further, the original variables might combine in ways that make what's going on in the data much more clear.
In this case, I ran the data through the code in Factor.xls. Figure 2 shows a portion of the results.
Figure 2 Two principal components emerged from the initial analysis.
One of the goals of PCA is to reduce the number of variables that you have to work with. Ideally, you gain more in simplicity by reducing the number of factors than you lose in information by ignoring some of them.
In PCA there are several different approaches to deciding how many components to retain, including significance tests, scree tests, cross-validation, and size of eigenvalues. (Eigenvalues are a measure of the amount of variability in the original data set that is attributable to each component that PCA extracts.)
In this case, I used Kaiser's recommendation that only components with eigenvalues greater than 1.0 should be retained. As Figure 2 shows, only Factor 1 and Factor 2 meet that criterion, so those are the two I kept.
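To make Kaiser's rule concrete, here is a minimal Python sketch. It is not the code in Factor.xls. It assumes a small hypothetical correlation matrix (three variables that each correlate 0.5 with the others, not the crime data) and uses a simple Jacobi routine, `jacobi_eigenvalues`, a name invented for this illustration, to extract the eigenvalues before applying the greater-than-1.0 cutoff.

```python
import math

def jacobi_eigenvalues(matrix, max_sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        # Stop once the off-diagonal elements are effectively zero
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes the (p, q) element
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Hypothetical correlation matrix, for illustration only
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]

eigenvalues = jacobi_eigenvalues(corr)
# Kaiser's rule: keep only components with eigenvalues greater than 1.0
retained = [ev for ev in eigenvalues if ev > 1.0]
print([round(ev, 4) for ev in eigenvalues], len(retained))
```

For this matrix the eigenvalues are 2.0, 0.5, and 0.5. They total 3, the number of variables, and only the first component passes the cutoff.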
The Factor Score Coefficients in Figure 2 are used to convert records' values on the original variables to factor scores. There are actually a couple of steps (which are done on your behalf by Factor.xls):
- Convert each record's values to z scores. Subtract the variable's mean value from each record's actual value and then divide the result by the standard deviation for the variable. Treat the records as a population: Use STDEVP() instead of STDEV(); or in Excel 2010, use STDEV.P() instead of STDEV.S().
- Multiply the z scores by the factor score coefficients and total the results.
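The first of these steps can be sketched in Python as follows. The numbers are hypothetical values for a single variable, not the crime data, and `z_scores` is a helper name made up for this example. The point is only that the population standard deviation (the counterpart of STDEVP() or STDEV.P()) divides by n rather than n - 1.

```python
from statistics import mean, pstdev, stdev

def z_scores(values):
    """Standardize values using the population standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # population SD: divides by n, like STDEVP()
    return [(v - m) / sd for v in values]

# Hypothetical values for one variable across eight records
rates = [2, 4, 4, 4, 5, 5, 7, 9]

z = z_scores(rates)
print(z)

# The sample SD (like STDEV()) divides by n - 1, so it is slightly larger
print(pstdev(rates), round(stdev(rates), 4))
```

With these values the mean is 5 and the population standard deviation is 2, so the first record's z score is (2 - 5) / 2 = -1.5.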
Here's how it works out for the state of Maine in 1970.
Maine's original values on the seven crime rate variables are as follows:
1.5 7 12.6 62 562 1055 146
Maine's z scores—the original values less the average for each variable, divided by the variable's standard deviation:
-1.4065 -1.1844 -0.9848 -1.0879 -1.0318 -1.2646 -1.1227
The factor score coefficients for the first factor, from Figure 2 (in Figure 2, they are shown in a column but have been transposed here to occupy a row):
0.137 0.209 0.192 0.192 0.216 0.178 0.175
Now multiply the z scores by the factor score coefficients and total the results:
0.137 * -1.4065 + 0.209 * -1.1844 +...+ 0.175 * -1.1227 = -1.483
So Maine has a value of -1.483 on the first factor. I know this looks like a lot of work, but Factor.xls does it for you. It calculates the factor scores for each record and each factor if you give it individual records to work with. (You can supply a full, square correlation matrix instead, but in that case you won't get individual factor scores.) I went through the arithmetic here just so you could see how it takes place.
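If you want to check the arithmetic outside Excel, the same weighted sum can be written as a short Python sketch. The z scores and coefficients are the ones listed above for Maine. The function name `factor_score` is just an illustrative choice, not something taken from Factor.xls.

```python
# Maine's 1970 z scores on the seven crime variables, as listed above
maine_z = [-1.4065, -1.1844, -0.9848, -1.0879, -1.0318, -1.2646, -1.1227]

# Factor score coefficients for the first factor, from Figure 2
factor1_coefs = [0.137, 0.209, 0.192, 0.192, 0.216, 0.178, 0.175]

def factor_score(z_scores, coefficients):
    """Multiply each z score by its coefficient and total the products."""
    if len(z_scores) != len(coefficients):
        raise ValueError("need one coefficient per variable")
    return sum(z * c for z, c in zip(z_scores, coefficients))

maine_factor1 = factor_score(maine_z, factor1_coefs)
print(round(maine_factor1, 3))  # -1.483
```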
Figure 3 shows the factor scores for each state on the first two factors: the ones I decided to keep because both their eigenvalues were greater than 1.0.
Figure 3 just shows a partial list. The full set of state scores is in the Excel Crime.xls workbook.
Figure 3 You can treat components (factors) just like variables. They are simply weighted combinations of the original variables.
Now you're in a position to chart these two "new" variables.
Figure 4 shows the chart.
Figure 4 The way the components are derived means that they aren't correlated.
The chart has two dimensions, each corresponding to one of the factors that were extracted. The way states are distributed in the chart is tantalizing. Notice that states in the south and southeast regions of the United States tend to locate toward the top of the vertical axis (North and South Carolina, Alabama, Georgia, Louisiana, Mississippi, Tennessee, Texas, and Arkansas). States in the western regions tend to locate toward the left, lower end of the horizontal axis (California, Arizona, Colorado, and Nevada). There are two problems, though:
- These are just tendencies. New York, Michigan, Maryland, and Florida, for example, also show up on the left end of the horizontal axis.
- We don't yet know what the dimensions—the factors or components—represent. We should be able to figure that out from how the original variables load on the factors, but a little more manipulation is likely to help.
That manipulation comes in the form of what's called rotation. We want to take the two perpendicular axes in the figure and rotate them, keeping them perpendicular to one another, in a way that makes the loadings more clear.
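As a rough illustration of what such a rotation does, the Python sketch below turns a pair of perpendicular axes through a fixed angle and re-expresses each variable's loadings on the new axes. The loadings and the 30-degree angle are hypothetical, chosen only for illustration. The sketch does not search for the best angle, which is the job of an actual rotation procedure, and it is not the routine in Factor.xls.

```python
import math

def rotate_loadings(loadings, degrees):
    """Re-express (factor 1, factor 2) loadings on axes turned counterclockwise."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    # Both axes turn by the same angle, so they stay perpendicular
    return [(a * c + b * s, -a * s + b * c) for a, b in loadings]

# Hypothetical loadings of three variables on two factors
loadings = [(0.80, 0.40), (0.70, 0.50), (0.30, 0.85)]
rotated = rotate_loadings(loadings, 30)

for (a, b), (ra, rb) in zip(loadings, rotated):
    # Sum of squared loadings before and after the rotation
    print(round(a * a + b * b, 4), round(ra * ra + rb * rb, 4))
```

Because the axes stay perpendicular, each variable's sum of squared loadings (its communality) is the same before and after the rotation. Only the way it is divided between the two factors changes.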
By the way, you might wonder how the chart in Figure 4 comes to show two-character state abbreviations as labels for each data point. It's an Excel scatter chart, and scatter charts can show the plotted values as data labels but not ancillary data such as state codes, at least not automatically.
But if you have your data laid out in a fashion similar to that shown in Figure 3, you can run this simple VBA macro. It establishes data labels, positioned immediately above each data marker. Then it substitutes something such as two-character state codes for the data values that Excel automatically puts in data labels.
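    Sub LabelDataPoints()
        Dim i As Long
        ' Work with the first (and only) chart on the active worksheet
        ActiveSheet.ChartObjects(1).Activate
        ' Turn on data labels, positioned above each data marker
        ActiveChart.SetElement (msoElementDataLabelTop)
        With ActiveChart.SeriesCollection(1)
            ' Rows 2 through 51 hold the 50 labels; point 1 pairs with row 2
            For i = 2 To 51
                .Points(i - 1).DataLabel.Text = Sheet1.Cells(i, 1)
            Next i
        End With
    End Sub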
The code makes some assumptions, but you can alter them to fit your own situation. Or you can make the assumptions alter themselves to fit the situation if you feel comfortable with VBA. In particular, the code I show above assumes the following:
- There's one chart on the active worksheet.
- The chart has one data series.
- There are 50 data points on the chart.
- The data labels you want to use are in A2:A51 on the worksheet whose code name is Sheet1 (in this example, also the active worksheet).