Matrix Algebra Methods
Suppose that you took regression's job as your own, in a situation that called for you to predict the value of an outcome variable from three predictor variables, named Var 1, Var 2, and Var 3. You might decide to declare, by fiat, that each predictor variable should be multiplied by a regression coefficient of 1. Then the regression equation would look like this:
(1 * Var 1) + (1 * Var 2) + (1 * Var 3) = Predicted variable
There is nothing to prevent you from doing that, but it's wildly unlikely that the coefficients you chose by fiat, a sequence of 1s, will predict the outcome as accurately as the coefficients that a least-squares analysis would calculate for you. Nevertheless, you will have satisfied a basic requirement of regression analysis: a set of predictor variables, each multiplied by its regression coefficient, with the products added together to create a new, composite variable.
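If it helps to see that arithmetic spelled out, here is a minimal sketch in Python, with made-up values for the three predictors; it simply multiplies each predictor by the coefficient of 1 and sums the products into the composite variable:

import numpy as np

# Made-up values for the three predictor variables
var1 = np.array([2.0, 5.0, 3.0, 8.0])
var2 = np.array([1.0, 4.0, 6.0, 2.0])
var3 = np.array([7.0, 3.0, 5.0, 9.0])

# Each predictor gets the coefficient of 1 declared by fiat;
# the products are summed to form the new, composite variable
coefficients = np.array([1.0, 1.0, 1.0])
predictors = np.column_stack([var1, var2, var3])
predicted = predictors @ coefficients    # same as 1*var1 + 1*var2 + 1*var3

print(predicted)    # [10. 12. 14. 19.]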
For years, statistical packages such as Systat, and even more general-purpose applications such as Excel, used matrix algebra to solve regression's normal equations. Those procedures failed when they were presented with data sets that involved severe multicollinearity, which comes about when two or more predictor variables in a regression equation are strongly, or even perfectly, correlated.
When this situation occurs, it can throw the results of the matrix algebra off course. Take apart the matrix components of a multiple regression and you find that the process involves calculating the sums of squares and cross products (SSCP) matrix and then inverting it. If the values in one of the fields of the original data matrix are a linear function of the values in another field, the inverse of the SSCP matrix cannot be calculated, because the determinant of a matrix with that sort of linear dependency is zero.
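A short Python sketch, using a made-up data matrix, shows where the breakdown occurs. The third predictor is constructed as an exact multiple of the first, so the SSCP matrix falls short of full rank and the inversion step of the normal-equations solution has nowhere to go:

import numpy as np

# Made-up data matrix: a column of 1s for the intercept plus three predictors.
# The third predictor is an exact linear function of the first (Var 3 = 2 * Var 1),
# so the predictors are perfectly multicollinear.
X = np.array([
    [1.0, 2.0, 5.0,  4.0],
    [1.0, 3.0, 1.0,  6.0],
    [1.0, 4.0, 7.0,  8.0],
    [1.0, 5.0, 2.0, 10.0],
])
y = np.array([10.0, 12.0, 15.0, 18.0])

# The sums of squares and cross products (SSCP) matrix is X'X
sscp = X.T @ X

# The linear dependency carries through: the SSCP matrix is one short of full rank
# and its determinant is zero
print(np.linalg.matrix_rank(sscp))    # 3, not 4
print(np.linalg.det(sscp))            # zero: the determinant vanishes

# The normal-equations solution b = (X'X)^-1 X'y therefore breaks down.
# NumPy either refuses to invert the matrix or, at best, returns coefficients
# that are nothing but amplified rounding error.
try:
    b = np.linalg.inv(sscp) @ X.T @ y
    print(b)                          # meaningless, wildly inflated values
except np.linalg.LinAlgError as err:
    print("Cannot invert the SSCP matrix:", err)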
This problem was known in the waning years of the previous century, but it went unfixed, largely because it took an unusual sequence of events for the problem to arise. Furthermore, the user who encountered the problem got an error warning, sometimes in the form of a lengthy text message, sometimes in a form such as Excel's #NUM! error value. So the user had an opportunity to recognize that an infrequent error had occurred and to fix it in the data file.
But users did not like knowing that a problem, however unusual, remained in their software, so developers substituted an approach called QR decomposition for the existing matrix algebra. It's the approach that you find in Excel and other numeric analysis packages even as late as this book's publication in 2022.
However, QR decomposition does not truly fix the multicollinearity problem, because multicollinearity is not a strictly either/or condition. When one field is a nearly, rather than exactly, perfect linear function of another, rounding errors can creep into the calculations, and those errors can reduce the accuracy of the analysis results.
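To see how those rounding errors arise, here is a sketch, again in Python with simulated data, of the QR route to the least-squares coefficients. One predictor is a nearly perfect, but not exact, linear function of another; the decomposition itself goes through, but one diagonal element of R is tiny, and dividing by it during back-substitution inflates rounding error into wildly unstable coefficients:

import numpy as np

rng = np.random.default_rng(0)

# Simulated predictors: var2 is a nearly perfect linear function of var1
n = 50
var1 = rng.normal(size=n)
var2 = 2.0 * var1 + rng.normal(scale=1e-9, size=n)    # near-perfect collinearity
X = np.column_stack([np.ones(n), var1, var2])
y = 3.0 + 1.5 * var1 + rng.normal(size=n)

# Least squares by way of QR decomposition: X = QR, then solve R b = Q'y
Q, R = np.linalg.qr(X)
b = np.linalg.solve(R, Q.T @ y)

# The near-dependency shows up as a tiny diagonal element of R; dividing by it
# during back-substitution is what blows rounding error up into the coefficients
print(np.abs(np.diag(R)))
print(b)    # the var1 and var2 coefficients are enormous and unstable

The linear combination of the predictors can still track y reasonably well; it's the individual coefficients, and their standard errors, that become untrustworthy.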
Some software publishers have adopted the reasonable solution of displaying a zero instead of a calculated regression coefficient when QR decomposition detects the presence of multicollinearity. This has the effect—possibly useful, possibly disastrous—of eliminating the associated field from the regression equation. Depending on the nature of the linear function, the regression software might set both the regression coefficient and its standard error to zero.
For the time being, though, let’s shift our attention to some of the critical elements of the quap function.