A Full Example
Let’s put some meat on these bones. Suppose that you’re interested in the relationship between body weight and LDL cholesterol levels. You have a simple, straightforward hypothesis that, other things being equal, there is a direct relationship between a person’s body weight and his or her LDL level. To take advantage of the tools that quap gives you, you’ll need an alist, one that looks something like the following code:
library(rethinking) adult.weight <- read.csv("Sample Weight Data.csv")
The read.csv statement attempts to open the file named Sample Weight Data.csv in the working directory. You can include the file’s path in the argument to read.csv; if you handle it that way, keep in mind that R uses forward slashes, not back slashes, to delimit folder names in file addresses. Or, you could save the data file in what you know to be R’s current working directory.
sample.mean.wt <- mean(adult.weight$Weight) ldl.model <- quap( alist( LDL ~ dnorm( mu , st.dev.wt ) , mu <- alpha + beta *( adult.weight$Weight - sample.mean.wt ) ,
I have given these two variables in mu’s definition the names of alpha and beta, because that’s how they are normally referred to in the literature on simple (i.e., not multiple: only one predictor) regression: alpha is the intercept and beta is the regression coefficient.
Now we need to establish the central tendency and the spread of the alpha and beta priors. We can tell quap that alpha, the equation’s intercept, has a mean of 20 and a standard deviation of 20:
alpha ~ dnorm( 20 , 20 ) ,
and that beta, the regression coefficient, has a mean of 0 and a standard deviation of 1:
beta ~ dnorm( 0 , 1 ) ,
and that the standard deviation of body weight follows a uniform distribution with a mean of 0 and its own standard deviation of 50:
st.dev.wt ~ dunif( 0 , 50 ) ) , data = adult.weight) precis(ldl.model)
And you’ll need a set of observations that are stored in the csv file Sample Weight Data.csv. They are temporarily stored by R in the structure named adult.weight. Here are the first few observations in adult.weight. Note that the first row of the csv file contains field names. The read.csv’s header argument is set to True when the table’s first row contains one fewer field name than the table’s number of columns. (This is often true when the first column contains row numbers but is not the case here.)
Weight |
LDL |
165 |
37 |
118 |
48 |
114 |
61 |
117 |
55 |
108 |
33 |
And here are the results of running the code, shown in precis form:
> precis(ldl.model) |
||||
mean |
sd |
5.50% |
94.50% |
|
alpha |
58.27 |
1.63 |
55.6 |
60.8 |
beta |
0.06 |
0.07 |
-0.03 |
0.2 |
st.dev.wt |
11.53 |
1.16 |
9.68 |
13.38 |
You can compare R’s results to Excel’s by running LINEST. The LINEST function (entered normally in current Excel versions, by selecting a single cell and using Enter rather than Ctrl+Shift+Enter) is
=LINEST(B2:B51,A2:A51-AVERAGE(A2:A51),,TRUE)
Here are the results of running the Excel LINEST function:
0.064249 |
58.52 |
0.070996 |
1.655082 |
0.016775 |
11.7032 |
0.818952 |
48.0 |
112.1677 |
6574.312 |
The LINEST results require some mapping:
The value in LINEST’s first row and rightmost column (here, 58.52) is always the equation’s intercept. Notice that the quap model returns a value of 58.27 (second row, second column of the precis summary). The two values are quite close, and the difference is easily attributable to sampling error in the quap model.
The value in LINEST’s first row and leftmost column (here, 0.064) is always the final regression coefficient. In this case, because we have called for one coefficient only, it is also the equation’s first and only coefficient.
The value in LINEST’s second row and rightmost column (here, 1.655) is the standard error of the intercept. It is always in that cell, of LINEST’s results, directly below the intercept. Its value is quite close to that returned by precis in its third column, second row, 1.63.
The value in LINEST’s second row and leftmost column (here, 0.071) is the standard error of the regression coefficient. Values in the second row of LINEST results are always the standard error of the statistic in the same column, first row.
The value in LINEST’s third row and rightmost column (here, 11.7) is the standard error of estimate, and it is quite close to the precis estimate of 11.53. Suppose that you took all the observations at a given value of the predictor and found the standard deviation of the difference between their actual and the predicted values on the predicted variable. That’s the standard error of estimate, and it helps you decide whether the prediction equation is more accurate at some levels of the predictor than at others.
Compare the results of the regression analysis as returned by Excel with those returned by quap via precis. It’s clear that where the two approaches return the same analyses (e.g., intercepts, coefficients, standard errors), the Bayesian approach and the frequentist approach are either identical or very nearly so.
And you can get those results without risking the slippery slope of multicollinearity. Which makes this a good point to go further into multiple regression.