Introduction to Statistical Analysis: Microsoft Excel 2013
There was no reason I shouldn’t have already written a book about statistical analysis using Excel. But I didn’t, although I knew I wanted to. Finally, I talked Pearson into letting me write it for them.
Be careful what you ask for. It’s been a struggle, but at last I’ve got it out of my system, and I want to start by talking here about the reasons for some of the choices I made in writing this book.
Using Excel for Statistical Analysis
The problem is that it’s a huge amount of material to cover in a book that’s supposed to be only 400 to 500 pages. The text used in the first statistics course I took was about 600 pages, and it was purely statistics, no Excel. In 2001, I co-authored a book about Excel (no statistics) that ran to 750 pages. To shoehorn statistics and Excel into 400 pages or so takes some picking and choosing.
Furthermore, I did not want this book to be an expanded Help document, like one or two others I’ve seen. Instead, I take an approach that seemed to work well in an earlier book of mine, Business Analysis with Excel. The idea in both that book and this one is to identify a topic in statistical (or business) analysis; discuss the topic’s rationale, its procedures, and associated issues; and only then get into how it’s carried out in Excel.
You shouldn’t expect to find discussions of, say, the Weibull function or the lognormal distribution here. They have their uses, and Excel provides them as statistical functions, but my picking and choosing forced me to ignore them—at my peril, probably—and to use the space saved for material on more bread-and-butter topics such as statistical regression.
About You and About Excel
How much background in statistics do you need to get value from this book? My intention is that you need none. The book starts out with a discussion of different ways to measure things—by category, such as models of cars; by rank, such as first place through tenth; and by number, such as degrees Fahrenheit—and how Excel handles those methods of measurement in its worksheets and its charts.
This book moves on to basic statistics, such as averages and ranges, and only then to intermediate statistical methods such as t-tests, multiple regression, and the analysis of covariance. The material assumes knowledge of nothing more complex than how to calculate an average. You do not need to have taken courses in statistics to use this book.
As to Excel itself, it matters little whether you’re using Excel 97, Excel 2013, or any version in between. Very little statistical functionality changed between Excel 97 and Excel 2003. The few changes that did occur had to do primarily with how functions behaved when the user stress-tested them using extreme values or in very unlikely situations.
The Ribbon showed up in Excel 2007 and is still with us in Excel 2013. But nearly all statistical analysis in Excel takes place in worksheet functions—very little is menu driven—and there was almost no change to the function list, function names, or their arguments between Excel 97 and Excel 2007. The Ribbon does introduce a few differences, such as how to get a trendline into a chart. This book discusses the differences in the steps you take using the traditional menu structure and the steps you take using the Ribbon.
In Excel 2010, several apparently new statistical functions appeared, but the differences were more apparent than real. For example, through Excel 2007, the two functions that calculate standard deviations are STDEV() and STDEVP(). If you are working with a sample of values, you should use STDEV(), but if you happen to be working with a full population, you should use STDEVP(). Of course, the P stands for population.
Both STDEV() and STDEVP() remain in Excel 2010 and 2013, but they are termed compatibility functions. It appears that they may be phased out in some future release. Excel 2010 added what it calls consistency functions, two of which are STDEV.S() and STDEV.P(). Note that a period has been added in each function’s name. The period is followed by a letter that, for consistency, indicates whether the function should be used with a sample of values or a population of values.
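To make the parallel concrete, suppose the values you care about occupy a hypothetical worksheet range A1:A20. Each of these pairs of formulas returns an identical result, the first member of each pair using the compatibility function and the second its consistency replacement:

=STDEV(A1:A20) and =STDEV.S(A1:A20), for the standard deviation of a sample

=STDEVP(A1:A20) and =STDEV.P(A1:A20), for the standard deviation of a population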
Other consistency functions were added to Excel 2010, and the functions they are intended to replace are still supported in Excel 2013. There are a few substantive differences between the compatibility version and the consistency version of some functions, and this book discusses those differences and how best to use each version.
Clearing Up the Terms
Terminology poses another problem, both in Excel and in the field of statistics (and, it turns out, in the areas where the two overlap). For example, it’s normal to use the word alpha in a statistical context to mean the probability that you will decide that there’s a true difference between the means of two groups when there really isn’t. But Excel extends alpha to usages that are related but much less standard, such as the probability of getting some number of heads from flipping a fair coin. It’s not wrong to do so. It’s just unusual, and therefore it’s an unnecessary hurdle to understanding the concepts.
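Excel’s BINOM.INV() function is one instance of that kind of usage: its third argument is named alpha, although it is nothing more than a criterion probability. For example, the formula

=BINOM.INV(10, 0.5, 0.95)

returns the smallest number of heads from ten flips of a fair coin for which the cumulative binomial probability is at least 95% (which turns out to be 8).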
The vocabulary of statistics itself is full of names that mean very different things in slightly different contexts. The word beta, for example, can mean the probability of deciding that a true difference does not exist, when it does. It can also mean a coefficient in a regression equation (for which Excel’s documentation unfortunately uses the letter m), and it’s also the name of a distribution that is a close relative of the binomial distribution. None of that is due to Excel. It’s due to having more concepts than there are letters in the Greek alphabet.
You can see the potential for confusion. It gets worse when you hook Excel’s terminology up with that of statistics. For example, in Excel the word cell means a rectangle on a worksheet, the intersection of a row and a column. In statistics, particularly the analysis of variance, cell usually means a group in a factorial design: If an experiment tests the joint effects of sex and a new medication, one cell might consist of men who receive a placebo, and another might consist of women who receive the medication being assessed. Unfortunately, you can’t depend on seeing “cell” where you might expect it: within cell error is called residual error in the context of regression analysis.
So this book presents you with some terms you might otherwise find redundant: where there’s any possibility of confusion about which I mean, I use design cell in analysis contexts and worksheet cell when referring to the software.
For consistency, though, I try always to use alpha rather than Type I error or statistical significance, and in general I use just one term for a given concept throughout. When the terminology itself invites confusion, I point it out: when mean square doesn’t mean mean square, you ought to know about it.
Making Things Easier
If you’re just starting to study statistical analysis, your timing’s much better than mine was. You have avoided some of the obstacles to understanding statistics that once—as recently as the 1980s—stood in the way. I’ll mention those obstacles once or twice more in this book, partly to vent my spleen but also to stress how much better Excel has made things.
Suppose that 25 years ago you were calculating something as basic as the standard deviation of twenty numbers. You had no access to a computer. Or, if there was one around, it was a mainframe or a mini, and whoever owned it had more important uses for it than to support a Psychology 101 assignment.
So you trudged down to the Psych building’s basement, where there was a room filled with gray metal desks with adding machines on them. Some of the adding machines might even have been plugged into a source of electricity. You entered your twenty numbers very carefully because the adding machines did not come with Undo buttons or Ctrl+Z. The electricity-enabled machines were in demand because they had a memory function that allowed you to enter a number, square it, and add the result to what was already in the memory.
It could take half an hour to calculate the standard deviation of twenty numbers. It was all incredibly tedious and it distracted you from the main point, which was the concept of a standard deviation and the reason you wanted to quantify it.
Of course, 25 years ago our teachers were telling us how lucky we were to have adding machines instead of having to use paper, pencil, and a box of erasers.
Things are different in 2013, and truth be told, they have been changing since the mid-1980s, when applications such as Lotus 1-2-3 and Microsoft Excel started to find their way onto personal computers’ floppy disks. Now, all you have to do is enter the numbers into a worksheet—or maybe not even that, if you downloaded them from a server somewhere. Then, type =STDEV.S( and drag across the cells with the numbers before you press Enter. It takes half a minute at most, not half an hour at least.
Several statistics have relatively simple definitional formulas. The definitional formula tends to be straightforward and therefore gives you actual insight into what the statistic means. But those same definitional formulas often turn out to be difficult to manage in practice if you’re using paper and pencil, or even an adding machine or hand calculator. Rounding errors occur and compound one another.
So statisticians developed computational formulas. These are mathematically equivalent to the definitional formulas, but are much better suited to manual calculations. Although it’s nice to have computational formulas that ease the arithmetic, those formulas make you take your eye off the ball. You’re so involved with accumulating the sum of the squared values that you forget that your purpose is to understand how values vary around their average.
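To make the distinction concrete, here is a sketch using the sample variance as the example statistic (the standard deviation is just its square root). The definitional formula works directly with each value’s deviation from the mean:

$$ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} $$

The computational formula is algebraically equivalent, but it needs only two running totals, the sum of the values and the sum of their squares:

$$ s^2 = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1} $$

You can see why the second version was kinder to someone working an adding machine, and also why it hides the idea of variation around an average behind a pile of accumulated squares.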
That’s one primary reason that an application such as Excel, or an application specifically and solely designed for statistical analysis, is so helpful. It takes the drudgery of the arithmetic off your hands and frees you to think about what the numbers actually mean.
Statistics is conceptual. It’s not just arithmetic. And it shouldn’t be taught as though it is.
The Wrong Box?
But should you even be using Excel to do statistical calculations? After all, people have been moaning about inadequacies in Excel’s statistical functions for twenty years. The Excel forum on CompuServe had plenty of complaints about this issue, as did the Usenet newsgroups. As I write this introduction, I can switch from Word to Firefox and see that some people are still complaining on Wikipedia talk pages, and others contribute angry screeds to publications such as Computational Statistics & Data Analysis, which I believe are there as a reminder to us all of the importance of taking our prescription medication.
I have sometimes found myself as upset about problems with Excel’s statistical functions as anyone. And it’s true that Excel has had, and in some cases continues to have, problems with the algorithms it uses to manage certain functions such as the inverse of the F distribution. But most of the complaints that are voiced fall into one of two categories: those that are based on misunderstandings about either Excel or statistical analysis, and those that claim Excel isn’t accurate enough.
If you read this book, you’ll be able to avoid those kinds of misunderstandings. As to inaccuracies in Excel results, let’s look a little more closely at that. The complaints are typically along these lines:
I enter into an Excel worksheet two different formulas that should return the same result. Simple algebraic rearrangement of the equations proves that. But then I find that Excel calculates two different results.
Well, for the data the user supplied, the results differ at the fifteenth decimal place, so Excel’s results disagree with one another by approximately five in 111 trillion.
Or this:
I tried to get the inverse of the F distribution using the formula FINV(0.025,4198986,1025419), but I got an unexpected result. Is there a bug in FINV?
No. Once upon a time, FINV returned the #NUM! error value for those arguments, but no longer. However, that's not the point. With so many degrees of freedom (over four million and one million, respectively), the person who asked the question was effectively dealing with populations, not samples. To use that sort of inferential technique with so many degrees of freedom is a striking instance of "unclear on the concept."
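For comparison, a call with degrees of freedom you might actually meet in practice would look something like =FINV(0.025, 3, 16), or =F.INV.RT(0.025, 3, 16) if you prefer the consistency function that arrived with Excel 2010. Either formula returns the F value that cuts off the upper 2.5% of the F distribution with 3 and 16 degrees of freedom; the particular arguments here are just for illustration.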
Would it be better if Excel's math were more accurate—or at least more internally consistent? Sure. But even the finger-waggers admit that Excel's statistical functions are at least acceptable, as the following comment shows.
They can rarely be relied on for more than four figures, and then only for 0.001 < p < 0.999, plenty good for routine hypothesis testing.
Now look. Chapter 6, "Telling the Truth with Statistics," goes into this issue further, but the point deserves a better soapbox, closer to the start of the book. Regardless of the accuracy of a statement such as "They can rarely be relied on for more than four figures," it's pointless to make it. It's irrelevant whether a finding is "statistically significant" at the 0.001 level instead of the 0.005 level, and to worry about whether Excel can successfully distinguish between the two findings is to miss the context.
There are many possible explanations for a research outcome other than the one you’re seeking: a real and replicable treatment effect. Random chance is only one of these. It’s one that gets a lot of attention because we attach the word significance to our tests to rule out chance, but it’s not more important than other possible explanations you should be concerned about when you design your study. It’s the design of your study, and how well you implement it, that allows you to rule out alternative explanations such as selection bias and disproportionate dropout rates. Bias and dropout rates are just two of many alternative explanations that can make a treatment look as though it had an effect when it actually didn’t.
Even the strongest design doesn’t enable you to rule out a chance outcome. But if the design of your study is sound, and you obtained what looks like a meaningful result, you’ll want to control chance’s role as an alternative explanation of the result. So, you certainly want to run your data through the appropriate statistical test, which does help you control the effect of chance.
If you get a result that doesn’t clearly rule out chance—or rule it in—you’re much better off to run the experiment again than to take a position based on a borderline outcome. At the very least, it’s a better use of your time and resources than to worry in print about whether Excel’s F tests are accurate to the fifth decimal place.
Wagging the Dog
And ask yourself this: Once you reach the point of planning the statistical test, are you going to reject your findings if they might come about by chance five times in 1,000? Is that too loose a criterion? What about just one time in 1,000? How many angels are on that pinhead anyway?
If you’re concerned that Excel won’t return the correct distinction between one and five chances in 1,000 that the result of your study is due to chance, you allow what’s really an irrelevancy to dictate how, and using what calibrations, you’re going to conduct your statistical analysis. It’s pointless to worry about whether a test is accurate to one point in a thousand or two in a thousand. Your decision rules for risking a chance finding should be based on more substantive grounds.
Chapter 9, “Testing Differences Between Means: Further Issues,” goes into the matter in greater detail, but a quick summary of the issue is that you should let the risk of making the wrong decision be guided by the costs of a bad decision and the benefits of a good one—not by which criterion appears to be the more selective.
What's in This Book
You’ll find that there are two broad types of statistics. I’m not talking about that scurrilous line about lies, damned lies and statistics—both its source and its applicability are disputed. I’m talking about descriptive statistics and inferential statistics.
Even if you’ve never studied statistics before, you’re already familiar with concepts such as averages and ranges. These are descriptive statistics. They describe identified groups: The average age of the members is 42 years; the range of the weights is 105 pounds; the median price of the houses is $270,000. A variety of other sorts of descriptive statistics exists, such as standard deviations, correlations, and skewness. The first five chapters of this book take a fairly close look at descriptive statistics, and you might find that they have some aspects that you haven’t considered before.
Descriptive statistics give you insight into the characteristics of a restricted set of beings or objects. They can be interesting and useful, and they have some properties that aren’t at all well known. But you don’t get a better understanding of the world from descriptive statistics. For that, it helps to have a handle on inferential statistics. That sort of analysis is based on descriptive statistics, but you are asking and perhaps answering broader questions. Questions such as this:
The average systolic blood pressure in this group of patients is 135. How large a margin of error must I report so that if I took another 99 samples, 95 of the 100 would capture the true population mean within margins calculated similarly?
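In Excel terms, answering that question might come down to something like the following formula, assuming the readings occupy a hypothetical worksheet range A1:A25:

=CONFIDENCE.T(0.05, STDEV.S(A1:A25), COUNT(A1:A25))

That formula, available in Excel 2010 and later, returns the 95% margin of error around the sample mean; add it to and subtract it from the mean and you have the interval to report.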
Inferential statistics enables you to make inferences about a population based on samples from that population. As such, inferential statistics broadens the horizons considerably.
Therefore, I have prepared two new chapters on inferential statistics for this 2013 edition of Statistical Analysis: Microsoft Excel. Chapter 12, "Experimental Design and ANOVA," explores the effects of fixed versus random factors on the nature of your F tests. It also examines crossed and nested factors in factorial designs, and how a factor's status in a factorial design affects the mean square you should use in the F ratio's denominator.
I have also expanded coverage of the topic of statistical power, and this edition devotes an entire chapter to it. Chapter 13, "Statistical Power," discusses how to use Excel's worksheet functions to generate F distributions with different noncentrality parameters. (Excel's native F-distribution functions all assume a noncentrality parameter of zero.) You can use this capability to calculate the power of an F test without resorting to 80-year-old charts.
But you have to take on some assumptions about your samples, and about the populations that your samples represent, to make the sort of generalization that inferential statistics makes available to you. From Chapter 6 through the end of this book, you'll find discussions of the issues involved, along with examples of how those issues work out in practice. And, by the way, how you work them out using Microsoft Excel.