Why are we all teaching ancient statistics?
If we taught molecular biology the way we teach statistics, we'd still be arguing whether proteins or nucleic acids were genetic material.
The statistical tests at the foundation of our teaching, and of most research programs, are century-old technology. Ronald Fisher came up with the ANOVA in 1918, building on the structure of the t-test, which by the way was developed a decade earlier for quality control by an employee of the Guinness Brewery and was first published under the pseudonym “Student.”
Do t-tests and ANOVAs work? Well, yes and no. They work the way their inventors intended them to work. But we now have far more sophisticated ways of evaluating differences among the means of sampled populations than these parametric tests. These more sophisticated approaches are not more theoretically confusing or more mathematically problematic. Approaches using bootstrapping and Bayesian logic are simply more robust and sensible than parametric testing. As I understand it, the only reason we have parametric tests at all is that better approaches didn't fit the limited computational capacity of the time: resampling from limited data just wasn't feasible because computers weren't a thing.
Parametric tests were the state of the art in a world without computers. But now, we all do statistics with computers. When I used to teach biostatistics, we ran some basic statistical tests by hand and with probability tables, but only as a learning tool, to understand what the computer does and how and why statistical tests were invented in the first place. Since we all have access to computers, shouldn't we be adopting conceptually more robust approaches to asking and answering questions, availing ourselves of the computational power readily at hand to take us out of the early 20th century?
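To make that concrete, here's a minimal sketch in R, with made-up data, of how a bootstrap comparison of two group means sidesteps the distributional assumptions of a two-sample t-test:

```r
# A minimal sketch with simulated data: a bootstrap confidence interval for a
# difference in group means, with no assumption that the data are normal.
set.seed(42)
control   <- rnorm(20, mean = 10, sd = 2)   # made-up "control" measurements
treatment <- rnorm(20, mean = 12, sd = 2)   # made-up "treatment" measurements

observed_diff <- mean(treatment) - mean(control)

# Resample each group with replacement many times and recompute the difference
n_boot <- 10000
boot_diffs <- replicate(n_boot, {
  mean(sample(treatment, replace = TRUE)) - mean(sample(control, replace = TRUE))
})

observed_diff
quantile(boot_diffs, c(0.025, 0.975))   # 95% bootstrap interval for the difference
```

A few lines of code, no probability tables, and the inference rests on resampling the data we actually have rather than on an assumed distribution.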
The classic frequentist vs. Bayesian split underlies much of this dilemma, but even if you're a frequentist at heart because of your training or deep understanding of the philosophy of hypothesis testing, I think you have to admit that contemporary research could do a lot better than the t-tests and ANOVAs that many of us are taught, and end up relying on heavily in our own research.
While there are cadres of biostatisticians who have advanced us well into this century, in most subdisciplines I think it's fair to say that much of what they've developed hasn't permeated into common research practice. There are all of these great ideas, tools, and practices that somehow just haven't helped us make the leap past 1950. What's up with that? I suspect it's as simple as this: we do what we were taught, and we teach what we know, and the people who introduce us to statistics as undergraduates and in grad school haven't been inducting us into more contemporary lines of thought because, well, they're probably not using them themselves. And even the folks who are realize they need to teach it the way everybody else does, because it would be unthinkable, and possibly educational malpractice, not to teach basic parametric tests as standard practice. We've got to produce trainees who can navigate the world as it exists.
I'm as much a guilty party as everybody else. While I believe I've stayed marginally up to date, as much as any ecologist who has the benefit of collaborating with data science whizzes who are wholly up to speed on new approaches, I'm not going to be whipping these methods out to run myself anytime soon. I think we are sometimes doing better at data visualization, and there's been a strong movement to emphasize effect sizes instead of p-values as the measure of a response to a treatment or of a difference, but have we really made that much progress? How many folks are familiar with Cohen's d, or odds ratios?
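For what it's worth, reporting an effect size doesn't require anything exotic. Here's another minimal sketch, again with made-up data, of Cohen's d for two groups:

```r
# A minimal sketch with simulated data: Cohen's d, the standardized difference
# between two group means, reported alongside (or instead of) a p-value.
set.seed(1)
group_a <- rnorm(25, mean = 50, sd = 8)
group_b <- rnorm(25, mean = 55, sd = 8)

# Pooled standard deviation for the two groups
pooled_sd <- sqrt(((length(group_a) - 1) * var(group_a) +
                   (length(group_b) - 1) * var(group_b)) /
                  (length(group_a) + length(group_b) - 2))

cohens_d <- (mean(group_b) - mean(group_a)) / pooled_sd
cohens_d   # Cohen's rough benchmarks: ~0.2 small, ~0.5 medium, ~0.8 large
```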
We often teach stats as simple plug-and-play, or, if we do a little better, we design the experiment with the test in mind to maximize its power. But how often do we put the philosophy of science and experimental design up front, so that the quantitative questions we are asking are the ones best suited to what we are looking to learn about the system?
If we have an educational crisis involving the philosophy of science among science majors, then I think the root of that crisis is how we teach and implement statistics, and how we communicate about probability. I realize we can't all simultaneously flip a switch and bring everybody up to speed on better approaches, in part because we'd have a lot of arguing about what constitutes a better approach. So instead, we're settling for what our professors were taught by their professors, who were taught by their professors, who were taught by theirs. This wouldn't be okay in any other part of biology, but in biological statistics, this is somehow okay?
Isn't it weird that we train our students to manage huge amounts of data and process them using a tool as powerful as R, but then have them generate p-values from parametric probability distributions just as Fisher did a century ago? To advance how we teach and do statistics, we need to advance how we think about and design experiments, and how we teach and apply the philosophy of science. To teach statistics better, we need to build stronger bridges with the humanities.
I just spent far too long trying to track down a quote I once saw from R.A. Fisher or Sewall Wright saying that if we had a “computer” machine, we wouldn't have to rely on the limitations of parametric assumptions, because we could use more robust resampling approaches that would be more informative and less subject to biases stemming from assumptions about distributions. If any of you know of this, could you give me a pointer or put it in the comments? It's driving me bananas. Good thing I'm not writing this in my office, or I'd have the chance to waste even more time flipping through Sokal and Rohlf and who knows how many other textbooks where I think I might have seen it.
Terry, I agree in principle that there are "better" approaches than ANOVA etc. for many kinds of data. But in practice, I think we greatly overestimate how often this will make any difference. Yes, ANOVA makes lots of assumptions, and we love to wring our hands about that. But it turns out to be extraordinarily robust to most violations of those assumptions. So given that, there's a lot to be said for approaches that are simple, well understood, and can be applied broadly. I don't think the situation is as dire as you suggest!
Something that's sometimes frustrating to me as a statistics educator is that the stats ed community has made at least some meaningful progress on this front -- simulation-based inference is pretty core to how a lot of us teach algebra-based intro-level statistics now, even if more sweeping changes of approach like Danny Kaplan's book (https://dtkaplan.github.io/Lessons-in-statistical-thinking/) haven't caught on -- and yet we seem to have had little success in communicating that approach & set of tools to folks teaching stats in cognate disciplines.
So I guess what I mean here is: I think what we really need is better ways for these conversations to happen across fields, so that they're not happening separately in stats, bio, psych, etc., at all different paces, starting from zero every time.