# Applied Statistics for Experimental Biology

*Draft: 2021-09-25*

# Preface

*More cynically, one could also well ask “Why has medicine not adopted frequentist inference, even though everyone presents P-values and hypothesis tests?” My answer is: Because frequentist inference, like Bayesian inference, is not taught. Instead everyone gets taught a misleading pseudo-frequentism: a set of rituals and misinterpretations caricaturing frequentist inference, leading to all kinds of misunderstandings.* – Sander Greenland

We use statistics to learn from data with uncertainty. Traditional introductory textbooks in biostatistics implicitly or explicitly train students and researchers to “discover by p-value” using hypothesis tests (Chapter 6). Over the course of many chapters, the student learns to use a look-up table or flowchart to choose the correct “test” for the data at hand, compute a test statistic for their data, compute a *p*-value based on the test statistic, and compare the *p*-value to 0.05. Textbooks typically give very little guidance about what can be concluded if \(p < 0.05\) or if \(p > 0.05\), but many researchers conclude, incorrectly, they have “discovered” an effect if \(p < 0.05\) but found “no effect” if \(p > 0.05\).

This book is an introduction to the statistical analysis of data from biological experiments with a focus on the estimation of treatment effects and measures of the uncertainty of theses estimates. Instead of a flowchart of “which statistical test,” this book emphasizes a **regression modeling** approach using **linear models** and extensions of linear models.

“What what? I learned from the post-doc in my lab that regression was for data with a continuous independent variable and that *t*-tests and ANOVA were for data with categorical independent variables.” No! This misconception has roots in the history of regression vs. ANOVA and is reinforced by how introductory biostatistics textbooks, and their instructors, *choose* to teach statistics.

Linear regression, *t*-tests and ANOVA are special cases of a **linear model**. The different linear models in this text are all variations of the equation for a line \(Y = mX + b\) using slightly different notation

\[\begin{equation} \mathrm{E}(Y | X) = \beta_0 + \beta_1 X \tag{0.1} \end{equation}\]

The chapter An introduction to linear models explains the meaning of this notation. Here, just recognize that this same equation can be used for both classical regression *and* *t*-tests/ANOVA. In regression, \(X\) is a continuous variable. In a *t*-test or ANOVA, \(X\) is a numeric **indicator variable** indicating group membership (“wildtype” or “knockout”).

The linear model underneath a typical *t*-tests/ANOVA is some variation of

\[\begin{equation} \overline{Y}_k = \mu + \alpha_k \tag{0.2} \end{equation}\]

where \(\mu\) is the grand mean and \(\alpha_k\) is the difference between the mean of treatment *k* and the grand mean.

In this text, I use **linear model** for any model that looks like Equation (0.1) and **ANOVA model** for any model that looks like Equation (0.2) (many statistics textbooks call these “cell-mean models”). It would be more correct to call models that look like Equation (0.1) a **regression model**, since ANOVA models are also linear models, but the word “regression” has a lot of baggage because of how it is taught – as a method for data with continuous \(X\). If I said, “this is a book of regression models for analyzing your experimental data,” you might quite reasonably, but incorrectly, assume that the book wasn’t really relevant to your experiments because your independent variables are categorical (“WT” vs.“KO”) and not continuous.

Table 0.1 is a compact summary of the linear models introduced in this text, which covers a large fraction of the kinds of models researchers would need with experimental data.

## Why bother with linear models – aren’t *t*-tests and ANOVA good enough?

For many experiments, the linear models advocated here will give the same *p*-value as a *t*-test or ANOVA, which raises the question, why bother with linear models? Some answers include

**Biologically meaningful focus**. Linear models encourage looking at, thinking about, and reporting estimates of the size of a treatment effect and the uncertainty of the estimate. The estimated treatment effect is the difference in the response between two treatments. If the mean plasma glucose concentration over the period of a glucose tolerance test is 15.9 mmol/l in the knockout group and 18.9 mmol/l in the wildtype group, the estimated effect is -3.0 mmol/l. The magnitude of this effect is our measure of the difference in glucose tolerance between the two treatments. What is the physiological consequence of this difference? Is this a big difference that would excite NIH or a trivial difference that encourages us to pursue some other line of research? I don’t know the answers to these questions– I’m not a metabolic physiologist. Researchers in metabolic physiology should know the answers, but if they do, they don’t indicate this in the literature. Effect sizes are rarely reported in the experimental bench-biology literature.

What is reported are *p*-values. Extremely small *p*-values give some researchers the confidence that an effect is large or important. This confidence is unwarranted. *P*-values are not a measure of effect size. If the conduction of the experiment and analysis of the results closely approximate the model underlying the computation of the *p*-value, then a *p*-value dampens the frequency that we are fooled by randomness and gives a researcher some confidence in the direction (positive or negative) of an effect.

*P*-values are neither necessary nor sufficient for good data analysis. But, a *p*-value is a useful tool in the data analysis toolkit. Importantly, the estimation of effects and uncertainty and the computation of a *p*-value are not alternatives. Throughout this text, linear models are used to compute a *p*-value *in addition to* the estimates of effects and their uncertainty.

**NHST Blues** – The emphasis on *p*-values as *the* measure to report is a consequence of the “which statistical test?” strategy of data analysis. This practice, known as Null-Hypothesis Significance Testing (NHST), has been criticized by statisticians for many, many decades. Nevertheless, introductory biostatistics textbooks written by both biologists and statisticians continue to organize textbooks around a collection of hypothesis tests, with a great deal of emphasis on “which statistical test?” and much less emphasis on estimation and uncertainty. The NHST/which-statistical-test strategy of learning or doing statistics is easy in that it requires little understanding of the statistical model underneath the tests and its assumptions, limitations, and behavior. The NHST strategy in combination with point-and-click software enables mindless statistics and encourages the belief that statistics is a tool like a word processor is a tool, afterall, a rigorous analysis of one’s data requires little more than getting p-values and creating bar plots. Indeed, many PhD programs in the biosciences require no statistics coursework and the only training available to students is from the other graduate students and postdocs in the lab. As a consequence, the biological sciences literature is filled with error bars that imply data with negative values and p-values that have little relationship to the probability of the data under the null. More importantly for science, the reported statistics are often not doing for the study what the researchers and journal editors think they are doing.

**Flexibility**. Linear models and extensions of linear models are all variations of \(Y = \beta_0 + \beta_1 X\). Generalizations of this basic model (see Table 0.1) include linear models with added covariates or multiple factors, generalized least squares models for heterogeneity of variance or correlated error, linear mixed models for correlated error or hierarchical grouping, generalized linear models for samples from non-normal distributions, generalized additive models for responses that vary non-linearly over time, causal graphical models for inferring causal effects with observational data, multivariate models for multiple responses, and some machine learning models. This book is not a comprehensive source for any of these methods but an introduction to the common foundations of all. 3 related - no equivalent- modern (glm v non-parametric) related to #1 and #2
- highlight experimental issues (correlated error)
**Gateway drug**. Many of the statistical models used in genomics and neurosciences are variations of the linear models introduced here. There will be a steep learning curve for these methods if your statistical training consists of*t*-tests, ANOVA, and Mann-Whitney non-parametric tests.

## What is unusual about this book?

**Real data sets from the recent experimental biology literature**. All of the data in this text are from articles in which the researchers have followed open science best practices and made the data freely available by archiving the data with the article. This not only let’s me explain modern statistics relevant to experimental biologist’s experiments using data collected by experimental biologists but also allows me to connect the statistical explanation with the goals of the experiment. A potential downside of using real data is that understanding any analysis requires some level of understanding of the underlying biology and the underlying methods of data collection. I try to give enough background biology to understand the motivation of the experiment and experimental methodology to understand the data.- Because I scan many, many articles in experimental biology looking for good archived data to use as examples in this text, I’ve developed an appreciation for the variation in how experimental biologists think about and do statistics. A consequence of this is the recognition of the many ways that researchers use non-best practices and occasionally worst practices. Much of the material in each chapter and one entire chapter (Issues in inference) is devoted to addressing the consequences of these non-best and worst practices.