A series on basic statistics by Tom Lang

3. Hypothesis Testing

Introduction

Statistical analysis is an essential part of research, so much so that clinical and epidemiological investigations should almost never be undertaken without the close cooperation of a qualified statistician. In addition, all researchers and readers of the medical literature need a working knowledge of research designs and statistical analyses if they are to practice evidence-based medicine.
The two most common schools of statistical thinking in medicine are the Frequentists (those who use classical hypothesis tests that generate P values) and the Bayesians (those who model a process in which additional information is combined with existing information to create new information). The Frequentist approach is more common and has a longer history in medicine, but it is not easy to understand and has always had major problems, many of which are not widely known. Although the Bayesian approach is mathematically complex and therefore not easily applied without computers, it is easier to understand and avoids many of the problems of the Frequentist approach.
Here, I introduce some concepts from the Frequentist perspective and will leave the Bayesian perspective for another time. Bayesian statistics is becoming increasingly popular, however, and is slowly displacing Frequentist statistics in much medical research.

Hypothesis Testing

In hypothesis testing, a specific claim is proposed, such as "the drug will increase serum concentrations of high-density lipoprotein (HDL) by a mean of 30 mg/dL." In common terms, the question becomes "Does a drug increase HDL concentrations?" In statistical terms, however, the question is worded differently: "At the end of the study, what is the probability that the patients who took the drug were drawn from the same population as those who did not?" To understand this question and its answer, we need to review the experimental procedure.
First, we draw a sample from, say, a population of adults with HDL concentrations less than 40 mg/dL. We then randomly assign each patient to a treatment group or a control group. With a large enough sample, the groups will be "equivalent at baseline," meaning that both known and unknown characteristics will be more or less balanced between the groups. Random assignment also means that any statistical or clinical differences between the groups at baseline are the result of chance and not of selection bias in forming the experimental groups.
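As a minimal sketch of this step (the patient identifiers, the group size of 35, and the use of Python's random module are illustrative assumptions, not details from the study described here):

```python
import random

# Hypothetical list of enrolled patient identifiers (illustrative only)
patients = [f"patient_{i:03d}" for i in range(1, 71)]

random.seed(42)            # fix the seed so the allocation can be reproduced
random.shuffle(patients)   # a random order removes selection bias

half = len(patients) // 2
treatment_group = patients[:half]   # first half receives the drug
control_group = patients[half:]     # second half receives placebo

print(len(treatment_group), len(control_group))  # 35 35
```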
We administer the drug to the treatment group and a placebo to the control group. At the end of the study, we collect data on the response variable (serum HDL concentration) from both groups and compare the distributions of these concentrations to see whether they are now different; that is, whether they now appear to be two distinct populations (Figure 1).
Figure 1 Deciding whether two groups are different at the end of the experiment. In panel A, the drug greatly increased the variability of the values in the treatment group. The change in variability (statistically, the variance) is large enough to conclude that the groups are now different. In panel B, it's easy to see that there are two groups: the variances are the same, but the means are far apart, and there is no overlap in the scores. In panel C, the groups begin to look alike, and in panel D, they appear to be so similar that they are probably just two subsets of our original group. Whether the groups are different is a medical question: is the difference large enough to be clinically important?
Now we have to ask, are the groups different? Did the drug have an effect? The answer is first a medical issue: "Is the difference between the means clinically important?" This question is often overlooked by authors.
If we determined the "minimal clinically important difference" before the study, and the mean difference we found in the experiment is greater than that minimal important difference, we then have to ask, is the difference the result of the drug or of chance? This question is statistical and involves the concept of type I or "alpha error." If we attribute the difference to the drug but chance turns out to be the more plausible explanation, we have committed a type I error. Alpha error is usually set at 0.05, or a willingness to accept a type I error 5 times in 100 similar comparisons; 5 times in 100, we will wrongly attribute the difference to the drug when chance is the more probable cause.
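What alpha means can be checked with a small simulation. In the hedged sketch below (the HDL mean of 35 mg/dL, SD of 5, and group size of 35 are invented values; numpy and scipy are assumed to be available), both groups are drawn from the same population, so any "statistically significant" result is a type I error; about 5 in 100 comparisons come out significant.

```python
import numpy as np
from scipy import stats

# Simulation sketch: how often does chance alone produce P < 0.05?
rng = np.random.default_rng(0)
alpha = 0.05
n_trials = 10_000
false_positives = 0

for _ in range(n_trials):
    # Both "treatment" and "control" come from the SAME population
    # (mean HDL 35 mg/dL, SD 5), so any difference is due to chance.
    treatment = rng.normal(loc=35, scale=5, size=35)
    control = rng.normal(loc=35, scale=5, size=35)
    _, p = stats.ttest_ind(treatment, control)
    if p < alpha:
        false_positives += 1   # a type I error: chance mistaken for an effect

print(f"Type I error rate: {false_positives / n_trials:.3f}")  # about 0.05
```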


A Pretend Example that Illustrates Type I Error (Figure 2)
Figure 2 An illustration of hypothesis testing. The means of all samples taken from the control group are subtracted from the means of all samples taken from the treatment group. The differences will form a normal distribution with a mean of zero and a measure of dispersion called the standard error of the difference (SEdiff). The region beyond plus or minus 2 SEdiff is the "region of rejection" of the null hypothesis. A difference falling in this region would occur by chance less than 5 times in 100, so we call it "statistically significant."
1. We assume that the drug was ineffective; that the treatment and control groups are still equivalent at the end of the study. This assumption is the "null hypothesis of no difference."
2. We then pretend that the treatment and control groups are infinitely large.
3. We take a sample of, say, 35 adults from the treatment group and 35 adults from the control group. We subtract the mean value of the control group from the mean value of the treatment group and graph the difference.
4. We repeat this process until we have subtracted the means of all possible samples (combinations) of 35 adults from the control group from the means of all possible samples (combinations) of 35 adults from the treatment group. (Remember, this is a pretend example in which the concepts are accurate but the procedure is imaginary.) When we graph the differences between all possible pairs of sample means, the distribution of these differences will be approximately normal, which is convenient because we know a lot about normal distributions. In particular, we know the distribution is symmetrical about its mean and that the "area under the curve" can be expressed in units of standard deviations. (See the first article in this series for a more complete explanation.)
5. If the null hypothesis of no difference between means is true, the mean of this new distribution will be zero. That is, if the groups are the same, subtracting one mean from the other will usually give a difference close to zero, and the differences between all pairs of means will cluster around zero.
6. The standard deviation of this distribution of the differences between sample means is called the standard error of the difference (SEdiff). Its relationship to the normal distribution is the same as that of the standard deviation to a distribution of data and of the standard error of the mean to a distribution of all possible sample means. We call it the SEdiff because it is associated with a distribution of all possible differences between all possible samples of the two groups.
7. Because these differences are normally distributed, 95% of all differences will fall within plus or minus 2 SEdiff of the mean of zero: this range is the region of acceptance of the null hypothesis. A difference that falls in this range would occur by chance about 95 times in 100 similar comparisons under the null hypothesis, far more often than the 5% rate we set in advance.
8. A difference between two actual samples that falls beyond plus or minus 2 SEdiff would occur by chance less than 5 times in 100 (that is, P < 0.05 if alpha = 0.05) under the null hypothesis: this range is the region of rejection of the null hypothesis. In other words, even if the drug is completely ineffective, a difference as large as or larger than the one we found could still have occurred by chance. When this chance is small (less than 5 times in 100), we usually reject the null hypothesis, call the results statistically significant, and conclude that the drug was responsible for the difference between the groups. (The sketch after this list imitates the whole procedure by simulation.)
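In the hedged sketch below (the population mean of 35 mg/dL, SD of 5, and samples of 35 are invented values; numpy is assumed to be available), the null hypothesis is true by construction: both "groups" are the same population, the differences between sample means cluster around zero, and about 95% of them fall within plus or minus 2 SEdiff.

```python
import numpy as np

# Simulation sketch of steps 2-8 (illustrative values, not from the article).
rng = np.random.default_rng(1)

# Pretend the null hypothesis is true: treatment and control are the same
# "infinitely large" population (here, mean 35 mg/dL, SD 5).
def sample_mean(n=35):
    return rng.normal(loc=35, scale=5, size=n).mean()

# Steps 3-4: repeatedly subtract a control-sample mean from a treatment-sample mean.
differences = np.array([sample_mean() - sample_mean() for _ in range(20_000)])

# Steps 5-6: the differences cluster around zero; their standard deviation is SEdiff,
# which behaves like the textbook quantity sqrt(s1**2/n1 + s2**2/n2).
se_diff = differences.std()
print(f"mean difference ~ {differences.mean():.3f}")   # close to 0
print(f"SEdiff ~ {se_diff:.3f}")                       # close to sqrt(5**2/35 + 5**2/35)

# Steps 7-8: about 95% of differences fall within +/- 2 SEdiff of zero.
inside = np.mean(np.abs(differences) <= 2 * se_diff)
print(f"fraction within +/- 2 SEdiff: {inside:.3f}")   # about 0.95
```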
Figure 3 The area under the normal distribution. The "area under the curve" represents 100% of the data; in this case, the differences between the means of all possible samples of the treatment and control groups. The area under portions of the curve can be identified in units of standard deviation or, in this case, the standard error of the difference. The mean of zero plus and minus 2 standard errors of the difference includes 95% of all possible differences between sample means. A difference in this range would occur by chance in 95 of 100 similar studies, so the result would not be statistically significant.
The Actual Approach to Determining P Values

In reality, we don't get to take all possible samples of an infinite population. We take only one sample, divide it into a treatment group and a control group, and put the data from each group into an equation called the t-test. The t-test produces a number called the test statistic, which is then located on a probability distribution, from which the probability, or P value, is determined.
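As a concrete sketch of this step (the HDL values below are invented, and scipy's ttest_ind is used as a stand-in for "the t-test" described in the text):

```python
import numpy as np
from scipy import stats

# Sketch of the actual approach: one sample, split into two groups,
# compared with an unpaired (independent-samples) t-test.
treatment = np.array([38, 42, 35, 44, 40, 37, 45, 41, 39, 43], dtype=float)
control   = np.array([34, 36, 33, 38, 35, 32, 37, 36, 34, 35], dtype=float)

t_statistic, p_value = stats.ttest_ind(treatment, control)  # two-sided by default
print(f"t = {t_statistic:.2f}, P = {p_value:.4f}")

# If P < alpha (here 0.05), the difference is declared statistically significant.
```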

The P value indicates the probability of finding a difference, by chance, as large as or larger than the one observed, if the null hypothesis is true. Thus, the P value is a measure of evidence against the null hypothesis: the smaller the P value, the less the evidence in support of the null hypothesis. If the P value is smaller than the alpha level (say, 0.05), the evidence supporting the null hypothesis is so weak that the null hypothesis is rejected, and we call the results statistically significant.

A Pretend Example that Illustrates Type II Error

If the mean difference in the above experiment is less than the "minimal clinically important difference," we have to ask, "Is the lack of difference the result of an ineffective drug or of insufficient data?" After all, we tested only a few of all adults with low HDL concentrations, and we might have inadvertently sampled adults in whom the drug was ineffective. This question is also statistical and involves the concept of type II or "beta error," which in turn is related to statistical power. If we attribute the similarity between the groups to a poor drug, and insufficient data turns out to be the more plausible explanation, we have committed a type II error. Beta error is usually set at 0.2 (although 0.1 is common and other values are possible), or a willingness to accept a type II error 20 times in 100 similar comparisons; 20 times in 100 similar experiments, we will incorrectly attribute the lack of a difference to an ineffective drug when a small sample size is the more probable cause.
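Beta error can also be seen by simulation. In the hedged sketch below (a true mean increase of 3 mg/dL, an SD of 5, and only 15 patients per group are all invented values; numpy and scipy assumed), the drug genuinely works, yet the small study misses the effect much of the time.

```python
import numpy as np
from scipy import stats

# Simulation sketch of type II error (all values illustrative).
rng = np.random.default_rng(2)
alpha, n_trials, n_per_group = 0.05, 5_000, 15
missed = 0

for _ in range(n_trials):
    treatment = rng.normal(loc=38, scale=5, size=n_per_group)  # true effect present
    control = rng.normal(loc=35, scale=5, size=n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p >= alpha:
        missed += 1   # a type II error: a real effect dismissed as "no difference"

beta = missed / n_trials
print(f"beta ~ {beta:.2f}, power ~ {1 - beta:.2f}")  # power well below 80% here
```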
Just as we identified the minimal clinically important difference before the experiment, we should also determine the sample size we need to study to find that difference. This process involves a power calculation. Statistical power is defined as 1 - beta. If we set our beta error at 0.2, statistical power would be 80%. The statistical power calculation determines how many patients we need to study to have an 80% chance of detecting the minimal clinically important difference if that difference exists in the population.
The statistical power calculation has several variables. In the Table, you can see the effect of these variables on sample size. The left-most column is the minimal clinically important difference. Here, we have to find a difference between means of 5% to judge the treatment worthwhile. The second column is the standard deviation, a measure of how much variability we expect to find in the data; in this case, the value is 20 (in this example, we won't use units). The third column is the alpha level, the fourth is the desired statistical power, and the fifth column is the needed sample size. Thus, in the top row, the right-most column is the sample size we need to study to have an 80% chance of detecting a 5% difference if that difference exists in the population.
Table The factors in a power calculation. Using a one-sided test or detecting a larger difference reduces the needed sample size, whereas more variability in the data, a stricter alpha level, and greater statistical power increase the sample size. (See text for details.)
(The second row refers to a one-sided test. A two-sided or two-tailed test allows for the possibility that the "direction of the difference" will be either higher or lower than the mean of zero. In a two-tailed test, the alpha level of 5% is divided into two 2.5% areas of rejection of the null hypothesis, one on either side of the distribution of differences. In a one-tailed test, we are sure that the difference can be in only one direction, so we keep the entire 5% area of rejection on one side of the distribution.)
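One common way to carry out such a calculation is the normal-approximation formula for comparing two independent means, n = 2 (z_alpha + z_beta)^2 (SD / difference)^2 per group. The sketch below uses that formula (a standard approximation, not necessarily the one used to build the Table, so the numbers are illustrative; scipy is assumed to be available):

```python
import math
from scipy import stats

def sample_size_per_group(diff, sd, alpha=0.05, power=0.80, two_sided=True):
    """Approximate n per group for comparing two means (normal approximation).

    A standard textbook formula: n = 2 * (z_alpha + z_beta)**2 * (sd / diff)**2.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2) if two_sided else stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * (sd / diff) ** 2
    return math.ceil(n)

# The example from the text: detect a difference of 5, SD 20, alpha 0.05, power 80%
print(sample_size_per_group(diff=5, sd=20))                   # about 252 per group
print(sample_size_per_group(diff=5, sd=20, two_sided=False))  # smaller: one-sided test
print(sample_size_per_group(diff=10, sd=20))                  # smaller: larger difference
print(sample_size_per_group(diff=5, sd=20, power=0.90))       # larger: more power
```

Rerunning the function with a one-sided test, a larger difference, or greater power reproduces the directions of change summarized in the Table caption.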

The Importance of Adequate Statistical Power

The critical point to remember about statistical power is that in "under-powered" studies, a result that is not statistically significant does not mean that the groups are equivalent; it just means that too few data were collected to have a real chance of detecting a difference.
"Absence of evidence is not evidence of absence." If we want to show that two groups are similar, we have to use an "equivalence" study, which sets a "confidence band" around the mean of one group and powers the study to show that the mean of the other group falls within that band. Small differences usually require larger samples, which is why equivalence studies are often larger than studies looking for differences.

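One widely used way to judge equivalence is to compute a confidence interval for the difference between means and ask whether it lies entirely inside a prespecified equivalence band. The sketch below illustrates that idea (the data, the 3-unit margin, and the pooled-variance t interval are invented for illustration; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

# Sketch of an equivalence-style check (data and margin are invented).
new_drug = np.array([41, 39, 43, 40, 42, 38, 44, 41, 40, 42], dtype=float)
standard = np.array([40, 41, 42, 39, 43, 40, 41, 42, 39, 41], dtype=float)
margin = 3.0   # differences smaller than this are treated as clinically unimportant

diff = new_drug.mean() - standard.mean()
df = len(new_drug) + len(standard) - 2
pooled_var = ((len(new_drug) - 1) * new_drug.var(ddof=1)
              + (len(standard) - 1) * standard.var(ddof=1)) / df
se_diff = np.sqrt(pooled_var * (1 / len(new_drug) + 1 / len(standard)))
t_crit = stats.t.ppf(0.975, df)          # 95% two-sided confidence interval

low, high = diff - t_crit * se_diff, diff + t_crit * se_diff
print(f"difference = {diff:.2f}, 95% CI ({low:.2f}, {high:.2f})")
print("equivalent within the margin" if (-margin < low and high < margin)
      else "not shown to be equivalent")
```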
Bibliography

1) Rowntree D. Statistics Without Tears: An Introduction for Non-Mathematicians. London: Penguin Books, 2000.
2) Lang TA, Secic M. How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers. Philadelphia: American College of Physicians, 1997. Reprinted in English for distribution within China, 1998. Chinese translation, 2001. Second edition, 2006. Japanese translation, 2011; Russian translation, 2013.