A series on basic statistics by Tom Lang
1. Understanding Variables, Levels of Measurement, and Descriptive Statistics
Several types of variables are common in biomedical research. For simplicity, this series of articles will focus only on two (explanatory and response variables), but I will briefly describe some other types. Once the variables have been identified, data are collected about them. These data can be collected at a given "level of measurement," which determines how much information is collected about the variable. Finally, these data have to be described using certain "descriptive statistics," which summarize them so that they can be communicated and analyzed more easily.
Types of Variables
A variable is a characteristic that can take on different values from one case to another. For example, in a study of mothers, sex is not a variable because the sample consists only of women. In a study of parents, sex can be a variable because the sample contains both men and women; here, sex can be used to distinguish one case from another.
The titles of scientific articles often identify (and should identify) the relationships studied in the research. In an article titled "The Effects of Aspirin on Treating Headache," the researchers studied the relationship between aspirin, defined as the drug in some preparation or dosage, and headache, probably defined by patients' reports of how much pain they experienced.
A response (or dependent) variable is the outcome of interest, or endpoint. It is the variable presumed to be acted on by one or more other variables and is the reason the study is conducted. An explanatory (or independent) variable is the exposure or treatment that we believe is responsible for changes in the response variable. Its value is usually known or controlled by the researcher. In the above example, the response variable is the degree of pain reported, and the explanatory variable is the intervention, aspirin.
When reading a research article, it is good practice to identify the response and explanatory variables used. Variables are often identified in titles, key words, abstracts, and introductions, and especially in tables and graphs, which show the data collected on each variable.
Other types of variables are commonly used in research.

Controlled variables are variables whose values are kept constant so they don't interfere with interpreting the relationship studied. For example, in a study of aspirin, if every patient received the same dose, blood concentrations would probably be higher in smaller patients than in larger ones. To "control" for weight, we could 1) enroll only patients who weigh about the same, 2) analyze the data by dividing the patients into categories based on weight (very light, light, normal, heavy, and very heavy), or 3) analyze the data with a statistical procedure, such as regression analysis, that considers weight as a variable in the context of other variables.

Confounding variables are variables that correlate (directly or inversely) with both the response variable and the explanatory variable but that are not part of the "causal chain" under study. Such variables can make it difficult to determine whether the change in the explanatory variable really had an effect on the response variable. For example, changes in the amount of ice cream consumed in the US seem to be associated with the number of murders. If this were all the information we had, we might conclude that 1) someone who eats ice cream may be more likely to kill someone or that 2) murderers eat more ice cream. The relationship between ice cream and murder, however, is "confounded" by the fact that temperature is correlated with both ice cream eating and the murder rate: the hotter the weather, the more people tend to eat ice cream, the more time people spend outside, and the more easily they get angry. (There is more to this relationship than described, but the example makes the point.)

Extraneous variables are undesirable variables that influence the outcome of a study but are not the variables of interest. In other words, they add error or "noise" that may get in the way of understanding the relationship of interest.
Suppose a test is given to one group in the morning and to another in the afternoon. It may be that the morning group is refreshed and more alert and the afternoon group is more tired and less alert. Here, time of day could be an extraneous variable.
Levels of Measurement
The first and simplest level of measurement is the nominal level, in which data are simply named categories, such as male and female, with no inherent order. (When there are only two categories, the data are also called binomial.) The second level is the ordinal level. The categories in ordinal data are ranked, but not on a scale of equal units. Suppose we asked patients to indicate on a scale of 1 (very uncomfortable) to 5 (very comfortable) how satisfied they were with the care they received at a hospital. We have five ordinal categories. Even though these categories are numbers, it would be inappropriate to say that someone answering with a 4 was twice as comfortable as someone answering with a 2. All we can say is that one person was more satisfied than the other.
Sometimes, however, ordinal categories are treated as if they were an equal distance apart. For example, patients are often asked to indicate their degree of pain by choosing a number between zero (no pain) and 10 (the worst pain imaginable). If the preoperative pain was 8 and the postoperative pain was 2, the data may be interpreted to mean that the drop of 6 points represented a 75% reduction in pain (6/8 = 0.75). Because mathematical operations can sometimes be legitimately performed with ordinal categories, this level of measurement is also referred to as semi-quantitative data.

The third level of measurement, and the level with the most information, is the continuous level. Continuous data are counted or measured on a scale of equal intervals and, when graphed, form a distribution. Continuous data fall into two categories. Discrete data, or interval data, do not have fractions: counts of patients, for example, don't have fractions. Truly continuous data do have fractions: serum adrenaline concentrations can be measured to several decimal places, for instance. Because continuous data are measurements on a scale of equal intervals, they can be sensibly added, subtracted, multiplied, and divided. Age in years allows us to say that a 50-year-old patient is twice as old as a 25-year-old patient.
Because discrete and continuous data are handled similarly in statistical analyses, I will use the term continuous data to refer to both categories.
Researchers can often choose the level of measurement they want to use. In a study of blood pressure, a researcher might want to study hypertensive patients vs. non-hypertensive patients (a nominal/binomial level of measurement), hypotensive vs. normotensive vs. hypertensive patients (an ordinal level), or blood pressure measured in millimeters of mercury (a continuous level).
Sometimes, data collected at a continuous level of measurement are separated into a series of ordinal categories. Age measured in years at a continuous level may be turned into ordinal data by using age groups of, say, deciles (groups that each span 10 years: 0 to 9, 10 to 19, 20 to 29, and so on). However, when moving from a continuous to an ordinal level of measurement, information is lost: some of the variability in age disappears when ages from zero to 100 are collapsed into, say, 10 categories. As a result, authors may need to report 1) that this change in measurement level was made, 2) why it was made, 3) where the cut points that define the new categories are, and 4) why these cut points were chosen.
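As a minimal sketch of this kind of recoding, the function below collapses ages into the decade categories used in the example above (the cut points are illustrative, not a recommendation):

```python
# Collapse continuous ages (in years) into ordinal decade categories.
# The cut points (0 to 9, 10 to 19, ...) follow the example in the text.
def age_decade(age: int) -> str:
    low = (age // 10) * 10          # lower bound of the decade
    return f"{low} to {low + 9}"

ages = [7, 34, 35, 62, 89]
print([age_decade(a) for a in ages])
# → ['0 to 9', '30 to 39', '30 to 39', '60 to 69', '80 to 89']
```

Note that the two 30-something patients become indistinguishable after recoding, which is exactly the information loss described above.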
Categorical data (both nominal and ordinal data) can be described by giving the names of the categories (say, survivors and nonsurvivors) and the number or proportion of observations in each category: 820 of the 1000 patients (82%) lived and 180 (18%) died.
Summarizing continuous distributions, on the other hand, requires at least two numbers: a measure of center or central tendency, which identifies the bulk of the data, and a measure of dispersion or spread, which tells how much variability there is in the data.
The three most common measures of center are the arithmetic mean, the median, and the mode (Figure 2). The arithmetic mean is simply the average of all the values. The median value is the value that divides the distribution into an upper and a lower half; that is, the value at the 50th percentile of the distribution. The mode is the most common value, although it is usually used to describe a bi- or multi-modal distribution in which there are several peaks in the data, each peak being a modal score.
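The three measures of center can be computed with Python's standard-library statistics module; the data here are invented for illustration:

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 12]

print(statistics.mean(values))    # arithmetic mean: average of all values → 5
print(statistics.median(values))  # value at the 50th percentile → 4
print(statistics.mode(values))    # most common value → 3
```

Notice that the single large value (12) pulls the mean above the median, a first hint of why skewed distributions are better summarized by the median.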
The interquartile range (IQR) is the most common interpercentile range (Figure 3). It is the difference between the value at the 25th percentile and that at the 75th percentile. In other words, we divide the ordered observations (data points) into four equal parts, or quartiles. The value at the 50th percentile, between the second and third quartiles, is the median; the difference between the values at the 25th and 75th percentiles is the interquartile range. Again, however, what is often reported are the values of the quartiles themselves rather than the range between them.
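The quartiles and IQR can be sketched with the same statistics module (the exact quartile values can differ slightly between software packages, which use different estimation methods):

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 12]

# quantiles(n=4) returns the three cut points dividing the ordered data
# into quartiles: the 25th, 50th (median), and 75th percentiles.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(q1, median, q3)  # the three quartile cut points
print(iqr)             # interquartile range = Q3 - Q1
```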
The variance is calculated by taking the differences between the value of each data point and the mean of the distribution, squaring the differences (to make them positive numbers), and dividing the sum of the squares by the number of data points. The variance is rarely reported, but the term is used descriptively: "The groups had about the same variances," meaning that the variability of the data was about the same in both groups, or "The variance was much larger in the treatment group than in the control," meaning that data in the treatment group were distributed over a wider range.
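The calculation just described can be written out directly; the two groups below are invented to show a "larger variance in the treatment group" style of comparison (note that this divides by n, as in the text, whereas many packages divide by n - 1 for samples):

```python
# Variance as described in the text: square each value's difference
# from the mean, sum the squares, and divide by the number of data points.
def variance(data):
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

control   = [4, 5, 5, 6, 6, 7]   # clustered near the mean
treatment = [1, 3, 5, 6, 8, 10]  # spread over a wider range

print(variance(control))    # smaller: data vary little
print(variance(treatment))  # larger: data distributed over a wider range
```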
Technically, the standard deviation is the square root of the variance. For our purposes, the standard deviation is important because it has special properties when the data are normally distributed. Normally distributed data are data that, when graphed, form a symmetrical, bell-shaped or "Gaussian" curve. In a normal distribution, the measures of center are equal: the mean equals the median equals the mode. Also, in a normal distribution, the "area under the curve" (the number of data points between two values) can be expressed in units of standard deviation. About 68% of the values in a normal distribution will be included in the range defined by 1 standard deviation on either side of the mean; about 95% will be within 2 standard deviations on either side of the mean, and about 99% will be within 3 standard deviations on either side of the mean (Figure 4). These proportions are true for any normal distribution, irrespective of the spread of the data, whether the curve is long and flat or short and spiked.
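These proportions can be checked empirically. The sketch below simulates a large normal distribution (the mean of 100 and SD of 15 are arbitrary choices) and counts the values within 1, 2, and 3 standard deviations of the mean:

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Simulate normally distributed data; the parameters are arbitrary.
data = [random.gauss(mu=100, sigma=15) for _ in range(100_000)]
mean = statistics.mean(data)
sd = statistics.pstdev(data)  # standard deviation = square root of the variance

fractions = {}
for k in (1, 2, 3):
    inside = sum(mean - k * sd <= x <= mean + k * sd for x in data)
    fractions[k] = inside / len(data)
    print(f"within {k} SD of the mean: {fractions[k]:.1%}")  # ≈ 68%, 95%, 99.7%
```

Rerunning with a different mean and SD gives the same three percentages, illustrating that the rule holds for any normal distribution regardless of its spread.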
For example, when comparing two groups of different sizes, we have to create a common measure, usually percentages. If 40 of 60 patients in the treatment group survived and 50 of 100 patients in the control group survived, we can't directly compare the 40 with the 50 because these numbers come from different sized groups. Converting the 40 and the 50 into percentages, however, allows us to say that about 67% of the treatment group and 50% of the control group survived. The control group had a greater absolute number of survivors, but the treatment group had relatively more survivors than did the control group.
We can compare scores from different distributions in the same way, by converting the scores into units of standard deviation. A score equal to the mean is zero standard deviations from the mean; half the values are less than the score and half are greater. A score 1 SD above the mean is greater than about 84% of the values (50% plus 34%) and less than about 16%, whereas a score 1 SD below the mean is greater than about 16% of the values and less than about 84%.
For example, suppose Bob scored 90 of 100 on a biology test, and Maria scored 80 of 100 on a statistics test. We can't compare the 90 to the 80 because the topics are different and because each test has a different distribution of values. If we now express the two scores in terms of standard deviations, we might find that Bob's score of 90 was 2 SD above the mean, and Maria's was 3 SD above the mean. So, Bob did better than about 97.5% of his classmates, but Maria did better than about 99.9% of hers. Maria did relatively better, even though her raw score was less than Bob's.
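The comparison can be sketched with Python's NormalDist; the class means and SDs below are invented so that the scores land 2 and 3 SDs above the mean, as in the example:

```python
from statistics import NormalDist

# Hypothetical class results (the means and SDs are illustrative assumptions).
biology_mean, biology_sd = 70, 10  # Bob scored 90 → 2 SD above the mean
stats_mean, stats_sd = 50, 10      # Maria scored 80 → 3 SD above the mean

z_bob = (90 - biology_mean) / biology_sd
z_maria = (80 - stats_mean) / stats_sd

# Fraction of a normal distribution falling below each score:
print(f"Bob:   {z_bob} SD above the mean, better than {NormalDist().cdf(z_bob):.1%}")
print(f"Maria: {z_maria} SD above the mean, better than {NormalDist().cdf(z_maria):.1%}")
```

The cumulative fractions come out near 97.7% and 99.9%, so Maria did relatively better despite the lower raw score.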
It is important to remember that the SD indicates these proportions only for normal distributions. So, normal distributions can be appropriately summarized with means and SDs, but distributions of other shapes cannot. Other descriptive statistics are needed to describe skewed or irregularly shaped distributions, especially the median and IQR, but also sometimes the range and the mode (Figure 5).