A series of basic statistics by Tom Lang

1. Understanding Variables, Levels of Measurement, and Descriptive Statistics

Introduction

Science is the search for relationships. It advances by identifying, describing, predicting, and sometimes eventually controlling relationships. In turn, relationships are composed of variables.
Several types of variables are common in biomedical research. For simplicity, this series of articles will focus only on two (explanatory and response variables), but I will briefly describe some other types. Once the variables have been identified, data are collected about them. These data can be collected at a given "level of measurement," which determines how much information is collected about the variable. Finally, these data have to be described using certain "descriptive statistics," which summarize them so that they can be communicated and analyzed more easily.

Types of Variables

A variable is a characteristic of a person, place, or thing that has two or more observable or measurable values and that can therefore distinguish one case from another. (Science depends on measurement: if we can't measure something, we can't do science on it. However, measurement is a separate, if fascinating, topic that I won't address in these articles.)
For example, in a study of mothers, sex is not a variable because the sample consists only of women. In a study of parents, sex can be a variable because the sample contains both men and women; sex can be used to distinguish one case from another.
The titles of scientific articles often identify (and should identify) the relationships studied in the research. In an article titled "The Effects of Aspirin on Treating Headache," the researchers studied the relationship between aspirin, defined as the drug in some preparation or dosage, and headache, probably defined by how much pain patients with headache reported.
A response (or dependent) variable is the outcome of interest; an endpoint. It is the variable presumed to be acted on by another variable(s) and is the reason the study is conducted. An explanatory (or independent) variable is the exposure or treatment that we believe is responsible for changes in the response variable. Its value is usually known or controlled by the researcher. In the above example, the response variable is the degree of pain reported and the explanatory variable is the intervention, aspirin.
When reading a research article, it is good practice to identify the response and explanatory variables used. Variables are often identified in titles, key words, abstracts, introductions, and especially in tables and graphs, which show the data collected on each variable.

Other types of variables are commonly used in research:
・Control variables are variables whose values are kept constant so they don't interfere with interpreting the relationship studied. For example, in a study of aspirin, if every patient received the same dose, blood concentrations would probably be higher in smaller patients than in larger ones. To "control" for weight, we could 1) enroll only patients who weigh about the same, 2) analyze the data by dividing the patients into categories based on weight (very light, light, normal, heavy, and very heavy), or 3) analyze the data with a statistical procedure that would consider weight as a variable in the context of other variables, such as regression analysis.
・Confounding variables are variables that correlate (directly or inversely) with both the response variable and the explanatory variable but that are not part of the "causal chain" under study. Such variables can make interpreting the results difficult when trying to determine whether the change in the explanatory variable really had an effect on the response variable. For example, changes in the amount of ice cream consumed in the US seem to be associated with the number of murders. If this were all the information we had, we might conclude that 1) someone who eats ice cream may be more likely to kill someone or that 2) murderers eat more ice cream. The relationship between ice cream and murder, however, is "confounded" by the fact that temperature is correlated with both ice cream eating and the murder rate: the hotter the weather, the more people tend to eat ice cream, the more time people spend outside, and the more easily they get angry. (There is more to this relationship than described, but the example makes the point.)
・Extraneous variables are undesirable variables that influence the outcome of a study but are not the variables of interest. In other words, they add error or "noise" that may get in the way of understanding the relationship of interest. Suppose a test is given to one group in the morning and to another in the afternoon. It may be that the morning group is refreshed and more alert and the afternoon group is more tired and less alert. Here, time of day could be an extraneous variable.
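The ice cream and murder example can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: all numbers are invented, temperature is made to drive both quantities, and neither quantity affects the other. The helper names `pearson` and `residuals` are hypothetical, and "adjusting" here means removing a straight-line fit on temperature, which is one simple form of the regression approach mentioned above.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Invented data: temperature drives both quantities; neither drives the other.
temperature = [rng.gauss(20, 8) for _ in range(1000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
murders = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson(ice_cream, murders)
adjusted_r = pearson(residuals(ice_cream, temperature),
                     residuals(murders, temperature))

print(f"ignoring temperature:      r = {raw_r:.2f}")
print(f"adjusting for temperature: r = {adjusted_r:.2f}")
```

In this invented data set, the strong raw correlation largely disappears once temperature is taken into account, which is what we would expect of a confounded relationship.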

Levels of Measurement

Once we have variables, we need to collect data about them. We can do so at one of four "levels" or "scales" of measurement. Each level is defined by how much information an observation or a measurement has. There are two broad categories of levels of measurement, categorical and continuous. Not surprisingly, categorical data consist of categories, and continuous data are measurements on a scale of equal intervals. Categorical data are sometimes called qualitative data because some quality of the observation is used to put the observation into one (and only one) of a series of categories. Likewise, continuous data are sometimes referred to as quantitative data because a quantity of something is being measured.

The Nominal Level of Measurement

At the lowest level of measurement is the "nominal" or named scale of measurement (Figure 1). Nominal data consist of two or more categories of observations that have no inherent order. Examples are blood type (A, B, AB, and O); an intervention (aspirin, ibuprofen, acetaminophen, naproxen); and treatment center (Tokyo, Osaka, Yokohama). In these examples, blood type, the intervention, and treatment center are the variables, and the categories listed are the values of each variable.
Figure 1 Examples of categorical data. Treatment is a binomial variable with two categories, the treatment and the control groups. Time from notice to response is an ordinal variable with three ranked categories, 0 to 7 days, 8 to 14 days, and 15 to 21 days.
Another important type of nominal data is binomial ("two names") data, which have only two categories, such as alive or dead, male or female, heads or tails in a coin flip, and treatment or control.

The Ordinal Level of Measurement

The second level of measurement, which is also categorical, is the ordinal scale of measurement (Figure 1). Ordinal data consist of two or more categories of observations that do have an inherent order. Treatment groups might consist of low, medium, and high doses, for example, and age groups might consist of infant, child, adolescent, young adult, and older adult. The data are still categorical, but they are ranked by the nature of what is being measured. In the age categories listed, we may not know how old a given young adult is, but we know that everyone in that category is older than those in the infant, child, and adolescent categories and younger than those in the older adult category. Again, the variables in these examples are medication strength and age.
The categories in ordinal data are ranked, but not on a scale of equal units. Suppose we asked patients to indicate on a scale of 1 (very uncomfortable) to 5 (very comfortable) how comfortable they were with the care they received at a hospital. We have five ordinal categories. Even though these categories are numbers, it would be inappropriate to say that someone answering with a 4 was twice as comfortable as someone answering with a 2. All we can say is that one person was more comfortable than the other.
Sometimes, however, ordinal categories are treated as if they were an equal distance apart. For example, patients are often asked to indicate their degree of pain by choosing a number between zero (no pain) and 10 (the worst pain imaginable). If the preoperative pain was 8 and the postoperative pain was 2, the data may be interpreted to mean that the drop of 6 points represented a 75% reduction in pain (6/8 = 0.75). Because mathematical operations can sometimes be legitimately performed with ordinal categories, this level of measurement is also referred to as semi-quantitative data.

The Continuous Level of Measurement

The third level of measurement, and the level with the most information, is the continuous level. Continuous data are counted or measured on a scale of equal intervals and, when graphed, form a distribution. There are two categories of continuous data. Discrete data, or interval data, do not have fractions. Counts of patients, for example, don't have fractions. Truly continuous data do have fractions: serum adrenaline concentrations can be measured in picograms per milliliter to several decimal places, for instance. Because continuous data are measurements on a scale of equal intervals, they can be sensibly added, subtracted, multiplied, and divided. Age in years allows us to say that a 50-year-old patient is twice as old as a 25-year-old patient.
Because discrete and continuous data are handled similarly in statistical analyses, I will use continuous data to refer to both categories.
Researchers can often choose the level of measurement they want to use. In a study of blood pressure, a researcher might want to study hypertensive patients vs. non-hypertensive patients (a nominal/binomial level of measurement), hypotensive vs. normotensive vs. hypertensive patients (an ordinal level), or blood pressure measured in millimeters of mercury (a continuous level).
Sometimes, data collected at a continuous level of measurement are separated into a series of ordinal categories. Age measured in years at a continuous level may be turned into ordinal data by using age groups of, say, decades (groups that each span 10 years: 0 to 9, 10 to 19, 20 to 29, and so on). However, when moving from a continuous to an ordinal level of measurement, information is lost. This loss means that some of the variability in age will be lost when ages from zero to 100 are collapsed into, say, 10 categories. As a result, authors may need to report 1) that this change in measurement level was made, 2) why it was made, 3) where the cut points are that define the new categories, and 4) why these cut points were chosen.
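The loss of information can be seen in a few lines of code. The sketch below uses invented ages and a hypothetical helper, `decade_group`, that places each age into a 10-year group with cut points at 0, 10, 20, and so on.

```python
# Invented ages in years, measured at the continuous level.
ages = [3, 17, 24, 29, 31, 38, 45, 52, 67, 81]

def decade_group(age):
    """Return the label of the 10-year group that an age falls into."""
    low = (age // 10) * 10
    return f"{low} to {low + 9}"

groups = [decade_group(age) for age in ages]
for age, group in zip(ages, groups):
    print(age, "->", group)

# Ages 24 and 29 both become "20 to 29": once the data are grouped,
# the 5-year difference between these two patients can no longer be seen.
```

Note that the cut points are a choice. Different cut points would place some patients into different groups, which is why they need to be reported.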

Descriptive Statistics

Now that we have variables with data collected at various levels of measurement, we need to communicate the data to others. We can do so with words and with images.
Categorical data (both nominal and ordinal data) can be described by giving the names of the categories (say, survivors and nonsurvivors) and the number or proportion of observations in each category: 820 of the 1000 patients (82%) lived and 180 (18%) died.
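As a minimal sketch, the counts and percentages in this example could be produced as follows. The list of outcomes is constructed here only to match the figures in the text. In a real study it would be one recorded value per patient.

```python
from collections import Counter

# One outcome per patient, built to match the example: 820 lived, 180 died.
outcomes = ["lived"] * 820 + ["died"] * 180

counts = Counter(outcomes)
total = sum(counts.values())

# For each category, keep the count and the percentage of all observations.
summary = {category: (n, 100 * n / total) for category, n in counts.items()}

for category, (n, percent) in summary.items():
    print(f"{category}: {n} of {total} ({percent:.0f}%)")
# lived: 820 of 1000 (82%)
# died: 180 of 1000 (18%)
```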
Summarizing continuous distributions, on the other hand, requires at least two numbers: a measure of center or central tendency, which identifies the bulk of the data, and a measure of dispersion or spread, which tells how much variability there is in the data.
The three most common measures of center are the arithmetic mean, the median, and the mode (Figure 2). The arithmetic mean is simply the average of all the values. The median value is the value that divides the distribution into an upper and a lower half; that is, the value at the 50th percentile of the distribution. The mode is the most common value, although it is usually used to describe a bi- or multi-modal distribution in which there are several peaks in the data, each peak being a modal score.
Figure 2 The three most common measures of center on a non-normal or skewed distribution. The mode is the most common value, the median divides the distribution into an upper and a lower half, and the mean is the arithmetic average of all the values. The mean has been pulled to the right by the higher values at the high end of the scale. Such a distribution is called "right skewed" because of these higher values.
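The three measures of center can be computed with Python's standard `statistics` module. The values below are invented to form a small right-skewed sample and are not the data behind Figure 2.

```python
import statistics

# Invented right-skewed sample: most values are low, one is much higher.
values = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]

mean = statistics.mean(values)      # arithmetic average of all values
median = statistics.median(values)  # value at the 50th percentile
mode = statistics.mode(values)      # most common value

print("mode:", mode, "median:", median, "mean:", mean)
# The single high value (21) pulls the mean (6) above the median (4),
# while the mode (3) stays at the peak of the data.
```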
The four most common measures of dispersion are the range, the interpercentile (usually the interquartile) range, the variance, and the standard deviation. Here the range is the difference between the minimum and maximum values, although these values are often thought to be the range itself. Scores from 10 to 15 have a range of 5, but we say the "scores ranged from 10 to 15." Giving the minimum and maximum values fixes them on the measurement scale, whereas giving only the range doesn't indicate where on the scale the data are. Values of 1005 to 1010 also have a range of 5, but at a very different location on the scale.
The interquartile range (IQR) is the most common interpercentile range (Figure 3). It is the difference between the value at the 25th percentile and that at the 75th percentile. In other words, we divide the number of observations (data points) into four equal parts, or quartiles. The value at the 50th percentile between the second and third quartiles is the median. The difference between the 25th and 75th percentiles is the interquartile range. Again, however, what is often reported are the values of the quartiles themselves rather than the range between them.
Figure 3 The interquartile range. The minimum and maximum values in the top two panels are similar, but the distributions are clearly not. Data in the top panel are more dispersed than those in the second panel, a difference reflected in the values at the 25th and 75th percentiles. Remember that the interquartile range is actually the difference between the 25th and 75th percentiles, although the values of these percentiles are often reported instead.
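The range and the interquartile range can be sketched as follows. The scores are invented to run from 10 to 15, as in the example above. Statistical packages use slightly different rules for locating percentiles in small samples. This sketch assumes the "inclusive" method of `statistics.quantiles`, so other software may give slightly different quartiles.

```python
import statistics

# Invented scores running from 10 to 15.
scores = [10, 11, 12, 12, 13, 13, 14, 15, 15]

value_range = max(scores) - min(scores)

# n=4 asks for the three cut points that divide the data into quartiles.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(f"scores ranged from {min(scores)} to {max(scores)} (range {value_range})")
print(f"25th percentile {q1}, median {median}, 75th percentile {q3}, IQR {iqr}")

# Shifting every score by 995 moves the data to 1005-1010 on the scale,
# but the range and the IQR stay the same.
shifted = [s + 995 for s in scores]
shifted_range = max(shifted) - min(shifted)
```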
The variance and standard deviation are a little harder to explain. The math is not difficult, but we'll skip the details here.
The variance is calculated by taking the differences between the value of each data point and the mean of the distribution, squaring the differences (to make them positive numbers), and dividing the sum of the squares by the number of data points. The variance is rarely reported, but the term is used descriptively: "The groups had about the same variances," meaning that the variability of the data was about the same in both groups, or "The variance was much larger in the treatment group than in the control," meaning that data in the treatment group were distributed over a wider range.
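The calculation just described can be followed step by step in code. The five values are invented for illustration. The result is also checked against `statistics.pvariance`, which uses the same definition (dividing by the number of data points).

```python
import statistics

values = [4, 8, 6, 5, 7]  # invented data points

mean = sum(values) / len(values)

# Difference between each data point and the mean, squared to make it positive.
squared_differences = [(v - mean) ** 2 for v in values]

# Sum of the squares divided by the number of data points.
variance = sum(squared_differences) / len(values)

# The square root of the variance is the standard deviation.
standard_deviation = variance ** 0.5

print("mean:", mean)          # 6.0
print("variance:", variance)  # 2.0
print("library check:", statistics.pvariance(values))
```

Many textbooks and software packages divide by one less than the number of data points when the data are a sample from a larger population (`statistics.variance` does this), so reported variances may differ slightly from the one computed here.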
Technically, the standard deviation is the square root of the variance. For our purposes, the standard deviation is important because it has special properties when the data are normally distributed. Normally distributed data are data that, when graphed, form a symmetrical, bell-shaped or "Gaussian" curve. In a normal distribution, the measures of center are equal: the mean equals the median equals the mode. Also, in a normal distribution, the "area under the curve" (the proportion of data points between two values) can be expressed in units of standard deviation. About 68% of the values in a normal distribution will be included in the range defined by 1 standard deviation on either side of the mean; about 95% will be within 2 standard deviations on either side of the mean; and about 99.7% will be within 3 standard deviations on either side of the mean (Figure 4). These proportions are true for any normal distribution, irrespective of the spread of the data, whether the curve is long and flat or short and spiked.
Figure 4 The "areas of the curve" in a normal distribution as indicated in units of standard deviations. In a normal distribution, the mean, median, and mode are equal and correspond to an SD value of zero. The area can be calculated for any SD. An SD of 1.3 indicates a value that is larger than 90% of all values and smaller than 10%, and one of -0.6 indicates a value that is larger than 27% of all values and smaller than 73%.
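These areas can be checked with the `NormalDist` class in Python's standard `statistics` module. The helper `share_within` is a hypothetical name introduced for this sketch.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs on either side of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
# within 1 SD of the mean: 68.3%
# within 2 SD of the mean: 95.4%
# within 3 SD of the mean: 99.7%

# Proportion of values that lie below a given number of SDs from the mean.
print(f"below +1.3 SD: {standard_normal.cdf(1.3):.0%}")   # 90%
print(f"below -0.6 SD: {standard_normal.cdf(-0.6):.0%}")  # 27%
```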

When comparing two groups of different sizes, we have to create a common measure, usually percentages. If 40 of 60 patients in the treatment group survived and 50 of 100 patients in the control group survived, we can't directly compare the 40 with the 50 because these numbers come from different-sized groups. Converting the 40 and the 50 into percentages, however, allows us to say that 67% of the treatment group and 50% of the control group survived. The control group had a greater absolute number of survivors, but the treatment group had relatively more survivors than did the control group.
We can compare scores from different distributions in the same way, by converting the scores into units of standard deviation. A score equal to the mean value has an SD of zero; half the values are less than the score and half are greater. A score 1 SD above the mean is greater than about 84% of the values (50% plus 34%) and less than about 16%, whereas a score 1 SD below the mean is greater than about 16% of the values and less than about 84%.
For example, suppose Bob scored 90 of 100 on a biology test, and Maria scored 80 of 100 on a statistics test. We can't compare the 90 to the 80 because the topics are different and because each test has a different distribution of values. If we now express the two scores in terms of standard deviations, we might find that Bob's score of 90 was 2 SD above the mean, and Maria's was 3 SD above the mean. So, Bob did better than about 97.5% of his classmates, but Maria did better than about 99.9% of hers. Maria did relatively better, even though her raw score was less than Bob's.
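The comparison can be sketched in code. The text does not give the class means and SDs, so the ones below are invented so that Bob's score falls 2 SD and Maria's 3 SD above the mean. The function `z_score` is a hypothetical helper, and the percentages assume that both sets of test scores are normally distributed.

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """Express a score as the number of SDs it lies above (+) or below (-) the mean."""
    return (score - mean) / sd

# Invented class statistics, chosen to reproduce the 2 SD and 3 SD in the example.
bob_z = z_score(90, mean=80, sd=5)    # biology test
maria_z = z_score(80, mean=65, sd=5)  # statistics test

for name, z in (("Bob", bob_z), ("Maria", maria_z)):
    share_below = NormalDist().cdf(z)  # standard normal: mean 0, SD 1
    print(f"{name}: {z:+.1f} SD, better than about {share_below:.1%} of classmates")
# Bob: +2.0 SD, better than about 97.7% of classmates
# Maria: +3.0 SD, better than about 99.9% of classmates
```

The exact figure for 2 SD (97.7%) is slightly higher than the rounded 97.5% that follows from the "about 95% within 2 SD" rule. The conclusion is the same either way.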

It is important to remember that the SD indicates these proportions only for normal distributions. So, normal distributions can be appropriately summarized with means and SDs, but distributions of other shapes cannot. Other descriptive statistics are needed to describe skewed or irregularly shaped distributions, especially the median and IQR, but also sometimes the range and the mode (Figure 5).

Figure 5 A summary of descriptive statistics. a) The data in the order in which they were collected.
b) Graphing these data reveals the minimum and maximum values, and the cluster of values in the center of the distribution is obvious.
c) The values listed in order showing the median, interquartile range, and standard deviations. The median at the 50th percentile is 79.5 because it falls between 79 and 80. The 25th and 75th percentiles are 74 and 84, respectively, and one standard deviation on either side of the mean (at 79.1), which includes about 68% of the values (here, 34 of the 50 data points), is defined by the values at 71.4 and 86.8.
d) The standard descriptive statistics for the 50 data points.
The SD is commonly used in biomedical research, but it has both advantages and disadvantages. It underlies many statistical procedures and calculations and is useful for teaching various concepts. However, because most biological data are not normally distributed, the median and IQR should be used much more than they usually are. Many authors summarize all their data with means and SDs, irrespective of the shape of the distributions, a practice that is inaccurate and discouraged by most authorities. In fact, it is the single most cited statistical reporting error in the literature.
