A series on basic statistics by Tom Lang

6. Correlation and Linear Regression Analysis

Introduction

Two variables are said to be related when a change in one is accompanied by a change in the other. When two nominal (or sometimes categorical) variables are related, they are said to be associated. When two continuous (or sometimes categorical) variables are related, they are said to be correlated. This article describes correlation analysis and linear regression analysis.

Correlation Analysis

Correlation can be graphed, which makes it easier to see. One variable is plotted on the X-axis and the other on the Y-axis. The resulting graph, sometimes called a "scatter plot" because the data points are "scattered" over the data field, is easily interpreted. A circle or oval drawn around most of the data will indicate the relationship between the variables (Figure 1). An oval that rises from lower left to upper right indicates a positive relationship, in which the variables increase together, whereas an oval that drops from upper left to lower right indicates a negative relationship, in which one variable increases as the other decreases. An oval that is almost circular indicates no relationship: for any value of X, Y can take almost any value in its range.
Figure 1 Scatter plots showing data that are A) positively correlated, B) negatively correlated, and C) not correlated.
The most common measures of correlation are:
・Pearson's product-moment correlation coefficient, r, which assesses the relationship between two normally distributed, continuous variables.
・Spearman's rank correlation coefficient, rho (ρ, the Greek letter, pronounced "row"), which assesses the relationship between two continuous variables of a distribution of any shape.
・Kendall's rank correlation coefficient, tau (τ, the Greek letter, pronounced "tah-ow"), which assesses the relationship between two ordinal variables or between one ordinal and one continuous variable.
・Intraclass and interclass correlation coefficients, which assess agreement within or between raters, respectively, who provide judgements on the same quantity. These coefficients are often seen in studies of diagnostic procedures, where two raters evaluate the same image, such as a radiograph or a pathology slide.
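These coefficients are routinely computed by statistical software. As a rough illustration only, the sketch below shows how they might be calculated in Python with SciPy; the data are made up.

```python
# Minimal sketch: computing the common correlation coefficients with SciPy
# (the data below are invented purely for illustration).
from scipy import stats

x = [1.2, 2.4, 3.1, 4.8, 5.0, 6.7, 7.3, 8.9]
y = [2.0, 2.9, 3.5, 5.1, 4.8, 6.9, 7.7, 9.2]

r, p_r = stats.pearsonr(x, y)        # Pearson's r: two normally distributed continuous variables
rho, p_rho = stats.spearmanr(x, y)   # Spearman's rho: continuous variables, any distribution shape
tau, p_tau = stats.kendalltau(x, y)  # Kendall's tau: ordinal (ranked) variables

print(f"r = {r:.2f}, rho = {rho:.2f}, tau = {tau:.2f}")
```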
Unlike association analysis, in which association is usually declared to be present or absent on the basis of the P value being statistically significant or not, correlation is a matter of degree. All of the above correlation coefficients have values ranging from –1 to +1, where +1 is a perfect positive correlation, in which both variables increase together; 0 is no correlation, meaning that the variables are unrelated; and –1 is a perfect negative correlation, in which one variable increases as the other decreases.
The fact that correlation is not established as present or absent based on the P value also means that the result has to be interpreted. Describing the correlation as weak, moderate, or strong depends on the medicine involved, not on the size of the correlation coefficient itself, despite what some authors have proposed (Figure 2). For example, the concentration of a substance in an IV infusion should be highly correlated with its concentration in the blood. In such a case, a correlation coefficient of 0.85, often described as a high correlation, may be unacceptably low.
Figure 2 The strength of the correlation depends on the medicine, not on descriptive terms, as shown here. For example, the concentration of a substance in an IV infusion should be highly correlated with serum concentration; r = 0.85 may be unacceptably low
Correlation coefficients are estimates and so should usually be accompanied by confidence intervals.
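One common way to obtain such a confidence interval for Pearson's r is the Fisher z-transformation. The sketch below assumes that approach; the values of r and n are made up.

```python
# Minimal sketch: an approximate 95% confidence interval for Pearson's r
# using the Fisher z-transformation (r and n are hypothetical values).
import math

r, n = 0.85, 50                       # observed correlation and sample size
z = math.atanh(r)                     # Fisher z-transform of r
se = 1 / math.sqrt(n - 3)             # standard error of z
lo, hi = z - 1.96 * se, z + 1.96 * se
ci_low, ci_high = math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

print(f"r = {r:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```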

Simple Linear Regression Analysis

Linear regression analysis predicts the value of a continuous variable whose value is unknown from one or more variables whose values are known. When there is only one predictor variable, the analysis is described as "simple." When there are two or more predictor variables, the analysis is described as "multiple."
Linear regression analysis extends correlation analysis by "fitting" a "least-squares regression line" to the scatter plot. Basically, this line is the line that passes as close as possible to all the data points. (More precisely, the distance from each data point to the line is squared, and the line with the smallest sum of these squared distances, the "least-squares line," is the line that best summarizes the data; Figure 3.)
Figure 3 The "least-squares regression line" fitted to the scatter plot. The distance from each data point to the line is squared, and the line with the smallest sum of these squared distances (the "least-squares line") is the line that best summarizes the data. The equation for this line is the simple linear regression model.
Linear regression analysis also assumes that the known and unknown variables are linearly related; that is, the data must be adequately summarized by a least-squares line. So, a simple linear regression "predictive model" is the basic algebraic equation for a line: y = mx + b, where y is the variable whose value we want to predict, m is the slope of the line (here, the regression coefficient or the "beta weight" that indicates how much y will change for each unit change in x), x is the known value, and b is the "y intercept point," or the value of y where the regression line crosses the y axis (Figure 4).
Figure 4 A simple linear regression "predictive model" is the basic algebraic equation for a line: y=mx + b, where y is the variable whose value we want to predict, m is the slope of the line (here, the regression coefficient or the "beta weight" that indicates how much y will change for each unit change in x; the orange line), x is the known value, and b is the "y intercept point," or the value of y where the regression line crosses the y axis: here, 1. In this model, when X is 6, the model predicts that Y will be 5.5. When X is 9, the model predicts that Y will be 7.5.
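As a sketch of how such a line is fitted in practice, the example below uses NumPy's least-squares polynomial fit on made-up data to estimate the slope m and intercept b, and then predicts y for a new value of x.

```python
# Minimal sketch: fitting y = mx + b by least squares (made-up data).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.9, 2.6, 3.3, 4.1, 4.6, 5.4, 6.2, 6.7])

m, b = np.polyfit(x, y, deg=1)        # slope (regression coefficient) and y intercept
y_pred_at_6 = m * 6 + b               # the model's prediction when x = 6

print(f"fitted line: y = {m:.2f}x + {b:.2f}; predicted y at x = 6: {y_pred_at_6:.2f}")
```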
The assumption of a linear relationship needs to be confirmed and reported with the analysis. This assumption can be tested with an analysis of "residuals." A residual is just the difference between the measured value of y and the value of y predicted by the line for the same value of x (Figure 5). If the relationship between x and y is linear, a graph of residuals will consist of a small band close to a zero difference that spans the length of the X-axis. That is, the differences between the measured and predicted values of y are small for all values of x, and the model will predict the value of y well (Figure 6).
Figure 5 A residual is the difference between the actual value and the value predicted by the model (here, the gray line). Graphing the residuals helps determine whether the underlying relationship among the data is linear.

Figure 6 Graphs of residuals for different hypothetical simple linear regression models. A) A graph confirming the linearity of the data: the differences are small and close to the line of zero differences throughout the range of values on the X-axis. B) A graph showing that the underlying relationship is linear, but there is much more variability in the data, meaning the model will probably not predict as well as the model shown in A.
On the other hand, if the graph of residuals is a wide horizontal line along the range of x, the relationship is still linear, but the variability of the data is greater, which means the model will not predict quite as well. Any other shape of a graph of residuals indicates a non-linear relationship (Figure 7).
Figure 7 Models in which the data are not linearly related and so should be analyzed with another form of regression analysis.
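A residual analysis like the one described above might look like the following sketch, which fits a line to made-up data and plots the residuals against x; a narrow, flat band around zero supports the assumption of linearity.

```python
# Minimal sketch: a graph of residuals for a simple linear regression (made-up data).
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.9, 2.6, 3.3, 4.1, 4.6, 5.4, 6.2, 6.7])

m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)           # measured y minus predicted y

plt.scatter(x, residuals)
plt.axhline(0, color="gray")          # the line of zero difference
plt.xlabel("x")
plt.ylabel("residual (measured y minus predicted y)")
plt.show()
```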
In addition to confirming that the assumption of linearity has been met, the "coefficient of determination," which is the square of the correlation coefficient, r2, should be reported. The importance of r2 is that it indicates how good the model is: how much variability in y is explained by knowing x. Values closer to zero mean that the model does not predict well, and values closer to 1 indicate that it predicts better.
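In a simple linear regression, r2 can be obtained by squaring the correlation coefficient or, equivalently, from the residual and total sums of squares, as in this sketch (same made-up data as above).

```python
# Minimal sketch: the coefficient of determination, r2 = 1 - SS_residual / SS_total
# (made-up data).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.9, 2.6, 3.3, 4.1, 4.6, 5.4, 6.2, 6.7])

m, b = np.polyfit(x, y, deg=1)
y_pred = m * x + b
ss_res = np.sum((y - y_pred) ** 2)    # variation in y not explained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r2 = 1 - ss_res / ss_tot

print(f"r2 = {r2:.2f}")               # closer to 1 means the model predicts better
```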
Finally, the model needs to be "validated," or tested to determine whether it is really modeling the data well. One form of validation is to develop the model on, say, 80% of the data and then to see how well it predicts in the remaining 20%. If the values of r2 are similar, the model is considered to be validated. Another way to validate the model is to compare it with an existing model developed by someone else on a similar set of data. Again, if the values of r2 are similar, the model is considered to be validated.
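The 80%/20% form of validation described above might be sketched as follows, using scikit-learn on made-up data; the split proportions and random seed are arbitrary choices.

```python
# Minimal sketch: developing a model on 80% of the data and checking r2 on the
# remaining 20% (made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 1.0 + 0.75 * x.ravel() + rng.normal(0, 0.5, size=200)

x_dev, x_val, y_dev, y_val = train_test_split(x, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(x_dev, y_dev)
r2_dev = r2_score(y_dev, model.predict(x_dev))
r2_val = r2_score(y_val, model.predict(x_val))

# Similar development and validation values suggest the model is validated.
print(f"r2 (development) = {r2_dev:.2f}, r2 (validation) = {r2_val:.2f}")
```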
An example of a correctly reported simple regression analysis is shown below.

The simple linear regression model we developed for predicting serum drug concentrations from weight was: Y = 12.6 + 0.25X. The slope of the regression line was significantly greater than zero, indicating that serum levels tend to increase as weight increases (slope = 0.25; 95% CI = 0.19 to 0.31; t451 = 8.3; P < 0.001; r2 = 0.67)

Where:

・Y is the drug concentration in mg/dL

・12.6 is the Y-axis intercept point

・X is weight in kg

・0.25 is the slope of the regression line or the regression coefficient or the beta weight;
 for each additional kilogram of weight, drug concentration increased by 0.25 mg/dL

・0.19 to 0.31 is the 95% confidence interval for the slope of the line:
 if the study was repeated 100 times with data from the same population,
 we would expect the slope of the regression line to fall between 0.19 and 0.31 in 95 of
 the studies

・t451 = 8.3 is the value of the t statistic with "451 degrees of freedom,"
 numbers that are an intermediate step to determining the P value

・P < 0.001 is the probability of finding a slope this different from zero
 (a flat, horizontal line) if there were no relationship between x and y

・r2 is the coefficient of determination; a patient's weight explains 67%
 of the variation in drug concentrations
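To show how the reported model would be used, the sketch below applies it to a hypothetical 70-kg patient; the weight is made up for illustration.

```python
# Minimal sketch: applying the reported model Y = 12.6 + 0.25X
# (the 70-kg weight is a hypothetical example).
def predicted_concentration(weight_kg: float) -> float:
    """Predicted serum drug concentration (mg/dL) for a given weight (kg)."""
    return 12.6 + 0.25 * weight_kg

print(predicted_concentration(70))    # 12.6 + 0.25 * 70 = 30.1 mg/dL
```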

Multiple Linear Regression Analysis

Multiple linear regression analysis is similar to simple linear regression analysis, although it can't be graphed because it predicts the unknown value of a variable from two or more variables with known values. The presence of two or more predictors adds a few more steps in the model-building process.
Below is an example of a multiple linear regression model with four variables, X1 through X4. The number before each variable is the regression coefficient or beta weight that indicates how much the value of Y will change for each unit change in that X.
Y = 12.6 + 0.25X1 + 13X2 - 2X3 + 0.9X4
The first step is to determine the relationship between each of the predictor variables and the outcome variable, one at a time. This analysis is called an "unadjusted" analysis because it doesn't involve a second predictor variable. (The analysis is also sometimes called a "univariate analysis," because only one possible predictor is considered at a time, or a "bivariate analysis," because one predictor variable is being compared with one outcome at a time, which is, of course, two variables. Each of these three terms is correct, and you will see all three in the literature.)
Individual predictor variables that are significantly related to the outcome variable are called "candidate variables" because they will be considered for inclusion in the final model. Often, the threshold of statistical significance will be set higher than the typical 0.05, such as 0.2, to make sure that all predictors that might be even remotely related to the outcome will be considered.
Once the candidate variables have been identified, they should be evaluated for "collinearity" and "interaction." Collinear variables are highly correlated, such that they add the same information to the model. Height and stride length (the distance a person moves with each step) are highly correlated, for example, so only one needs to be included in a model.
Two variables are said to interact if, in combination, they produce an effect that is greater than the effect of each variable individually. For example, taking barbiturates and drinking alcohol at the same time can be fatal, even if the dose of each drug by itself is not. In this case, an "interaction term" that models the interaction between the two variables is created and added to the model.
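A quick way to screen candidate predictors for collinearity is to examine their pairwise correlations, and an interaction term is commonly created as the product of the two interacting variables. The sketch below assumes made-up data and hypothetical variable names (height, stride, dose).

```python
# Minimal sketch: screening for collinearity and creating an interaction term
# (made-up data; the variable names are hypothetical).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
height = rng.normal(170, 10, size=100)
stride = 0.4 * height + rng.normal(0, 2, size=100)   # deliberately collinear with height
dose = rng.uniform(0, 10, size=100)

df = pd.DataFrame({"height": height, "stride": stride, "dose": dose})
print(df.corr())                      # off-diagonal values near +1 or -1 flag collinear pairs

df["height_x_dose"] = df["height"] * df["dose"]      # a simple interaction term
```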
When all the candidate variables have been identified, collinear variables have been eliminated, and interaction terms have been added, the variables are put into a "variable selection process," in which they are combined in various ways to create several regression models. Each model has a "coefficient of multiple determination," R2, which is the same as the coefficient of determination, r2, in simple linear regression, except that it applies to multiple regression. The model with the largest R2 will predict the outcome best and is chosen as the final model.
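The variable-selection step can be illustrated by fitting a few candidate models and comparing their values of R2, as in the sketch below; the data and the combinations of predictors are made up.

```python
# Minimal sketch: comparing candidate multiple regression models by R2
# (made-up data; the predictors and model combinations are hypothetical).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 5 + 2 * x1 + 0.5 * x2 + rng.normal(0, 1, size=n)

candidate_models = {
    "x1 only": np.column_stack([x1]),
    "x1 + x2": np.column_stack([x1, x2]),
    "x1 + x2 + x3": np.column_stack([x1, x2, x3]),
}

for name, X in candidate_models.items():
    r2 = LinearRegression().fit(X, y).score(X, y)    # R2, the coefficient of multiple determination
    print(f"{name}: R2 = {r2:.3f}")
```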

An example of a correctly reported multiple linear regression analysis is shown in the Table.
Table A Table for Reporting a Multiple Linear Regression Model with Three Explanatory Variables
