Print PDF
Regression is similar to correlation and the two are often confused. Although regression also seeks to define the relationship between two or more data sets, the aim is not so much to detect a correlation as to build a mathematical model that would allow the value of one or more dependent variables (response variables) to be predicted in response to an independent (cause, stimulus) variable.
The regression models that can be built using statistics range from simple univariate (one stimulus and one response) to multivariate (several responses to one stimulus) to multiple (one or more responses to more than one stimulus). Linear regression models are those where the response changes in a linear fashion. Responses to stimuli, however, may not follow a linear path but may show a curvilinear pattern or even a parabolic (U shaped) pattern. In some cases the plot of the raw stimulus and response data shows a curvilinear shape, but transformation of one or both data sets can convert the relationship to linear (e.g., log transformation) numbers.
Simple Linear Regression
In a hemodialysis unit the clinician is interested in predicting the fall in systolic BP (SBP) during the first hour of dialysis according to the applied rate of ultrafiltration (UFR). The change in SBP and the UFR are recorded during 33 dialysis sessions. The data are then plotted as a scatter chart as seen here.

Click image to enlarge
There appears to be a relationship whereby as the rate of UF increases there is an increase in the fall in SBP (more negative) although the responses to UF are quite variable. Using correlation statistics we could confirm this relationship. However, our main interest is to be able to predict the degree of fall of SBP in response to UF. By a method called least squares analysis we can plot a line through the points on the graph that best fits the data; this is something like finding a “mean” for the SBP response at each UFR. The line so derived is shown in the chart below. This line can be expressed mathematically by the formula Y = bX + a where ‘a’ is the imaginary or real value of Y when X=0 ([Y=b*0 + a] = [Y=a]). ‘b’ is the slope of the line which is the amount of change in Y for each unit of change of X; if ‘b’ is negative, Y falls for each unit rise in X; if ‘b’ is positive, Y rises for each unit increase in X.

Click image to enlarge
The line in the right chart above has the equation Y = -0.0087X + 5.82 (where Y = change in SBP mmHg and X = UFR ml/hour). This linear regression model predicts that for each increase in UFR by 100 ml/hour the SBP will fall by 0.87 mmHg (100*0.0087 = 0.87). The correlation coefficient r = 0.86 suggests that the model fits the data quite well; statistics also provides additional methods to test the “goodness of the fit” of the model which we need not explore here. Furthermore, the coefficient of association R2 = 0.74 tells us that 74% of the change in SBP is actually due to the change in UFR and 26% is due to some other effect not definable in this study.
Non-linear regression
When plotted data do not assume a linear relationship it is useful to replot the data with one or both sets of values transformed. Various transformations can be tried (e.g., using log transformation, invert values, power values, etc.) and experience usually informs which will work best. The results of the regression model can then be converted back to the normal value to derive the predictions required. However, some data are not linear and cannot be converted to linear even with transformation. An example is shown here.
%20sm.gif)
Click image to enlarge
The X values here represent the dose of an agent known to stimulate a metabolic function in cells which is indicated by the Y values. The relationship is obviously not linear. In fact, initially, increasing dose of X enhances the Y response but above a certain dose of X the Y response seems to be inhibited. It is possible to model this non-linear regression, although there are many hazards in building such models.
%20sm.gif)
Click image to enlarge
On the chart one such model is shown and, in mathematical terms, has a polynomial function. The R2 value for the fitted regression suggests a good fit but this would have to be stringently tested by appropriate statistical methods.
Logistic regression
How can we model regression when the response variable is categorical (yes/no: on/off) rather than continuous – for example death in response to categorical or continuous independent variables (also known as covariates). Does being diabetic (yes/no) affect the mortality of dialysis patients? Does the dose of delivered dialysis (Kt/V - a continuous numerical variable) affect death rate? It would be possible to measure death rates in diabetic and non-diabetic patients and compare the results to indicate an effect, but such a method would not permit the meaningful prediction of the effect of diabetes on death.
To develop a statistical model that enables the quantifiable prediction of the effect of a variable on a categorical event, the basic statistic is called Logistic Regression. The math is more complex and need not be developed here. The output of logistic regression models are usually expressed as the odds ratio that the categorical event will occur according to the value of the dependent variable (if a continuous variable with a numeric scale) or whether the dependent categorical covariate is present. An example from the results of a logistic regression analysis is shown.

Click image to enlarge
Diaz-Buxo and colleagues analyzed the impact of a number of demographic factors and laboratory results on the death of PD patients in the USA. Only three covariates are shown here – age, patient sex and diabetes. Age is a continuous variable measured in years. Sex is a categorical variable, male or female, and diabetes is a categorical variable. Note the way the data are presented: each data point is a horizontal bar with a vertical line extending above and below the data point. This looks like the mean and SD results shown in Figure 3 but is not. The vertical bar represents what is known as the confidence interval for the odds ratio.
If the odds ratio (OR) is greater than 1 the effect of an increased value of the variable or the presence of the covariate is to increase the probability of the event (death in this case); if less than 1 the effect is to reduce the event probability. The nearer the OR to the value 1 the less the effect and an OR of 1 indicates no effect. The statistic reports the OR and a confidence interval (CI) for the OR. For example, for diabetes in the above example the OR is 1.44 and the 95% confidence interval is 1.09 to 1.9; what is meant by the confidence interval? The interpretation is as follows; the odds ratio of death for diabetics compared to non-diabetics is 1.44 and there is 95% certainty that the OR lies somewhere between 1.09 and 1.9. It should be apparent that if the 95% CI includes the value 1 then there is a chance (at least 5%) that there is no effect. We can therefore state that if the CI for the OR includes the value 1 the detected effect is not significant.
The figure above shows that for each 1-year additional age the OR of death is 1.043. The confidence interval (1.032 to 1.055) does not include 1; the vertical bar does not cross the line for the value 1. Another way of expressing this would be to state that for each additional 1 year of age the odds of dying are increased by 4.3% (1.043 = 1 + 0.043: 0.043 = 4.3%). Females (compared to males) have a 25% reduced odds of dying; note, however, that the upper part of the CI bar comes very close to 1 (actual CI = 0.56 to 0.99) so that the significance of this OR is lower. On the other hand, having diabetes increases the odds of dying by 44% (OR = 1.44) and the vertical bar of the CI is distant from 1.
Logistic regression also builds a predictive model for the event incorporating those variables and covariates that have a statistical significance. The model equation has the basic form Y = bx + a that we saw earlier for linear regression. In the example from Figure 8 there are three variables so that the model would actually look like this: Y = b1x1 + b2x2 + b3x3 + a. With the values for the three coefficients, b1 to 3, we could predict the combined effect of the three variables on death odds. Moreover, the statistical method would provide measures of goodness-of-fit for the model.
Back to Basic Statistics Next