To discuss this topic, let us use an example in which one hundred hypertensive patients are prescribed an antihypertensive agent. To assess the efficacy of the treatment, blood pressure (BP) recordings are taken before and after treatment. One way of reporting the effectiveness of the treatment would be to list every subject's BP before and after treatment in a table, but an observer would find it extremely difficult to judge from such a table whether the treatment had been successful. Summing all the BP values before treatment and dividing by the number of observations provides an average value for the pre-treatment BP; the same can be done for the post-treatment values. Instead of long lists we then have only two values to compare, and we can assess whether the difference between the pre- and post-treatment averages is meaningful.
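As a minimal sketch of this calculation, the Python snippet below averages a handful of before and after readings and reports the difference. The readings are hypothetical values invented for illustration (five patients rather than one hundred), and the variable names are assumptions, not part of the original example.

```python
from statistics import mean

# Hypothetical systolic BP readings (mmHg) for five patients; illustrative only.
bp_before = [162, 155, 170, 148, 165]
bp_after = [140, 138, 151, 135, 146]

# The average is the sum of the values divided by the number of observations.
mean_before = mean(bp_before)
mean_after = mean(bp_after)

# A negative change means the average BP fell after treatment.
mean_change = mean_after - mean_before

print(f"Mean before treatment: {mean_before:.1f} mmHg")
print(f"Mean after treatment:  {mean_after:.1f} mmHg")
print(f"Change in mean:        {mean_change:+.1f} mmHg")
```

Whether a change of this size is meaningful is a separate question that the averages alone cannot answer.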
Mean and Median
Figure 1. The distribution of body weights of over 700 dialysis patients (bars) and the cumulative percentage frequency (line)
Figure 1 shows a histogram of the distribution of body weights of over 700 dialysis patients (bars) and the cumulative percentage frequency (line). A line drawn through the tops of the bars would produce a "bell-shaped" curve; data with this sort of distribution are said to be normally distributed. We can calculate the mean (the sum of all values divided by the number of observations), which gives 56.4 kg. To find the median we determine the weight that divides the population into two equal halves, as shown by the arrow dropping from the cumulative frequency line; the value is 55.2 kg. Typically, for normally distributed data such as these, the median and mean are very similar.
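The two measures can be sketched in a few lines of Python. The weights below are hypothetical and roughly symmetric; they are not the Figure 1 data, and the variable names are assumptions made for this illustration.

```python
from statistics import mean, median

# Hypothetical body weights in kg (illustrative only, not the Figure 1 data).
weights_kg = [48.0, 51.5, 53.0, 55.0, 56.5, 58.0, 60.5, 63.0]

# Mean: sum of all values divided by the number of observations.
mean_weight = mean(weights_kg)

# Median: the value that splits the sorted data into two equal halves.
# With an even number of observations it is the average of the middle two.
median_weight = median(weights_kg)

# Check that the median really divides the data into equal halves.
below = sum(w < median_weight for w in weights_kg)
above = sum(w > median_weight for w in weights_kg)

print(f"Mean:   {mean_weight:.2f} kg")
print(f"Median: {median_weight:.2f} kg")
print(f"Values below median: {below}, above median: {above}")
```

For roughly symmetric data like these, the two values come out close together, in line with the observation above.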
Figure 2. The population from above including a small number of patients with high body weights.
Dispersion about the center
Although measures of center provide useful summary information about a data set, they tell us nothing about how the data are dispersed. One way of indicating data dispersion is to report the range of the data values, that is, the lowest and the highest. In Figure 1 the weights range from 19.94 to 94.8 kg, so the data set could be summarized by reporting the median and the range as 55.2 kg (range: 19.94-94.8). For the data set in Figure 2 the report would be 55.8 kg (range: 19.94-180).
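A short Python sketch of this style of summary follows. The weights are hypothetical, and the output format simply mimics the "median (range: lowest-highest)" convention used above.

```python
from statistics import median

# Hypothetical body weights in kg (illustrative only, not the Figure 1 or 2 data).
weights_kg = [48.0, 51.5, 53.0, 55.0, 56.5, 58.0, 60.5, 63.0]

# The range is reported as the lowest and highest observed values.
lowest = min(weights_kg)
highest = max(weights_kg)
median_weight = median(weights_kg)

summary = f"{median_weight:.1f} kg (range: {lowest:.1f}-{highest:.1f})"
print(summary)
```

Because the range depends only on the two most extreme observations, a single unusually heavy or light patient can change it substantially.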
Other measures of dispersion are the variance and the standard deviation (SD = √variance). The variance and SD are calculated from standard statistical formulas. To illustrate this we will use two sets of data as shown in the table.
The data sets are quite similar except that Set 2 includes two more extreme values. We can see the effect of these differences by comparing the mean, variance and standard deviation of the two sets:
For Set 1
Mean = 5.578947
Variance = 7.479532
Standard deviation = ±2.734873
And for Set 2
Mean = 8.315789
Variance = 62.00585
Standard deviation = ±7.874379
Note the ± sign before the SD; this indicates the dispersion of values above and below the mean.
The much larger variance and standard deviation of Set 2 show that its data are more widely dispersed around the mean than those of Set 1. As we will see later, the variance is very important when comparing the means of two or more data sets. The means and SDs are shown in graphical form in Figure 3.
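The following Python sketch shows how these quantities can be computed. The two data sets are hypothetical stand-ins (the second differs from the first only in its last two, more extreme, values) and are not the values from the table; the helper `summarize` is likewise an assumption made for illustration. The sketch uses the sample variance, which divides the sum of squared deviations by n - 1; the population variance, which divides by n, is shown alongside for comparison.

```python
import math
from statistics import mean, pvariance, stdev, variance

# Hypothetical data sets (illustrative only, not the values from the table).
set_1 = [2, 4, 4, 5, 6, 7, 7, 9]
set_2 = [2, 4, 4, 5, 6, 7, 25, 30]  # same as set_1 except for two extreme values


def summarize(data):
    """Return the mean, sample variance, SD and population variance of data."""
    return {
        "mean": mean(data),
        "variance": variance(data),        # sum of squared deviations / (n - 1)
        "sd": stdev(data),                 # square root of the sample variance
        "pop_variance": pvariance(data),   # sum of squared deviations / n
    }


for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    s = summarize(data)
    # The SD is the square root of the variance computed with the same divisor.
    assert math.isclose(s["sd"] ** 2, s["variance"])
    print(f"{name}: mean = {s['mean']:.3f}, variance = {s['variance']:.3f}, "
          f"SD = ±{s['sd']:.3f}, population variance = {s['pop_variance']:.3f}")
```

When reporting a variance and an SD together, it helps to compute both with the same divisor so that the SD is exactly the square root of the reported variance.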