Approximating the Normal Distribution

Many statistical methods assume approximately normal input data. In this section we will explore the meaning of this assumption for practical applications.

In [22]:
%reload_ext rpy2.ipython

We start with 50 random values from $N(\mu = 10, \sigma = 2)$, i.e. the normal distribution with a mean of 10 and a standard deviation of 2. The normal distribution is also known as the Gaussian distribution, and the shape of its density curve is often called the bell curve.

In [23]:
%%R
options(width=60)
x <- rnorm(50, 10, 2)
x
 [1]  9.302138 10.846435 10.042449  7.303005 11.752191
 [6]  6.744523 11.549143  9.588949  8.601280  7.029851
[11] 10.261525 10.875616  6.759858 10.623743 10.450458
[16] 11.746444 10.971672 10.038386 10.020495 10.630805
[21] 11.249859 11.246718 11.967637 12.998127 10.546084
[26] 12.549510 12.347310  8.779948  8.516172 10.894589
[31] 12.424405  8.228309  7.747116 10.078363 15.927236
[36]  9.288054  6.827834  7.891935 11.623930 11.233324
[41] 15.584176 11.879297  8.764259  8.840870 10.602366
[46]  9.078515 10.074753 10.492195  9.151170  5.500051

Of course, the sample mean and standard deviation differ somewhat from the parameters of the distribution we sampled from.

In [24]:
%%R
mean(x)
[1] 10.14946
In [25]:
%%R
sd(x)
[1] 2.079491

The histogram shows that the values are at best very roughly normally distributed; a bell shape is hard to recognize.

In [26]:
%%R
hist(x)

When using this code in an R script, put png('hist-rnorm.png') before the plot command and dev.off() after it in order to write the plot to an image file. The image can then be included in an HTML page and viewed remotely via the web by putting everything into the www or public_html directory.
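As a minimal sketch, a standalone script version of the histogram above might look like this (the filename is just an example):

```r
# Write the histogram to a PNG file instead of the screen;
# the filename is illustrative.
x <- rnorm(50, 10, 2)
png("hist-rnorm.png")   # open the PNG graphics device
hist(x)                 # the plot now goes to the file
dev.off()               # close the device and write the file
```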

With 500 values the resemblance to the bell shape of the normal distribution becomes more apparent:

In [27]:
%%R
x <- rnorm(500, 10, 2)
hist(x, breaks=21)

The normal distribution is a theoretical distribution that is rarely found in nature; however, under certain conditions real-world data can be sufficiently close, i.e. approximately normally distributed. To explore that idea, let's cast a fair six-sided die a number of times.

In R we can use the sample() function to draw with equal probability from a given list of values, such as the numbers from 1 to 6. We only print the first few values.

In [28]:
%%R
x <- sample(1:6, 500, replace = TRUE)
x[1:50]
 [1] 2 1 5 4 5 2 6 2 1 5 1 4 6 6 6 3 5 6 3 1 6 3 3 5 4 2 2 5
[29] 3 4 2 2 4 2 2 6 5 5 2 4 2 4 1 5 4 5 5 5 1 4

The hist() function is oriented towards continuous data; for discrete values the combination of plot() and table() produces better results.

In [29]:
%%R
plot(table(x), type="h", lwd=10, ylab="Freq")

The result shows a roughly uniform distribution, as we would expect from a fair die. What happens if we cast two dice, sum their values, and observe the frequencies of the sums?

To do this in R, we create a sample twice as large, form a matrix with two rows, and compute the column sums:

In [30]:
%%R
y <- sample(1:6, 2000, replace = TRUE) 
m <- matrix(y, 2, 1000)
s <- colSums(m)
s[1:50]
 [1]  7  2  6 10  9  4  5  8  5 10 10  4  7 10  3  8  9  7
[19]  6  4  7  5  7  7  9  6 11 10  8 12  6  9 11 12  5 11
[37]  8  5  7  7  8  7  8  5  7  7  4  7 11  8

Note that when we throw two dice a sum of 2 (or 12) occurs less frequently than sums around 7. Therefore, the sums do not show a uniform distribution in the histogram.

In [31]:
%%R
plot(table(s), type="h", lwd=10, ylab="Freq")
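The exact probabilities confirm the picture. As a quick sketch, we can enumerate all 36 equally likely outcomes of two fair dice with outer() and tabulate the sums:

```r
# All 36 equally likely outcomes of two fair dice
sums <- outer(1:6, 1:6, "+")
# How many ways each sum can be formed
table(sums)
# A sum of 7 can be formed in 6 ways (probability 6/36),
# while 2 and 12 can each be formed in only 1 way (1/36).
```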

What will happen if we increase the number of dice and casts?

In [32]:
%%R
ndice <- 20
ncast <- 10000
x <- sample(1:6, ncast * ndice, replace = TRUE)
m <- matrix(x, ndice, ncast)
s <- colSums(m)
plot(table(s), type="h", lwd=1, ylab="Freq")

When we sum a sufficiently large number of random values, the resulting distribution of the sums is approximately normal.

Note that the distribution in the above figure is still not actually a normal distribution, for several reasons, most importantly

  • the normal distribution is defined on continuous values, while the sums of dice are discrete values
  • in the normal distribution there is a non-zero probability for any value x with $-\infty < x < \infty$, while the sum of n six-sided dice can of course never be less than $n$ or greater than $6n$
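Still, the approximation is close. A single fair die has mean 3.5 and variance 35/12, so the sum of 20 independent dice has mean $20 \cdot 3.5 = 70$ and standard deviation $\sqrt{20 \cdot 35/12} \approx 7.64$. As a sketch, we can overlay the corresponding normal density, scaled by the number of casts, on the observed frequencies:

```r
ndice <- 20
ncast <- 10000
s <- colSums(matrix(sample(1:6, ncast * ndice, replace = TRUE), ndice, ncast))
plot(table(s), type = "h", lwd = 1, ylab = "Freq")
# Theoretical mean and sd of the sum of 20 fair dice
mu <- ndice * 3.5
sigma <- sqrt(ndice * 35 / 12)
# Overlay the matching normal density, scaled to frequencies
curve(ncast * dnorm(x, mu, sigma), add = TRUE, col = "red")
```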

What happens to the sums if the original distribution is not uniform?

Consider an unfair die, such as this one, where the probability of a 6 is increased:

In [33]:
%%R
sample(c(1,2,3,4,5,6,6,6), 100, replace=TRUE)
  [1] 1 6 6 6 4 6 6 3 6 4 6 1 2 6 6 2 4 3 5 5 6 2 5 2 1 4 2
 [28] 3 6 2 5 5 6 6 6 6 4 6 6 6 6 4 1 2 6 3 6 2 3 6 3 2 6 3
 [55] 4 6 3 1 3 6 5 5 6 1 6 1 6 1 6 1 4 4 4 1 1 1 6 6 6 2 6
 [82] 6 4 5 6 6 6 6 6 1 6 6 3 2 1 6 6 2 4 3

This die produces the result 6 much more frequently than a fair die, so the distribution of the individual die values is far from uniform.

In [34]:
%%R
x <- sample(c(1,2,3,4,5,6,6,6), 1000, replace=TRUE)
plot(table(x), type="h", lwd=10, ylab="Freq")

We repeat our experiment, this time drawing from the modified sample list:

In [35]:
%%R
lst <- c(1,2,3,4,5,6,6,6) 
ndice <- 20 
ncast <- 10000
x <- sample(lst, ncast * ndice, replace = TRUE)
m <- matrix(x, ndice, ncast)
s <- colSums(m)
plot(table(s), type="h", lwd=1, ylab="Freq")

When we plot the sums of 20 unfair dice over 10000 casts, the distribution has shifted to the right, but otherwise the picture again shows the familiar bell shape.
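The size of the shift matches the change in the expected value per die, which we can check directly:

```r
# Expected value of one die: fair vs. loaded
fair   <- mean(1:6)                        # 3.5
loaded <- mean(c(1, 2, 3, 4, 5, 6, 6, 6))  # 4.125
# Expected sums of 20 dice: 70 vs. 82.5
c(20 * fair, 20 * loaded)
```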

The observations in this section are a consequence of the central limit theorem. There are several aspects to this theorem; a common formulation is:

The sum of many independent and identically distributed random variables with finite variance will tend to be distributed according to the normal distribution.

This is happening in many practical situations, although whether the approximation to the normal distribution is actually close enough for reliable results from statistical tests and other methods depends on the application.

From a practical point of view, we would like to have at least some ballpark figure: how many values do we need to sum in order to achieve "approximately normal", regardless of the distribution of the individual values? For the next sections and the examples in this text, as well as for many situations involving real-world data, the short answer $N \ge 20$ will serve as a very rough guide.
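As a rough illustration of this guideline, we can compare sums of 2 and of 20 loaded dice in normal quantile plots; the closer the points follow a straight line, the closer the sums are to a normal distribution. The helper sums_of() below is just for this sketch:

```r
lst <- c(1, 2, 3, 4, 5, 6, 6, 6)
ncast <- 10000
# sums of n loaded dice, repeated ncast times
sums_of <- function(n)
  colSums(matrix(sample(lst, ncast * n, replace = TRUE), n, ncast))
par(mfrow = c(1, 2))
qqnorm(sums_of(2),  main = "Sums of 2 dice")
qqnorm(sums_of(20), main = "Sums of 20 dice")
```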
