Approximating the Normal Distribution

Let's get 50 random values from N(μ = 10, σ = 2) i.e. the normal distribution with a mean of 10 and a standard deviation of 2:

  x <- rnorm(50, 10, 2)
  x

 [1]  7.155052 12.317898  7.272548  7.961436 12.168885 12.131066 13.251959
 [8] 11.390640  9.092544 12.335076 14.215988 10.380127 13.132298  8.148629
[15]  6.530989 11.305005  9.879174  8.755643  9.028988  7.311430  7.793305
[22]  8.964536  8.064233  7.974617 12.104212  9.179521 11.352597  8.852652
[29]  7.939722 10.879661  6.906249 10.068695  9.962400 11.973810  8.428297
[36]  8.068440  8.733865 12.593126  5.044776  6.379441  6.922262  6.456940
[43] 11.778653 10.394324 10.886846 11.258624 11.845888  8.691037 10.054030
[50] 12.782038

  mean(x)
[1] 9.722003
  sd(x)
[1] 2.199985
 
  png('hist-rnorm.png')
  hist(x)
  dev.off()

An option to view the plot even if you work remotely (i.e. the computer running R is not your workstation) is to put everything into the www/ directory, so you can view files with your web browser, including JPG images; assuming of course that there is a web server running on the remote host. Enter the URL to your home page into the location bar, and add the filename, usually with this pattern:

http://servername/~userid/hist-rnorm.png

With 500 values the distribution resembles the normal distribution more closely:

  x <- rnorm(500, 10, 2)
  png('hist-rnorm500.png')
  hist(x, breaks=21)
  dev.off()

Many statistical methods make assumptions about the data analysed, and one of the most commonly encountered assumptions is that the data should be approximately normally distributed. To explore that concept let's cast a fair six-sided dice a number of times.

In R we can use the sample() function to draw with equal probability from a given list of values, such as the numbers from 1 to 6:

  x <- sample(1:6, 1000, replace = TRUE)
  x
   [1] 6 1 6 4 3 3 2 4 4 1 6 3 1 1 3 1 4 5 3 2 2 2 4 6 5 1 6 6 5 2 2 6 5 2 3 4 1
  [38] 6 4 6 4 4 1 6 6 1 1 5 3 5 3 2 4 1 6 2 2 5 4 4 4 3 4 3 1 6 3 3 3 2 3 1 4 5
  [75] 4 5 6 4 5 4 6 5 4 6 2 4 1 4 5 4 1 6 6 2 1 4 5 6 1 1 6 3 6 4 3 2 2 4 3 6 2
 [112] 3 4 1 3 5 2 5 3 3 4 5 4 4 2 1 2 4 1 5 2 2 6 3 3 1 2 1 2 5 6 1 3 6 5 6 5 4
 ...

The hist() function is oriented towards continuous data; for discrete values the combination of plot() and table() produces better results:

  png('dice.png')
  plot(table(x), type="h", lwd=10, ylab="Freq")
  dev.off()

The result shows an almost uniform distribution, as we would expect from a fair dice.

What happens if we cast two dice, sum their values, and observe the frequencies of the sums?

To do this in R, we create a sample twice as large, form a matrix with two rows, and compute the column sums:

    y <- sample(1:6, 2000, replace = TRUE) 
    m <- matrix(y, 2, 1000)
    s <- colSums(m)
  s
   [1]  8  9  9  8  8  8  4  2  6 10  7  7 11  5 10  8  6 11  6 10  8  7  8  9
  [25]  4  9  9 11  5 11  8  6  7  4  7 10  6  7  7  9  6  7  9  7  8  7  4  3
  [49]  7  6  9  6 10  5  5  4  8  7  2  6  7  8  5  7  5  6  3 11  8  4  7 11
  [73]  8  7  3  6  8  3  6  8  5 10 10  9  4  7  8  7  7  2  8  6 12  9  7 10
  [97] 10  9 11  8  7  8  8  8 10  9  5  2  8  4  7  7 11  9  4  8  6  8  4 10
 [121]  2 10  8  7  2  5  6  5  9  8  5  7  8 10 11  7  5  5  6  5  9  8  4  7
 ...

Note that when we trow two dice the sum of 2 (and 12) occurs less frequently than sums around 7.

Therefore, the sums do not show a uniform distribution in the histogram:

  png('sums.png')
  plot(table(s), type="h", lwd=10, ylab="Freq")
  dev.off()

What will happen if we increase the number of dice? Let's define a function to simplify our work:

dice <- function(ndice, ncast, fn) {
  png(fn)
  x <- sample(1:6, ncast * ndice, replace = TRUE)
  m <- matrix(x, ndice, ncast)
  s <- colSums(m)
  plot(table(s), type="h", lwd=1, ylab="Freq")
  dev.off()
  return(1)
}

At this point it's probably a good idea to put all your R statements into a text file, so you don't have to re-type them interactively lateron. Create a plain text file with the editor of your choice, such as pico:

pico myfile.r

Put all the statements you want to be executed into that file, including function definitions like the one above. Then, have R execute that file:

R --no-save --quiet < myfile.r

Having defined our function dice() we can call it with the parameters ndice (number of dice being cast), ncast (how many times we cast), and fn (filename for the histogram). The resulting distribution of the sums shows a shape quite similar to the normal distribution:

dice(20, 10000, 'dice20.png')

When we repeatedly sum a sufficiently large number of random values, the resulting distribution of the sums is approximately normal.

Note that the distribution in the above figure is still not actually a normal distribution, for several reasons, most importantly

What happens to the sums if the original distribution is not uniform?

Consider an unfair dice, such as this one, where the probability of 6 is increased:

    sample(c(1,2,3,4,5,6,6,6), 1000, replace=TRUE)
   [1] 1 4 6 2 4 1 3 1 1 1 6 5 6 6 5 1 2 6 4 6 5 1 6 6 3 5 1 3 1 4 2 1 3 2 3 1 6
  [38] 3 5 3 6 2 5 2 4 5 5 2 6 2 6 3 6 6 6 1 5 5 5 2 2 2 6 3 6 1 3 6 6 6 4 6 1 3
  [75] 4 4 2 6 2 3 2 6 1 6 1 5 5 6 6 3 2 5 5 6 6 6 5 6 1 4 5 3 6 6 6 6 5 6 2 2 2
 [112] 5 6 6 3 6 2 6 6 2 2 3 5 4 2 6 1 1 5 5 2 6 1 6 4 6 5 3 6 6 4 1 6 3 1 6 2 1
 [149] 6 1 6 1 3 6 5 6 1 1 6 6 4 3 4 6 6 3 3 5 6 3 6 6 4 5 5 4 4 3 2 4 4 3 6 4 6
 [186] 6 6 3 6 2 6 2 6 6 2 1 1 2 6 6 6 1 6 1 2 6 6 6 6 6 6 2 6 5 6 2 6 6 3 1 2 1
 ...

This dice will produce the result 6 much more frequently than a fair dice, so the distribution of the individual dice values is far from uniform:

  png('dicemod.png')
  x <- sample(c(1,2,3,4,5,6,6,6), 1000, replace=TRUE)
  plot(table(x), type="h", lwd=10, ylab="Freq")
  dev.off()

We modify our function dice() a bit to achieve this effect:

dice <- function(lst, ndice, ncast, fn) {
  png(fn)
  x <- sample(lst, ncast * ndice, replace = TRUE)
  m <- matrix(x, ndice, ncast)
  s <- colSums(m)
  plot(table(s), type="h", lwd=1, ylab="Freq")
  dev.off()
  return(1)
}

Now we can pass the list of values as a parameter:

dice(c(1,2,3,4,5,6,6,6), 20, 10000, 'unfair.png')

When we plot the sums of 20 unfair dice over 10000 tosses the distribution has shifted to the right, but other than that the picture is still very similar to the normal distribution.

The observations in this section are a consequence of the central limit theorem. There are several aspects to this theorem; a common formulation is that a sum of many independent and identically distributed random variables with finite variance will tend to be distributed according to the normal distribution. This is happening here, and in many practical applications, although whether the approximation to the normal distribution is close enough for practical purposes depends on the application.