Chi Square for Goodness of Fit

Let's assume that the two highest-paid people in our sales team have earned their position by being top performers for several years, while two people who are new to the team get the lowest pay, since they lack previous experience and probably sell less than the seniors. Or do they?

  sort(salary)
 [1] 15000 15000 18464 19658 20495 21914 22061 22423 23335 23552

The sort() function returns the sorted values themselves, which does not help here; the order() function, however, returns the indices that would sort the vector, and we can use it to arrange the sales figures by salary.

  salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
  sales
 [1] 20 17 24 19 24 24 21 29 13  9
  order(salary)
 [1]  9 10  3  5  1  8  2  6  4  7
  sales[order(salary)]
 [1] 13  9 24 24 20 29 17 24 19 21

When sorted in this fashion, the data show that the two lowest-paid (junior) salespeople together sold noticeably less than the two highest-paid (senior) ones:

  z <- sales[order(salary)]
  jun <- z[1]+z[2]
  sen <- z[9]+z[10]
  jun
[1] 22
  sen
[1] 40
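For readers following along outside R, the same ordering and totals can be reproduced in Python (a sketch using 0-based indices; the variable names mirror the R session above):

```python
salary = [20495, 22061, 18464, 23335, 19658, 22423, 23552, 21914, 15000, 15000]
sales = [20, 17, 24, 19, 24, 24, 21, 29, 13, 9]

# Indices that would sort salary, like R's order(salary) but 0-based.
idx = sorted(range(len(salary)), key=lambda i: salary[i])
z = [sales[i] for i in idx]   # sales ordered from lowest to highest salary

jun = z[0] + z[1]             # combined sales of the two lowest-paid
sen = z[-2] + z[-1]           # combined sales of the two highest-paid
print(z)                      # [13, 9, 24, 24, 20, 29, 17, 24, 19, 21]
print(jun, sen)               # 22 40
```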

On the other hand, the difference is not really huge. One could argue that there is clearly an element of random chance in sales, and that this particular result is just a coincidence - it does not signify greater sales talent on the part of the seniors.

One way to tackle this question is to apply chisq.test() to one-dimensional count data; in this case it performs a goodness-of-fit test.

The Chi-square test is used here to test whether a sample of data came from a population with a specific distribution.

    c(jun, sen)
[1] 22 40

  chisq.test(c(jun,sen))

	Chi-squared test for given probabilities

data:  c(jun, sen) 
X-squared = 5.2258, df = 1, p-value = 0.02225

The p-value is the probability of obtaining a test statistic 'at least as extreme' as the one that was actually observed, assuming that the null hypothesis is true (in this case, that the population probabilities are equal).
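For one degree of freedom this probability can be computed directly: a chi-square variable with 1 df is the square of a standard normal Z, so P(X² >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x/2)). A quick check of R's result in Python (a sketch; only the standard math module is used):

```python
from math import erfc, sqrt

def chi2_sf_df1(x):
    """Upper-tail probability of the chi-square distribution with 1 degree
    of freedom: P(chi2_1 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2))."""
    return erfc(sqrt(x / 2.0))

# The statistic from the sales example: X-squared = 5.2258, df = 1.
print(chi2_sf_df1(5.2258))  # close to R's p-value = 0.02225
```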

The significance level is used to arrive at a decision: if the p-value is less than or equal to an (arbitrary!) significance level α, the null hypothesis is rejected and the outcome is said to be statistically significant at level α. The level α itself is the probability of making a type I error, i.e. of rejecting the null hypothesis when it is in fact true.

Traditionally, either the α = 0.05 level (5% level) or the α = 0.01 level (1% level) have been used. Obviously, α = 0.01 is much more conservative than α = 0.05.

The choice of α is crucial in the above example. The p-value = 0.02225 means that the result is statistically significant at the α = 0.05 level (we reject the hypothesis of equal population probabilities), but not at the stricter α = 0.01 level.

The Chi-square test uses the following assumptions and definitions:

- The data are counts of independent observations falling into k mutually exclusive categories (here k = 2: junior and senior sales).
- Under the null hypothesis H0 each category i has a specified probability p_i; with n observations the expected count is E_i = n * p_i. Equal probabilities give E = 62/2 = 31 per group in our example.
- The test statistic is X² = Σ (O_i - E_i)² / E_i, where O_i is the observed count in category i.
- The approximation is considered reliable only if every expected count is at least 5.

Some interesting properties:

- Under H0 the statistic approximately follows a chi-square distribution with k - 1 degrees of freedom.
- A chi-square distribution with df degrees of freedom arises as the sum of df squared independent standard normal variables; it has mean df and variance 2 df.

To understand the computation done by R and the reasoning behind the procedure, we will now do the test 'by hand', as was done before high computing power became generally available, when computing the p-value itself was infeasible.
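The by-hand steps can be sketched as follows: compute the statistic from the observed and expected counts, then compare it against tabulated critical values (3.841 and 6.635 are the standard chi-square table entries for df = 1 at α = 0.05 and α = 0.01):

```python
# Goodness-of-fit "by hand" for the observed counts [22, 40].
# Under H0 both groups are equally likely to make a sale, so each
# expected count is 62 / 2 = 31.
observed = [22, 40]
expected = sum(observed) / len(observed)                      # 31.0
stat = sum((o - expected) ** 2 / expected for o in observed)
print(stat)                                                   # 5.2258...

# Compare against chi-square critical values for df = 1 from a printed table.
print(stat > 3.841)   # True:  significant at alpha = 0.05
print(stat > 6.635)   # False: not significant at alpha = 0.01
```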

If we reject H0 in the above procedure, the actual p-value remains unknown. It can of course be calculated by hand, if only approximately, but doing so is very time-consuming. However, we already know that it is not greater than α, and with the help of the table we can narrow it down further: by looking up the next critical value for the given degrees of freedom, we can determine that in the example above the p-value is smaller than 0.05 but greater than 0.01.

To further motivate the discussion, here is a Python simulation that tosses a fair coin repeatedly and counts the cases where the test statistic comes out at least as large as the value given on the command line, along with some output from this program:
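The chisim.py source is not reproduced in this text; a minimal sketch of such a simulation, assuming the argument order sides, tosses, threshold, trials used in the run below, could look like this:

```python
import random
import sys

def chisq_stat(counts):
    """Pearson chi-square statistic for observed counts against
    equal expected counts in every category."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)

def simulate(sides, tosses, threshold, trials, rng=random):
    """Toss a fair coin with `sides` sides `tosses` times per trial and
    return the fraction of trials whose statistic is >= threshold."""
    hits = 0
    for _ in range(trials):
        counts = [0] * sides
        for _ in range(tosses):
            counts[rng.randrange(sides)] += 1
        if chisq_stat(counts) >= threshold:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # Defaults mirror the example run: chisim.py 2 62 5.2258 10000
    args = sys.argv[1:] or ["2", "62", "5.2258", "10000"]
    sides, tosses = int(args[0]), int(args[1])
    threshold, trials = float(args[2]), int(args[3])
    print("check:", [22, 40], chisq_stat([22, 40]))
    print("chisq test stat >=", threshold, ":",
          simulate(sides, tosses, threshold, trials))
```

Note that the Monte Carlo estimate reflects the exact discrete distribution of the coin tosses, so it can differ slightly from the chi-square approximation used by chisq.test().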

xmdimrill:% chisim.py 2 62 5.2258 10000
check: [22, 40] 5.22580645161
[37, 25] 2.3226
[33, 29] 0.2581
[24, 38] 3.1613
[40, 22] 5.2258
[28, 34] 0.5806
[29, 33] 0.2581
[34, 28] 0.5806
[30, 32] 0.0645
[31, 31] 0.0000
[31, 31] 0.0000
[33, 29] 0.2581
[32, 30] 0.0645
[33, 29] 0.2581
[27, 35] 1.0323
[32, 30] 0.0645
[29, 33] 0.2581
[27, 35] 1.0323
[27, 35] 1.0323
[27, 35] 1.0323
observed avg: [31.045000000000002, 30.954999999999998]
chisq test stat >= 5.2258 : 0.0289

The simulation works with fair 'coins' of any number of sides; here it is the standard two-sided coin.

Another question that may arise in the context of the sales analysis is: How close was the result? Was it maybe a close shave for the seniors?

With just 3 more sales for the juniors the result would have been:

    c(jun+3,sen)
[1] 25 40

  chisq.test(c(jun+3,sen))

	Chi-squared test for given probabilities

data:  c(jun + 3, sen) 
X-squared = 3.4615, df = 1, p-value = 0.06281

This p-value is higher than 0.05, i.e. in this case we cannot reject the hypothesis that the population probabilities are equal, unless we assume the rather questionable significance level of α = 0.1.

Note that even 0.05 is not a very strict significance level - it corresponds to a 1-in-20 chance of a type I error.

In the example above, with 3 more sales for the seniors the result would be:

  c(jun,sen+3)
[1] 22 43

  chisq.test(c(jun,sen+3))

	Chi-squared test for given probabilities

data:  c(jun, sen + 3) 
X-squared = 6.7846, df = 1, p-value = 0.009195

Here, the p-value is below α = 0.01, and therefore the claim that the population probabilities are not equal would stand on much safer ground.
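As a cross-check, all three scenarios can be recomputed in one place using the df = 1 closed form (a sketch; since a chi-square variable with 1 df is a squared standard normal, the tail probability is erfc(sqrt(x/2)), which matches the X-squared and p-values reported by R above):

```python
from math import erfc, sqrt

def gof_df1(observed):
    """Return (statistic, p-value) of the equal-probability goodness-of-fit
    test for two counts; for df = 1, P(chi2 >= x) = erfc(sqrt(x / 2))."""
    expected = sum(observed) / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, erfc(sqrt(stat / 2.0))

# Original data, juniors +3 sales, seniors +3 sales.
for counts in ([22, 40], [25, 40], [22, 43]):
    stat, p = gof_df1(counts)
    print(counts, round(stat, 4), round(p, 5))
```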