%reload_ext rpy2.ipython
Let's assume that the two highest-paid people on our sales team have earned their position by being top performers for several years, while two people who are new to the team get the lowest pay since they lack previous experience and probably sell less than the seniors. Or do they?
%%R
sales <- c(20, 17, 24, 19, 24, 24, 21, 29, 13, 9)
pay <- c(2050, 2210, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500)
sort(pay)
The sort() function returns the sorted values themselves and does not help here, but the order() function returns the indices that would sort the vector, and we can use it to reorder the sales by pay.
%%R
order(pay)
%%R
z <- sales[order(pay)]
z
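Since this notebook drives R through rpy2, the host language is Python; as a cross-check, the order() step can be mirrored with numpy.argsort() (numpy is an assumption of this sketch; note that Python indices are 0-based):

```python
import numpy as np

# Same data as in the R cells above
sales = np.array([20, 17, 24, 19, 24, 24, 21, 29, 13, 9])
pay = np.array([2050, 2210, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500])

# np.argsort plays the role of R's order(): it returns the indices
# that would sort `pay`. kind="stable" keeps tied pay values (the two
# 1500s) in their original order, matching R's stable ordering.
idx = np.argsort(pay, kind="stable")
z = sales[idx]   # sales reordered from lowest-paid to highest-paid
print(z)         # [13  9 24 24 20 29 17 24 19 21]
```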
We form two groups by taking the people with the lowest and highest pay:
%%R
A <- z[1] + z[2]    # combined sales of the two lowest-paid people
B <- z[9] + z[10]   # combined sales of the two highest-paid people
c(A, B)
When printed in this fashion, the data show that

- Group A: the two lowest-paid people have the lowest sales
- Group B: the two highest-paid people sell much more than group A
One could argue that there is clearly an element of random chance in sales, and this particular result is just coincidence - it does not signify a greater sales talent on the part of group B.
One way to tackle this question is to use chisq.test() with one-dimensional count data, which in this case performs a goodness-of-fit test.
The Chi-square test is used here to test if a sample of data came from a population with a specific distribution.
In other words, the chi-square test is testing the null hypothesis which states that there is no significant difference between the expected and observed result.
We are using the test in its simplest form here, testing for equal probabilities in two classes.
We are testing $H_0$: the underlying random variable takes on values corresponding to the two classes with equal probabilities $p_1 = p_2 = 0.5$.
The underlying random variable in this case is defined as whether a sale is made by group A or B. It is equivalent to flipping a fair coin.
Since there is a total number of 62 sales, we would expect values of 31 for each group. The actual values are different, and we want to compute how likely we are to see this difference (or higher) from the expected values when the probabilities are in fact equal.
%%R
c(A, B)
%%R
chisq.test(c(A, B))
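The same computation can be reproduced in the notebook's host language, Python, as a sanity check (scipy is an assumption of this sketch, not part of the original analysis). scipy.stats.chisquare defaults to equal expected frequencies, which is exactly our $H_0$; scipy.stats.binomtest adds the exact binomial version of the coin-flip model:

```python
from scipy.stats import chisquare, binomtest

A, B = 22, 40  # group totals, as computed in the R cells

# Goodness-of-fit test; expected frequencies default to the mean (31, 31)
stat, p = chisquare([A, B])
print(stat, p)   # statistic ~ 5.226, p ~ 0.0222, matching chisq.test()

# Exact counterpart: probability of a split at least this lopsided
# in 62 flips of a fair coin
p_exact = binomtest(A, n=A + B, p=0.5).pvalue
print(p_exact)
```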
The p-value is the probability of obtaining a test statistic 'at least as extreme' as the one that was actually observed, assuming that the null hypothesis is true (in this case, that the population probabilities are equal).
The significance level is used to arrive at a decision: if the p-value is less than or equal to a (somewhat arbitrary!) significance level $\alpha$, the null hypothesis is rejected and the outcome is said to be statistically significant at level $\alpha$. The level $\alpha$ is the probability of making a type I error, i.e. rejecting the null hypothesis when it is in fact true.
Various values for $\alpha$ are commonly used, often $\alpha = 0.05$ which does not put a very rigorous limit on the type I error. Obviously, a lower level like $\alpha = 0.01$ is much more conservative.
The choice of $\alpha$ is crucial in the above example. The p-value = 0.02225 means that

- with $\alpha = 0.05$ we would reject the null hypothesis that the probabilities are equal; in other words, we would state that based on our test the difference in sales is not just coincidence;
- with $\alpha = 0.01$ we would not reject the null hypothesis, i.e. the observed difference in sales could still be due to chance.
The Chi-square test uses the following assumptions and definitions:
The test works with data derived from counting independent occurrences in classes.
It tests $H_0$: the underlying random variable takes values corresponding to $k$ classes with probabilities $p_1, \dots, p_k$.
The observed counts $O_i$ for each class and the expected counts $E_i = N p_i$ are used in the test statistic:
$$T = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}$$
The distribution $\chi^2(k)$ is the distribution of a sum of the squares of k independent standard normal random variables.
The test statistic T follows $\chi^2(k-1)$ asymptotically.
Some interesting properties:
- Since the Chi-square test relies on an approximation, it should not be used if the expected count $E_i$ in any category is less than 5.
- There is no such limitation on the observed counts $O_i$.
- In many applications the $p_i$ are all identical, but they do not have to be.
- The procedure of arriving at a decision is simple enough to be done 'by hand' with the help of a Chi-square table.
In order to understand the computation done by the R package and the reasoning behind the procedure we will now do the test 'by hand', as it had been done before the general availability of high computing power, when it was infeasible to compute the p-value.
We are testing $H_0$: the underlying random variable takes on values corresponding to the two classes with equal probabilities i.e. $p_1 = p_2 = 0.5$
$H_0$ can also be stated as: there is no significant difference between expected and observed results.
Our observations are the number of sales contracts in two classes: [22,40]
The total count is $N = 22 + 40 = 62$. With equal probabilities, i.e. equal sales skill for both classes, we would expect the value $N/2 = 31$ for each class.
The value of the test statistic is often called the Chi-square value, which is somewhat confusing: it is the distribution of the test statistic that approximates the $\chi^2$ distribution, not the value itself.
%%R
(22-31)^2/31 + (40-31)^2/31
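Today the table lookup that follows can be replaced by evaluating the survival function of $\chi^2(1)$ directly; a quick Python cross-check (assuming scipy is available):

```python
from scipy.stats import chi2

# The test statistic computed 'by hand', as in the R cell above
T = (22 - 31)**2 / 31 + (40 - 31)**2 / 31   # = 162/31, about 5.226

# p-value = P(chi2(df=1) >= T), i.e. the survival function at T
p = chi2.sf(T, df=1)
print(T, p)   # ~5.2258  ~0.0222
```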
To determine the degrees of freedom of the $\chi^2$ distribution we take the number of classes minus the reductions in the degrees of freedom. In this case there is one reduction since the numbers must always add up to N. In other words, if we change the first value, then the second value follows, in order to still sum to a given N.
In the table we look up the critical value for a given significance level and degrees of freedom.
If the value of the test statistic is greater than or equal to the critical value we reject $H_0$; otherwise we fail to reject $H_0$.
Critical values of the $\chi^2$ distribution, by significance level and degrees of freedom (df):

| df | $\alpha$ = 0.5 | $\alpha$ = 0.1 | $\alpha$ = 0.05 | $\alpha$ = 0.01 | $\alpha$ = 0.005 |
|----|--------|--------|--------|--------|--------|
| 1  | 0.455  | 2.706  | 3.841  | 6.635  | 7.879  |
| 2  | 1.386  | 4.605  | 5.991  | 9.210  | 10.597 |
| 3  | 2.366  | 6.251  | 7.815  | 11.345 | 12.838 |
| 4  | 3.357  | 7.779  | 9.488  | 13.277 | 14.860 |
| 5  | 4.351  | 9.236  | 11.070 | 15.086 | 16.750 |
| 6  | 5.348  | 10.645 | 12.592 | 16.812 | 18.548 |
| 7  | 6.346  | 12.017 | 14.067 | 18.475 | 20.278 |
| 8  | 7.344  | 13.362 | 15.507 | 20.090 | 21.955 |
| 9  | 8.343  | 14.684 | 16.919 | 21.666 | 23.589 |
| 10 | 9.342  | 15.987 | 18.307 | 23.209 | 25.188 |
| 11 | 10.341 | 17.275 | 19.675 | 24.725 | 26.757 |
| 12 | 11.340 | 18.549 | 21.026 | 26.217 | 28.300 |
| 13 | 12.340 | 19.812 | 22.362 | 27.688 | 29.819 |
| 14 | 13.339 | 21.064 | 23.685 | 29.141 | 31.319 |
| 15 | 14.339 | 22.307 | 24.996 | 30.578 | 32.801 |
| 20 | 19.337 | 28.412 | 31.410 | 37.566 | 39.997 |
| 50 | 49.335 | 63.167 | 67.505 | 76.154 | 79.490 |
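Such tables are simply tabulated quantiles of the $\chi^2$ distribution; each entry can be reproduced with chi2.ppf (a Python/scipy sketch, scipy assumed). Note that the column headers are significance levels $\alpha$, so the critical value is the $(1-\alpha)$ quantile:

```python
from scipy.stats import chi2

# Reproduce the df = 1 row of the table:
# critical value = (1 - alpha) quantile of chi2 with 1 degree of freedom
for alpha in (0.5, 0.1, 0.05, 0.01, 0.005):
    print(alpha, round(chi2.ppf(1 - alpha, df=1), 3))
# 0.5 -> 0.455, 0.1 -> 2.706, 0.05 -> 3.841, 0.01 -> 6.635, 0.005 -> 7.879
```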
The value of the test statistic is 5.225806 and there is one degree of freedom.
For $\alpha = 0.05$:

- The table gives the critical value 3.841.
- The value of the test statistic is greater than the critical value.
- We reject the null hypothesis.
- The probability of making a type I error, i.e. rejecting the null hypothesis when it is in fact true, is less than $\alpha$.
If we reject $H_0$ then the actual p-value remains unknown. This value can of course be calculated by hand, if only approximately, but it is very time-consuming to do so. However, we already know that it is not greater than $\alpha$, and with the help of the table we can find further limits: by looking up the next critical value for the given degrees of freedom in the example above we can determine that the p-value is smaller than 0.05 but greater than 0.01.
For $\alpha = 0.01$:

- The critical value is 6.635.
- The value of the test statistic is smaller than the critical value.
- We fail to reject the null hypothesis.
Another question that may arise in the context of the sales analysis is: How close was the result?
With just 3 more sales for group A the result would have been:
%%R
chisq.test(c(A+3,B))
This p-value is higher than 0.05, i.e. in this case we cannot reject the hypothesis that the population probabilities are equal, unless we adopt the rather questionable significance level of $\alpha = 0.1$.
Note that 0.05 is not a very strict significance level - it means a chance of 1 in 20.
In the example above, with 3 more sales for group B the result would be:
%%R
chisq.test(c(A,B+3))
Here, the p-value is below $\alpha = 0.01$, and therefore our statement that the population probabilities are not equal would be much safer.
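Both what-if scenarios can be verified in Python as well (scipy assumed, as in the earlier cross-checks):

```python
from scipy.stats import chisquare

A, B = 22, 40  # original group totals

# Three more sales for group A: no longer significant at alpha = 0.05
p_closer = chisquare([A + 3, B]).pvalue

# Three more sales for group B: significant even at alpha = 0.01
p_wider = chisquare([A, B + 3]).pvalue

print(p_closer, p_wider)
```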