%reload_ext rpy2.ipython
The Chi-square test is often used to analyse tabular data expressed in a contingency table. The question here is whether paired observations are independent of each other.
Suppose that a database query returns the following values for the number of contracts in selected areas and industries:
| | Area A | Area B |
|---|---|---|
| Mining | 16 | 6 |
| Chemical | 9 | 15 |
It would seem that area A is a mining area, while area B is dominated by the chemical industry; in other words, the number of contracts depends on the area. Can we find a statistical method to validate that statement?
Let's define the cells in the first row as (a, b) and those in the second row as (c, d).
%%R
a <- 16; b <- 6; c <- 9; d <- 15
matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)
The chisq.test() function is used here without Yates's correction to simplify matters:
%%R
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = FALSE)
The basic definition of the test statistic is still

$$T = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

However, the calculation of the expected values is now a little more elaborate:
$$E_{i,j} = \frac{(\sum_c O_{i,c})(\sum_r O_{r,j})}{N}$$

with $N = \sum_i \sum_j O_{i,j}$, i.e. the total number of observations.
To see why, add the row and column totals:
| | Area A | Area B | |
|---|---|---|---|
| Mining | 16 | 6 | R1 = 22 |
| Chemical | 9 | 15 | R2 = 24 |
| | C1 = 25 | C2 = 21 | N = 46 |
For independence, i.e. no influence of the area on the number of contracts, we expect the same ratio of mining contracts in both columns:

$$\frac{E_{1,1}}{C_1} = \frac{E_{1,2}}{C_2} = \frac{R_1}{N}$$

which of course means that
$$E_{1,1} = \frac{R_1 C_1}{N}$$

Now we can calculate the expected values:
%%R
N <- a + b + c + d
e11 <- (a + b) * (a + c) / N  # R1 * C1 / N
e12 <- (a + b) * (b + d) / N  # R1 * C2 / N
e21 <- (c + d) * (a + c) / N  # R2 * C1 / N
e22 <- (c + d) * (b + d) / N  # R2 * C2 / N
c(e11, e12, e21, e22)
[1] 11.95652 10.04348 13.04348 10.95652
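As a cross-check, the htest object returned by chisq.test() stores the expected counts, so the same numbers can also be read off directly:

%%R
# expected counts as computed internally by chisq.test()
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = FALSE)$expected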
With the observed and expected values we can compute the test statistic:
%%R
# sum of (O - E)^2 / E over all four cells
(a - e11)^2 / e11 + (b - e12)^2 / e12 + (c - e21)^2 / e21 + (d - e22)^2 / e22
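The result should agree with the X-squared value printed by chisq.test() above; as a cross-check, the statistic is also stored in the returned htest object:

%%R
# test statistic as computed internally by chisq.test()
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = FALSE)$statistic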
To find the critical values for a given significance level we need the degrees of freedom. Since we have 2 rows and 2 columns, there is $(r-1)(c-1) = 1$ degree of freedom.
Think of degrees of freedom as pieces of independent information. The row and column sums of the table are given; once $O_{1,1}$ is set, everything else falls into place; therefore, there is only one value that is 'free to change'.
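A small sketch to see this in R, reusing the totals from the table above: fix the margins, choose $O_{1,1}$, and the remaining three cells follow.

%%R
# with fixed margins, choosing O11 determines the whole 2 x 2 table
R1 <- 22; C1 <- 25; N <- 46
o11 <- 16
o12 <- R1 - o11           # row 1 must sum to R1
o21 <- C1 - o11           # column 1 must sum to C1
o22 <- (N - R1) - o21     # row 2 must sum to R2 = N - R1
matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE)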
Without the help of R the calculation of the p-value for a given Chi-square value and df is somewhat tedious; on the other hand, for testing purposes we only need critical values, which can be looked up in the table in the previous section.
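That said, R has both built in: qchisq() gives the critical value for a given significance level, and pchisq() with lower.tail=FALSE gives the p-value.

%%R
qchisq(0.95, df = 1)                          # critical value for alpha = 0.05
qchisq(0.99, df = 1)                          # critical value for alpha = 0.01
pchisq(5.741039, df = 1, lower.tail = FALSE)  # p-value for our test statistic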
E.g. for a significance level of $\alpha = 0.05$ and df = 1

- we learn the critical value of 3.841 from the table
- we see that $5.741039 > 3.841$
- therefore we reject the null hypothesis at $\alpha = 0.05$
However, for $\alpha = 0.01$ and df = 1

- the critical value from the table is 6.635
- and since $5.741039 < 6.635$
- we do not reject the null hypothesis at $\alpha = 0.01$
Yates's correction applies to 2 x 2 tables and simply subtracts 0.5 from the absolute value of each difference, so the test statistic becomes

$$T = \sum_{i,j} \frac{(|O_{i,j} - E_{i,j}| - 0.5)^2}{E_{i,j}}$$

Including Yates's correction changes the value of the test statistic but not the conclusions in this case:
%%R
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = TRUE)
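To verify by hand, a small sketch reusing the expected values from above; the sum should match the X-squared value reported by chisq.test() with correct=TRUE:

%%R
# Yates: subtract 0.5 from each absolute difference before squaring
(abs(a - e11) - 0.5)^2 / e11 + (abs(b - e12) - 0.5)^2 / e12 +
  (abs(c - e21) - 0.5)^2 / e21 + (abs(d - e22) - 0.5)^2 / e22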