%reload_ext rpy2.ipython
The Chi-square test is often used to analyse tabular data expressed in a contingency table. The question here is whether paired observations are independent of each other.
Suppose that a database query returns the following values for the number of contracts in selected areas and industries:
| | Area A | Area B |
|---|---|---|
| Mining | 16 | 6 |
| Chemical | 9 | 15 |
It would seem that area A is a mining area, while area B is dominated by the chemical industry; in other words, the number of contracts depends on the area. Can we find a statistical method to validate that statement?
Let's define the cells in the first row as (a, b) and those in the second row as (c, d).
%%R
a <- 16; b <- 6; c <- 9; d <- 15
matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)
The chisq.test() function is used here without Yates's correction to simplify matters:
%%R
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = FALSE)
The basic definition of the test statistic is still

$$T = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

However, the calculation of the expected values is now a little more elaborate:
$$E_{i,j} = \frac{(\sum_c O_{i,c})(\sum_r O_{r,j})}{N}$$

with $N = \sum_i \sum_j O_{i,j}$, i.e. the total number of observations.
To see why, add the row and column totals:
| | Area A | Area B | |
|---|---|---|---|
| Mining | 16 | 6 | R1 = 22 |
| Chemical | 9 | 15 | R2 = 24 |
| | C1 = 25 | C2 = 21 | N = 46 |
For independence, i.e. no influence of the area on the number of contracts, we expect the same ratio of mining contracts in both columns:

$$\frac{E_{1,1}}{C_1} = \frac{E_{1,2}}{C_2} = \frac{R_1}{N}$$

which of course means that
$$E_{1,1} = \frac{R_1 C_1}{N}$$

Now we can calculate the expected values:
%%R
N <- a + b + c + d
e11 <- (a + b) * (a + c) / N  # R1 * C1 / N
e12 <- (a + b) * (b + d) / N  # R1 * C2 / N
e21 <- (c + d) * (a + c) / N  # R2 * C1 / N
e22 <- (c + d) * (b + d) / N  # R2 * C2 / N
c(e11, e12, e21, e22)
[1] 11.95652 10.04348 13.04348 10.95652
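As a cross-check, the htest object returned by chisq.test() stores the expected counts, so the same numbers can also be read off directly:

%%R
# expected counts as computed internally by chisq.test()
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = FALSE)$expected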
With the observed and expected values we can compute the test statistic:
%%R
# sum of (O - E)^2 / E over all four cells
(a - e11)^2 / e11 + (b - e12)^2 / e12 + (c - e21)^2 / e21 + (d - e22)^2 / e22
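The result should agree with the X-squared value printed by chisq.test() above; as a cross-check, the statistic is also stored in the returned htest object:

%%R
# test statistic as computed internally by chisq.test()
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = FALSE)$statistic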
To find the critical values for a given significance level we need the degrees of freedom. Since we have 2 rows and 2 columns, there is $(r-1)(c-1) = 1$ degree of freedom.
Think of degrees of freedom as pieces of independent information. The row and column sums of the table are given; once $O_{1,1}$ is set, everything else falls into place; therefore, there is only one value that is 'free to change'.
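A small sketch to see this in R, reusing the totals from the table above: fix the margins, choose $O_{1,1}$, and the remaining three cells follow.

%%R
# with fixed margins, choosing O11 determines the whole 2 x 2 table
R1 <- 22; C1 <- 25; N <- 46
o11 <- 16
o12 <- R1 - o11           # row 1 must sum to R1
o21 <- C1 - o11           # column 1 must sum to C1
o22 <- (N - R1) - o21     # row 2 must sum to R2 = N - R1
matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE)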
Without the help of R the calculation of the p-value for a given Chi-square value and df is somewhat tedious; on the other hand, for testing purposes we only need critical values, which can be looked up in the table in the previous section.
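That said, R has both built in: qchisq() gives the critical value for a given significance level, and pchisq() with lower.tail=FALSE gives the p-value.

%%R
qchisq(0.95, df = 1)                          # critical value for alpha = 0.05
qchisq(0.99, df = 1)                          # critical value for alpha = 0.01
pchisq(5.741039, df = 1, lower.tail = FALSE)  # p-value for our test statistic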
E.g. for a significance level of $\alpha = 0.05$ and df = 1

- we learn the critical value of 3.841 from the table
- we see that $5.741039 > 3.841$
- therefore we reject the null hypothesis at $\alpha = 0.05$
However, for $\alpha = 0.01$ and df = 1

- the critical value from the table is 6.635
- and since $5.741039 < 6.635$
- we do not reject the null hypothesis at $\alpha = 0.01$
Yates's correction applies to 2 x 2 tables and simply subtracts 0.5 from the absolute value of each difference, so the test statistic becomes

$$T = \sum_{i,j} \frac{(|O_{i,j} - E_{i,j}| - 0.5)^2}{E_{i,j}}$$

Including Yates's correction changes the value of the test statistic but not the conclusions in this case:
%%R
chisq.test(matrix(c(a, b, c, d), nrow = 2, byrow = TRUE), correct = TRUE)
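To verify by hand, a small sketch reusing the expected values from above; the sum should match the X-squared value reported by chisq.test() with correct=TRUE:

%%R
# Yates: subtract 0.5 from each absolute difference before squaring
(abs(a - e11) - 0.5)^2 / e11 + (abs(b - e12) - 0.5)^2 / e12 +
  (abs(c - e21) - 0.5)^2 / e21 + (abs(d - e22) - 0.5)^2 / e22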