In [10]:
%reload_ext rpy2.ipython

Chi-Square Test for Independence

The Chi-square test is often used to analyse tabular data expressed in a contingency table. The question here is whether paired observations are independent of each other.

Suppose that a database query returns the following values for the number of contracts in selected areas and industries:

          Area A   Area B
Mining        16        6
Chemical       9       15

It would seem that area A is a mining area, while area B is dominated by the chemical industry; in other words, the number of contracts depends on the area. Can we find a statistical method to validate that statement?

Let's define the cells in the first row as (a, b) and the second row as (c, d).

In [11]:
%%R
a=16; b=6; c=9; d=15
matrix(c(a,b,c,d),nrow=2, byrow=T)
     [,1] [,2]
[1,]   16    6
[2,]    9   15

The chisq.test() function is used here without Yates' correction to simplify matters:

In [12]:
%%R
chisq.test(matrix(c(a,b,c,d),nrow=2, byrow=T), correct=F)
	Pearson's Chi-squared test

data:  matrix(c(a, b, c, d), nrow = 2, byrow = T)
X-squared = 5.741, df = 1, p-value = 0.01657

The basic definition for the test statistic is still

$$T = \sum \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

However, the calculation of the expected values is now a little more elaborate:

$$E_{i,j} = \frac{(\sum_c O_{i,c})(\sum_r O_{r,j})}{N}$$

with $N = \sum_i \sum_j O_{i,j}$ i.e. the total number of observations.

To see why, add the row and column totals:

          Area A    Area B
Mining        16         6   R1 = 22
Chemical       9        15   R2 = 24
         C1 = 25   C2 = 21   N = 46

For independence, i.e. no influence of the area on the number of contracts, we expect the same ratio of mining contracts in both columns:

$$\frac{E_{1,1}}{C_1} = \frac{E_{1,2}}{C_2} = \frac{R_1}{N}$$

which of course means that

$$E_{1,1} = \frac{R_1 C_1}{N}$$

Now we can calculate the expected values:

In [13]:
%%R
N=a+b+c+d            # total number of observations
e11=(a+b)*(a+c)/N    # R1*C1/N
e12=(b+a)*(b+d)/N    # R1*C2/N
e21=(c+d)*(c+a)/N    # R2*C1/N
e22=(d+c)*(d+b)/N    # R2*C2/N
c(e11,e12,e21,e22)
[1] 11.95652 10.04348 13.04348 10.95652

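As a cross-check, the expected counts can also be obtained generically from the marginal totals with outer(), and chisq.test() stores the same matrix in the expected component of its result (a small sketch; the matrix m is introduced here for convenience):

%%R
# Expected counts from the marginals: outer product of row sums and
# column sums, divided by the grand total
m = matrix(c(a,b,c,d), nrow=2, byrow=T)
outer(rowSums(m), colSums(m)) / sum(m)
# chisq.test() keeps the same matrix in its result
chisq.test(m, correct=F)$expected

Both agree with the values computed by hand above.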

With the observed and expected values we can compute the test statistic:

In [14]:
%%R
(a-e11)^2/e11 + (b-e12)^2/e12 + (c-e21)^2/e21 + (d-e22)^2/e22
[1] 5.741039
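
The same sum can be written in vectorized form, reusing the matrix m and the expected values from the cells above (a quick sketch):

%%R
# Vectorized form of the test statistic
E = matrix(c(e11,e12,e21,e22), nrow=2, byrow=T)
sum((m - E)^2 / E)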

To find the critical values for a given significance level we need the degrees of freedom. Since we have 2 rows and 2 columns there is (r-1)(c-1) = 1 degree of freedom.

Think of degrees of freedom as pieces of independent information. The row and column sums of the table are given; once $O_{1,1}$ is set, everything else falls into place; therefore, there is only one value that is 'free to change'.

Without the help of R, the calculation of the p-value for a given Chi-square value and df is somewhat tedious; on the other hand, for testing purposes we only need critical values, which can be looked up in the table in the previous section.
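
That said, R computes the p-value directly from the upper tail of the Chi-square distribution (a quick sketch):

%%R
# p-value: upper tail of the Chi-square distribution with df = 1
pchisq(5.741039, df=1, lower.tail=F)

which reproduces the p-value of 0.01657 reported by chisq.test() above.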

E.g. for a significance level of $\alpha = 0.05$ and df = 1

  • we learn the critical value of 3.841 from the table

  • we see that $5.741039 > 3.841$

  • therefore we reject the null hypothesis at $\alpha = 0.05$

However, for $\alpha = 0.01$ and df = 1

  • the critical value from the table is 6.635

  • and since $5.741039 < 6.635$

  • we do not reject the null hypothesis at $\alpha = 0.01$
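
The tabulated critical values can themselves be reproduced with R's quantile function (a quick check):

%%R
# Critical values for alpha = 0.05 and alpha = 0.01 with df = 1
qchisq(c(0.95, 0.99), df=1)

which returns 3.841459 and 6.634897, in line with the table.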

Yates' correction applies to 2 x 2 tables and simply subtracts 0.5 from each absolute difference, so that the test statistic becomes

$$\sum \frac{(|O_{i,j} - E_{i,j}| - 0.5)^2}{E_{i,j}}$$
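
As a hand check, the corrected statistic can be computed term by term with the expected values from above (a small sketch; in a 2 x 2 table all four absolute differences are equal):

%%R
# Yates-corrected test statistic, term by term
(abs(a-e11)-0.5)^2/e11 + (abs(b-e12)-0.5)^2/e12 +
  (abs(c-e21)-0.5)^2/e21 + (abs(d-e22)-0.5)^2/e22

which gives 4.409, matching the output below.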

Including Yates' correction changes the value of the test statistic but not the conclusions in this case:

In [15]:
%%R
chisq.test(matrix(c(a,b,c,d),nrow=2, byrow=T), correct=T)
	Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(a, b, c, d), nrow = 2, byrow = T)
X-squared = 4.409, df = 1, p-value = 0.03575
