Chi-Square Test of Independence

The Chi-square test is often used to analyse count data expressed as a contingency table. The question here is whether the two classifications, i.e. the rows and columns of the table, are independent of each other.

Suppose that a database query returns the following values for the number of contracts in selected areas and industries:

            Area A   Area B
 Mining         16        6
 Chemical        9       15

It would seem that area A is a mining region, while area B is dominated by the chemical industry; in other words, the distribution of contracts across industries depends on the region. Is that statement valid?

Let's define the cells in the first row as (a, b) and the second row as (c, d).

    a=16; b=6; c=9; d=15

    matrix(c(a,b,c,d),nrow=2, byrow=T)
     [,1] [,2]
[1,]   16    6
[2,]    9   15

The chisq.test() function is used without Yates' correction to simplify matters:

  chisq.test(matrix(c(a,b,c,d),nrow=2, byrow=T), correct=F)

	Pearson's Chi-squared test

data:  matrix(c(a, b, c, d), nrow = 2, byrow = T) 
X-squared = 5.741, df = 1, p-value = 0.01657
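
The result returned by chisq.test() can also be stored and its components inspected directly (a minimal sketch; the variable name ct is my own choice):

  ct <- chisq.test(matrix(c(a,b,c,d),nrow=2, byrow=T), correct=F)
  ct$statistic     # the X-squared value reported above
  ct$p.value       # the corresponding p-value
  ct$expected      # expected counts, derived by hand below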

The basic definition of the test statistic is unchanged, but the expected values are now calculated as:

Ei,j = (Σc Oi,c)(Σr Or,j) / N, i.e. (total of row i) x (total of column j) / N, where N = Σi Σj Oi,j is the sum of all values.

To see why, add the row and column totals:

     16         6    R1 = 22
      9        15    R2 = 24
C1 = 25   C2 = 21    N  = 46
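
The same totals can be checked in R (a sketch, reusing the matrix of observed counts):

  O <- matrix(c(a,b,c,d),nrow=2, byrow=T)   # observed counts
  rowSums(O)       # R1, R2
[1] 22 24
  colSums(O)       # C1, C2
[1] 25 21
  sum(O)           # N
[1] 46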

Now, for independence, i.e. no influence of the region on the number of contracts, we expect the proportion of mining contracts to be the same in each area and equal to the overall proportion, e.g.
E1,1 / C1 = E1,2 / C2 = R1 / N
and therefore E1,1 = R1 C1 / N.

We can calculate the expected values:

  N=a+b+c+d
  e11=(a+b)*(a+c)/N
  e12=(b+a)*(b+d)/N
  e21=(c+d)*(c+a)/N
  e22=(d+c)*(d+b)/N
  e11
[1] 11.95652
  e12
[1] 10.04348
  e21
[1] 13.04348
  e22
[1] 10.95652
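
All four expected values can also be obtained in a single step, as the outer product of the row and column totals divided by N (a sketch; O is the matrix of observed counts):

  O <- matrix(c(a,b,c,d),nrow=2, byrow=T)
  outer(rowSums(O), colSums(O)) / sum(O)    # E[i,j] = Ri * Cj / N
         [,1]     [,2]
[1,] 11.95652 10.04348
[2,] 13.04348 10.95652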

The test statistic is:

  (a-e11)^2/e11 + (b-e12)^2/e12 + (c-e21)^2/e21 + (d-e22)^2/e22
[1] 5.741039
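
Written on whole matrices, the same sum reproduces the value reported by chisq.test() (a sketch; O and E are redefined here for completeness):

  O <- matrix(c(a,b,c,d),nrow=2, byrow=T)
  E <- outer(rowSums(O), colSums(O)) / sum(O)
  sum((O - E)^2 / E)      # Pearson Chi-square statistic
[1] 5.741039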

There is (r-1)(c-1) = 1 degree of freedom, where r and c are the numbers of rows and columns.

Think of degrees of freedom as pieces of independent information. The row and column sums of the table are given; once O1,1 is set, everything else falls into place; therefore, there is only one value that is 'free to change'.
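
To see this concretely, fix the margins and one cell; the remaining cells then follow by subtraction (a sketch, with o11 as the single 'free' value):

  R1 <- 22; R2 <- 24; C1 <- 25; C2 <- 21   # fixed margins
  o11 <- 16                                # the one value free to change
  o12 <- R1 - o11                          # 6
  o21 <- C1 - o11                          # 9
  o22 <- R2 - o21                          # 15 (equivalently C2 - o12)
  c(o12, o21, o22)
[1]  6  9 15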

Without the help of R, calculating the p-value for a given Chi-square value and df is somewhat complex; on the other hand, for testing purposes we only need critical values, which can be looked up in a table such as the one already included above.
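
With R, of course, the p-value is a single call to the upper tail of the Chi-square distribution (a sketch; rounded to four significant digits for display):

  signif(pchisq(5.741039, df=1, lower.tail=FALSE), 4)
[1] 0.01657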

E.g. for a significance level of α = 0.05 and df = 1 the critical value is 3.84; since the test statistic 5.741 exceeds it, the hypothesis of independence is rejected at that level.

However, for α = 0.01 and df = 1 the critical value is 6.63; since 5.741 does not exceed it, the result is not significant at the 1% level, in line with the p-value of 0.017.
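
The critical values themselves come from the quantile function (a sketch):

  qchisq(0.95, df=1)   # critical value for alpha = 0.05
[1] 3.841459
  qchisq(0.99, df=1)   # critical value for alpha = 0.01
[1] 6.634897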

Yates' correction applies to 2 x 2 tables and simply subtracts 0.5 from the absolute value of each difference between observed and expected count, so that the test statistic becomes

Σ (|Oi - Ei| - 0.5)^2 / Ei

Including Yates' correction changes the value of the test statistic but not the conclusion in this case:

  chisq.test(matrix(c(a,b,c,d),nrow=2, byrow=T), correct=T)

	Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(a, b, c, d), nrow = 2, byrow = T) 
X-squared = 4.409, df = 1, p-value = 0.03575
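
The corrected statistic can also be reproduced by hand, in analogy to the calculation above (a sketch; rounded for display):

  O <- matrix(c(a,b,c,d),nrow=2, byrow=T)
  E <- outer(rowSums(O), colSums(O)) / sum(O)
  round(sum((abs(O - E) - 0.5)^2 / E), 3)   # Yates-corrected statistic
[1] 4.409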