The Chi-square test is often used to analyse tabular data expressed in a contingency table. The question here is whether paired observations are independent of each other.
Suppose that a database query returns the following values for the number of contracts in selected areas and industries:
           Area A   Area B
Mining         16        6
Chemical        9       15
It would seem that area A is a mining region, while area B is dominated by the chemical industry; in other words, the number of contracts depends on the region. Is that statement valid?
Let's define the cells in the first row as (a, b) and the second row as (c, d).
> a=16; b=6; c=9; d=15
> matrix(c(a,b,c,d), nrow=2, byrow=T)
     [,1] [,2]
[1,]   16    6
[2,]    9   15
The chisq.test() function is used without Yates' correction to simplify matters:
> chisq.test(matrix(c(a,b,c,d), nrow=2, byrow=T), correct=F)

        Pearson's Chi-squared test

data:  matrix(c(a, b, c, d), nrow = 2, byrow = T)
X-squared = 5.741, df = 1, p-value = 0.01657
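The printed summary is not the only way to read the result. As a brief sketch, chisq.test() returns an "htest" list whose components (statistic, p.value, expected) can be used directly:

```r
# chisq.test() returns an "htest" object; its components can be
# inspected programmatically instead of reading the printout.
res <- chisq.test(matrix(c(16, 6, 9, 15), nrow = 2, byrow = TRUE),
                  correct = FALSE)
res$statistic   # the X-squared value
res$p.value     # the p-value
res$expected    # expected counts under independence
```

The $expected component is handy for checking the hand calculation that follows.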
The basic definition for the test statistic is still in place, but the calculation of the expected values is now:
E_{i,j} = (Σ_r O_{r,j})(Σ_c O_{i,c}) / N, where N = Σ_i Σ_j O_{i,j}, i.e. the sum of all values; the first factor is the total of column j and the second the total of row i.
To see why, add the row and column totals:
     16        6     R1 = 22
      9       15     R2 = 24
C1 = 25  C2 = 21      N = 46
Now for independence, i.e. no influence of region on the number of contracts, we expect e.g.

E_{1,1} / C1 = E_{1,2} / C2 = R1 / N

and therefore E_{1,1} = R1 C1 / N.
We can calculate the expected values:
> N = a+b+c+d
> e11 = (a+b)*(a+c)/N
> e12 = (b+a)*(b+d)/N
> e21 = (c+d)*(c+a)/N
> e22 = (d+c)*(d+b)/N
> e11
[1] 11.95652
> e12
[1] 10.04348
> e21
[1] 13.04348
> e22
[1] 10.95652
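As a cross-check, all four expected counts can be obtained in one step: outer() forms the matrix of products of row and column totals, which only needs dividing by N.

```r
# E_{i,j} = R_i * C_j / N for every cell at once
O <- matrix(c(16, 6, 9, 15), nrow = 2, byrow = TRUE)
E <- outer(rowSums(O), colSums(O)) / sum(O)
E
#          [,1]     [,2]
# [1,] 11.95652 10.04348
# [2,] 13.04348 10.95652
```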
The test statistic is:
> (a-e11)^2/e11 + (b-e12)^2/e12 + (c-e21)^2/e21 + (d-e22)^2/e22
[1] 5.741039
There are (r-1)(c-1) = 1 degrees of freedom.
Think of degrees of freedom as pieces of independent information. The row and column sums of the table are given; once O_{1,1} is set, everything else falls into place, so there is only one value that is 'free to change'.
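This argument can be made concrete: with the margins fixed, choosing the top-left cell determines every other cell.

```r
# Margins from the table above; o11 is the single 'free' value.
R1 <- 22; R2 <- 24; C1 <- 25; C2 <- 21
o11 <- 16
o12 <- R1 - o11   # row 1 must sum to R1
o21 <- C1 - o11   # column 1 must sum to C1
o22 <- R2 - o21   # row 2 must sum to R2
c(o12, o21, o22)  # recovers 6, 9, 15 -- the rest of the observed table
```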
Without the help of R, calculating the p-value for a given Chi-square value and df is somewhat complex; for testing purposes, however, we only need critical values, which can be looked up in a table such as the one included above.
E.g. for a significance level of α = 0.05 and df = 1 the critical value is 3.841; since 5.741 > 3.841, the hypothesis of independence is rejected.

However, for α = 0.01 and df = 1 the critical value is 6.635; since 5.741 < 6.635, the hypothesis cannot be rejected at this stricter level.
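With R at hand, both the critical values and the p-value follow directly from the quantile and distribution functions of the Chi-square distribution:

```r
# qchisq() gives critical values, pchisq() the tail probability
qchisq(0.95, df = 1)                       # critical value for alpha = 0.05
qchisq(0.99, df = 1)                       # critical value for alpha = 0.01
pchisq(5.741, df = 1, lower.tail = FALSE)  # the p-value, approx. 0.01657
```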
Yates' correction applies to 2 x 2 tables and simply subtracts 0.5 from the absolute value of each difference; the test statistic becomes

Σ (|O_i - E_i| - 0.5)^2 / E_i
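A sketch of this corrected statistic computed by hand for the table above (reusing the outer() form of the expected counts):

```r
O <- matrix(c(16, 6, 9, 15), nrow = 2, byrow = TRUE)
E <- outer(rowSums(O), colSums(O)) / sum(O)
# Yates-corrected statistic: 0.5 is subtracted from each |O - E|
sum((abs(O - E) - 0.5)^2 / E)   # approx. 4.409
```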
Including Yates' correction changes the value of the test statistic but not the conclusion in this case:
> chisq.test(matrix(c(a,b,c,d), nrow=2, byrow=T), correct=T)

        Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(a, b, c, d), nrow = 2, byrow = T)
X-squared = 4.409, df = 1, p-value = 0.03575