%reload_ext rpy2.ipython
Let's assume that the two highest-paid people on our sales team have earned their position by being top performers for several years, while two people who are new to the team get the lowest pay since they lack previous experience and probably sell less than the seniors. Or do they?
%%R
sales <- c(20, 17, 24, 19, 24, 24, 21, 29, 13, 9)
pay <- c(2050, 2210, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500)
sort(pay)
The sort() function returns the sorted values themselves and does not help here, but the order() function returns the indices that would sort the vector, and we can use it to reorder the sales by pay.
%%R
order(pay)
%%R
z <- sales[order(pay)]
z
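Since this notebook drives R through rpy2, the host language is Python; as a cross-check, the order() step can be mirrored with numpy.argsort() (numpy is an assumption of this sketch; note that Python indices are 0-based):

```python
import numpy as np

# Same data as in the R cells above
sales = np.array([20, 17, 24, 19, 24, 24, 21, 29, 13, 9])
pay = np.array([2050, 2210, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500])

# np.argsort plays the role of R's order(): it returns the indices
# that would sort `pay`. kind="stable" keeps tied pay values (the two
# 1500s) in their original order, matching R's stable ordering.
idx = np.argsort(pay, kind="stable")
z = sales[idx]   # sales reordered from lowest-paid to highest-paid
print(z)         # [13  9 24 24 20 29 17 24 19 21]
```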
We form two groups by taking the people with the lowest and highest pay:
%%R
A <- z[1] + z[2]    # combined sales of the two lowest-paid people
B <- z[9] + z[10]   # combined sales of the two highest-paid people
c(A, B)
When printed in this fashion, the data show that

- Group A: the two lowest-paid people have the lowest sales
- Group B: the two highest-paid people sell much more than group A
One could argue that there is clearly an element of random chance in sales, and this particular result is just coincidence - it does not signify a greater sales talent on the part of group B.
One way to tackle this question is to use chisq.test() with one-dimensional count data, which in this case performs a goodness-of-fit test.
The Chi-square test is used here to test if a sample of data came from a population with a specific distribution.
In other words, the chi-square test is testing the null hypothesis which states that there is no significant difference between the expected and observed result.
We are using the test in its simplest form here, testing for equal probabilities in two classes.
We are testing $H_0$: the underlying random variable takes on values corresponding to the two classes with equal probabilities $p_1 = p_2 = 0.5$.
The underlying random variable in this case is defined as whether a sale is made by group A or B. It is equivalent to flipping a fair coin.
Since there is a total number of 62 sales, we would expect values of 31 for each group. The actual values are different, and we want to compute how likely we are to see this difference (or higher) from the expected values when the probabilities are in fact equal.
%%R
c(A, B)
%%R
chisq.test(c(A, B))
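The same computation can be reproduced in the notebook's host language, Python, as a sanity check (scipy is an assumption of this sketch, not part of the original analysis). scipy.stats.chisquare defaults to equal expected frequencies, which is exactly our $H_0$; scipy.stats.binomtest adds the exact binomial version of the coin-flip model:

```python
from scipy.stats import chisquare, binomtest

A, B = 22, 40  # group totals, as computed in the R cells

# Goodness-of-fit test; expected frequencies default to the mean (31, 31)
stat, p = chisquare([A, B])
print(stat, p)   # statistic ~ 5.226, p ~ 0.0222, matching chisq.test()

# Exact counterpart: probability of a split at least this lopsided
# in 62 flips of a fair coin
p_exact = binomtest(A, n=A + B, p=0.5).pvalue
print(p_exact)
```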
The p-value is the probability of obtaining a test statistic 'at least as extreme' as the one that was actually observed, assuming that the null hypothesis is true (in this case, that the population probabilities are equal).
The significance level is used to arrive at a decision: if the p-value is less than or equal to a (somewhat arbitrary!) significance level $\alpha$, the null hypothesis is rejected and the outcome is said to be statistically significant at level $\alpha$. The level $\alpha$ is the probability of making a type I error, i.e. rejecting the null hypothesis when it is in fact true.
Various values for $\alpha$ are commonly used, often $\alpha = 0.05$ which does not put a very rigorous limit on the type I error. Obviously, a lower level like $\alpha = 0.01$ is much more conservative.
The choice of $\alpha$ is crucial in the above example. The p-value = 0.02225 means that

- with $\alpha = 0.05$ we would reject the null hypothesis that the probabilities are equal; in other words, we would state that based on our test the difference in sales is not just coincidence;
- with $\alpha = 0.01$ we would not reject the null hypothesis, i.e. the observed difference in sales could still be due to chance.
The Chi-square test uses the following assumptions and definitions:
The test works with data derived from counting independent occurrences in classes.
It tests $H_0$: the underlying random variable takes values corresponding to $k$ classes with probabilities $p_1, \dots, p_k$.
The observed counts $O_i$ for each class and the expected counts $E_i = N p_i$ are used in the test statistic:
$$T = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}$$
The distribution $\chi^2(k)$ is the distribution of a sum of the squares of k independent standard normal random variables.
The test statistic T follows $\chi^2(k-1)$ asymptotically.
Some interesting properties:
- Since the Chi-square test relies on an approximation, it should not be used if the expected count $E_i$ in any category is less than 5.
- There is no such limitation on the observed counts $O_i$.
- In many applications the $p_i$ are all identical, but they do not have to be.
- The procedure of arriving at a decision is simple enough to be done 'by hand' with the help of a Chi-square table.
In order to understand the computation done by the R package and the reasoning behind the procedure we will now do the test 'by hand', as it had been done before the general availability of high computing power, when it was infeasible to compute the p-value.
We are testing $H_0$: the underlying random variable takes on values corresponding to the two classes with equal probabilities i.e. $p_1 = p_2 = 0.5$
$H_0$ can also be stated as: there is no significant difference between expected and observed results.
Our observations are the number of sales contracts in two classes: [22,40]
The total count is $N = 22 + 40 = 62$. With equal probabilities, i.e. equal sales skill for both classes, we would expect the value $N/2 = 31$ for each class.
The value of the test statistic is often called the Chi-square value, which is somewhat confusing: it is the distribution of the test statistic that approximates the $\chi^2$ distribution, not the value itself.
%%R
(22-31)^2/31 + (40-31)^2/31
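Today the table lookup that follows can be replaced by evaluating the survival function of $\chi^2(1)$ directly; a quick Python cross-check (assuming scipy is available):

```python
from scipy.stats import chi2

# The test statistic computed 'by hand', as in the R cell above
T = (22 - 31)**2 / 31 + (40 - 31)**2 / 31   # = 162/31, about 5.226

# p-value = P(chi2(df=1) >= T), i.e. the survival function at T
p = chi2.sf(T, df=1)
print(T, p)   # ~5.2258  ~0.0222
```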
To determine the degrees of freedom of the $\chi^2$ distribution we take the number of classes minus the reductions in the degrees of freedom. In this case there is one reduction since the numbers must always add up to N. In other words, if we change the first value, then the second value follows, in order to still sum to a given N.
In the table we look up the critical value for a given significance level and degrees of freedom.
If the value of the test statistic is greater than or equal to the critical value we reject $H_0$; otherwise we fail to reject $H_0$.
Critical values of the $\chi^2$ distribution, by significance level and degrees of freedom (df):

| df | $\alpha$ = 0.5 | $\alpha$ = 0.1 | $\alpha$ = 0.05 | $\alpha$ = 0.01 | $\alpha$ = 0.005 |
|----|--------|--------|--------|--------|--------|
| 1  | 0.455  | 2.706  | 3.841  | 6.635  | 7.879  |
| 2  | 1.386  | 4.605  | 5.991  | 9.210  | 10.597 |
| 3  | 2.366  | 6.251  | 7.815  | 11.345 | 12.838 |
| 4  | 3.357  | 7.779  | 9.488  | 13.277 | 14.860 |
| 5  | 4.351  | 9.236  | 11.070 | 15.086 | 16.750 |
| 6  | 5.348  | 10.645 | 12.592 | 16.812 | 18.548 |
| 7  | 6.346  | 12.017 | 14.067 | 18.475 | 20.278 |
| 8  | 7.344  | 13.362 | 15.507 | 20.090 | 21.955 |
| 9  | 8.343  | 14.684 | 16.919 | 21.666 | 23.589 |
| 10 | 9.342  | 15.987 | 18.307 | 23.209 | 25.188 |
| 11 | 10.341 | 17.275 | 19.675 | 24.725 | 26.757 |
| 12 | 11.340 | 18.549 | 21.026 | 26.217 | 28.300 |
| 13 | 12.340 | 19.812 | 22.362 | 27.688 | 29.819 |
| 14 | 13.339 | 21.064 | 23.685 | 29.141 | 31.319 |
| 15 | 14.339 | 22.307 | 24.996 | 30.578 | 32.801 |
| 20 | 19.337 | 28.412 | 31.410 | 37.566 | 39.997 |
| 50 | 49.335 | 63.167 | 67.505 | 76.154 | 79.490 |
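Such tables are simply tabulated quantiles of the $\chi^2$ distribution; each entry can be reproduced with chi2.ppf (a Python/scipy sketch, scipy assumed). Note that the column headers are significance levels $\alpha$, so the critical value is the $(1-\alpha)$ quantile:

```python
from scipy.stats import chi2

# Reproduce the df = 1 row of the table:
# critical value = (1 - alpha) quantile of chi2 with 1 degree of freedom
for alpha in (0.5, 0.1, 0.05, 0.01, 0.005):
    print(alpha, round(chi2.ppf(1 - alpha, df=1), 3))
# 0.5 -> 0.455, 0.1 -> 2.706, 0.05 -> 3.841, 0.01 -> 6.635, 0.005 -> 7.879
```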
The value of the test statistic is 5.225806 and there is one degree of freedom.
For $\alpha = 0.05$:

- The table gives the critical value 3.841.
- The value of the test statistic is greater than the critical value.
- We reject the null hypothesis.
- The probability of making a type I error, i.e. rejecting the null hypothesis when it is in fact true, is less than $\alpha$.
If we reject $H_0$ then the actual p-value remains unknown. This value can of course be calculated by hand, if only approximately, but it is very time-consuming to do so. However, we already know that it is not greater than $\alpha$, and with the help of the table we can find further limits: by looking up the next critical value for the given degrees of freedom in the example above we can determine that the p-value is smaller than 0.05 but greater than 0.01.
For $\alpha = 0.01$:

- The critical value is 6.635.
- The value of the test statistic is smaller than the critical value.
- We fail to reject the null hypothesis.
Another question that may arise in the context of the sales analysis is: How close was the result?
With just 3 more sales for group A the result would have been:
%%R
chisq.test(c(A+3,B))
This p-value is higher than 0.05, i.e. in this case we cannot reject the hypothesis that the population probabilities are equal, unless we adopt the rather questionable significance level of $\alpha = 0.1$.
Note that 0.05 is not a very strict significance level - it means a chance of 1 in 20.
In the example above, with 3 more sales for group B the result would be:
%%R
chisq.test(c(A,B+3))
Here, the p-value is below $\alpha = 0.01$, and therefore our statement that the population probabilities are not equal would be much safer.
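Both what-if scenarios can be verified in Python as well (scipy assumed, as in the earlier cross-checks):

```python
from scipy.stats import chisquare

A, B = 22, 40  # original group totals

# Three more sales for group A: no longer significant at alpha = 0.05
p_closer = chisquare([A + 3, B]).pvalue

# Three more sales for group B: significant even at alpha = 0.01
p_wider = chisquare([A, B + 3]).pvalue

print(p_closer, p_wider)
```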