Let's assume that the two highest-paid people in our sales team have earned their position by being top performers for several years, while two people who are new to the team get the lowest pay, since they lack previous experience and probably sell less than the seniors. Or do they?
> sort(salary)
 [1] 15000 15000 18464 19658 20495 21914 22061 22423 23335 23552
The sort() function returns the sorted values themselves and does not help here, but the order() function returns the indices that would sort the vector, and we can use it to arrange the sales figures by salary.
> salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
> sales
 [1] 20 17 24 19 24 24 21 29 13  9
> order(salary)
 [1]  9 10  3  5  1  8  2  6  4  7
> sales[order(salary)]
 [1] 13  9 24 24 20 29 17 24 19 21
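As a cross-check, R's order() can be mimicked in plain Python (a sketch using the ten salary/sales pairs from above; note Python indices are 0-based) by sorting the index list itself:

```python
# The data from the R session above
salary = [20495, 22061, 18464, 23335, 19658, 22423, 23552, 21914, 15000, 15000]
sales  = [20, 17, 24, 19, 24, 24, 21, 29, 13, 9]

# Equivalent of R's order(salary): the permutation of indices
# that sorts salary in ascending order (stable on ties)
order = sorted(range(len(salary)), key=lambda i: salary[i])

# Equivalent of R's sales[order(salary)]:
# sales arranged from lowest-paid to highest-paid seller
sales_by_salary = [sales[i] for i in order]
print(sales_by_salary)   # → [13, 9, 24, 24, 20, 29, 17, 24, 19, 21]
```

Python's sorted() is stable, so ties (the two 15000 salaries) keep their original relative order, just as in R.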
When sorted in this fashion, the data show that the two lowest-paid (junior) sellers together made fewer sales than the two highest-paid (senior) sellers:
> z <- sales[order(salary)]
> jun <- z[1] + z[2]
> sen <- z[9] + z[10]
> jun
[1] 22
> sen
[1] 40
On the other hand, the difference is not really huge. One could argue that there is clearly an element of random chance in sales, and this particular result is just coincidence - it does not signify a greater sales talent on the part of the seniors.
One way to tackle this question is to use chisq.test() with one-dimensional count data; applied to a single vector of counts it performs a goodness-of-fit test.
The Chi-square test is used here to test if a sample of data came from a population with a specific distribution.
> c(jun, sen)
[1] 22 40
> chisq.test(c(jun, sen))

        Chi-squared test for given probabilities

data:  c(jun, sen)
X-squared = 5.2258, df = 1, p-value = 0.02225
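For a two-category test (df = 1) the statistic and p-value can be reproduced outside R with only the Python standard library, since the chi-square survival function with one degree of freedom reduces to erfc (a cross-check sketch; the function name is ours):

```python
import math

def chisq_gof_df1(observed):
    """Chi-square goodness-of-fit test for two equally likely categories."""
    total = sum(observed)
    expected = total / 2                       # H0: equal probabilities
    stat = sum((o - expected) ** 2 / expected for o in observed)
    # For df = 1: P(X^2 >= stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p = chisq_gof_df1([22, 40])
print(round(stat, 4), round(p, 5))   # ~5.2258 and ~0.02225, matching R
```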
The p-value is the probability of obtaining a test statistic 'at least as extreme' as the one that was actually observed, assuming that the null hypothesis is true (in this case, that the population probabilities are equal).
The significance level is used to arrive at a decision: if the p-value is less than or equal to an (arbitrary!) significance level α, then the null hypothesis is rejected and the outcome is said to be statistically significant at level α. The level α itself is the probability of making a type I error, i.e. of rejecting the null hypothesis when it is in fact true.
Traditionally, either the α = 0.05 level (5% level) or the α = 0.01 level (1% level) has been used. Obviously, α = 0.01 is much more conservative than α = 0.05.
The choice of α is crucial in the above example. The p-value of 0.02225 means that H0 is rejected at the 5% level, but not at the stricter 1% level.
The Chi-square test uses the following assumptions and definitions: the data are counts from a random sample of independent observations; each expected count should be at least 5, otherwise the approximation to the chi-square distribution may be poor; and for the goodness-of-fit test the degrees of freedom are df = k - 1, where k is the number of categories.
Some interesting properties: the chi-square distribution with df degrees of freedom has mean df and variance 2·df, and for large df it approaches a normal distribution.
In order to understand the computation done by R and the reasoning behind the procedure, we will now do the test 'by hand', as it was done before high computing power was generally available and computing the exact p-value was infeasible. The test statistic is the sum of (observed - expected)^2/expected over all categories; under H0 both expected counts are 62/2 = 31:
> (22-31)^2/31 + (40-31)^2/31
[1] 5.225806
            Significance level
 df     0.5      0.1     0.05     0.01    0.005
  1   0.455    2.706    3.841    6.635    7.879
  2   1.386    4.605    5.991    9.210   10.597
  3   2.366    6.251    7.815   11.345   12.838
  4   3.357    7.779    9.488   13.277   14.860
  5   4.351    9.236   11.070   15.086   16.750
  6   5.348   10.645   12.592   16.812   18.548
  7   6.346   12.017   14.067   18.475   20.278
  8   7.344   13.362   15.507   20.090   21.955
  9   8.343   14.684   16.919   21.666   23.589
 10   9.342   15.987   18.307   23.209   25.188
 11  10.341   17.275   19.675   24.725   26.757
 12  11.340   18.549   21.026   26.217   28.300
 13  12.340   19.812   22.362   27.688   29.819
 14  13.339   21.064   23.685   29.141   31.319
 15  14.339   22.307   24.996   30.578   32.801
 20  19.337   28.412   31.410   37.566   39.997
 50  49.335   63.167   67.505   76.154   79.490
100  99.334  118.498  124.342  135.807  140.170
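Two rows of this table can be cross-checked in Python with closed-form survival functions: for df = 1 the tail probability is erfc(sqrt(x/2)), and for df = 2 it is simply e^(-x/2) (a sketch using only the standard library; the function name is ours):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function P(X >= x); closed forms for df = 1 and 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise NotImplementedError("closed form only for df = 1 or 2")

# The tail probability at each tabulated critical value
# should reproduce the significance level of its column.
for df, crit in [(1, 3.841), (1, 6.635), (2, 5.991)]:
    print(df, crit, round(chi2_sf(crit, df), 4))
```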
If we reject H0 in the above procedure the actual p-value remains unknown. This value can of course be calculated by hand, if only approximately, but it is very time-consuming to do so. However, we already know that it is not greater than α, and with the help of the table we can find further limits: by looking up the neighbouring critical values for df = 1 (3.841 < 5.226 < 6.635) we can determine that the p-value is smaller than 0.05 but greater than 0.01.
To further motivate the discussion, here is a Python simulation for tossing a fair coin and counting the cases where the test statistic is at least as large as the one given on the command line, together with some output from this program:
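The original chisim.py is not reproduced here; a minimal sketch of such a simulation (the argument order, names, and output format are assumptions) could look like this:

```python
"""Simulate chi-square statistics for a fair many-sided coin.

Assumed usage: chisim.py SIDES TOSSES THRESHOLD TRIALS
"""
import random
import sys

def chisq_stat(counts, expected):
    """Goodness-of-fit statistic for equal expected counts."""
    return sum((c - expected) ** 2 / expected for c in counts)

def simulate(sides, tosses, threshold, trials, rng=random):
    """Fraction of trials whose statistic is at least `threshold`."""
    expected = tosses / sides
    hits = 0
    for _ in range(trials):
        counts = [0] * sides
        for _ in range(tosses):
            counts[rng.randrange(sides)] += 1
        if chisq_stat(counts, expected) >= threshold:
            hits += 1
    return hits / trials

if __name__ == "__main__" and len(sys.argv) == 5:
    sides, tosses = int(sys.argv[1]), int(sys.argv[2])
    threshold, trials = float(sys.argv[3]), int(sys.argv[4])
    print("check:", [22, 40], chisq_stat([22, 40], tosses / sides))
    frac = simulate(sides, tosses, threshold, trials)
    print("chisq test stat >=", threshold, ":", frac)
```

The real program evidently also prints a sample of the simulated count pairs and their average, which is omitted in this sketch.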
xmdimrill:% chisim.py 2 62 5.2258 10000
check: [22, 40] 5.22580645161
[37, 25] 2.3226
[33, 29] 0.2581
[24, 38] 3.1613
[40, 22] 5.2258
[28, 34] 0.5806
[29, 33] 0.2581
[34, 28] 0.5806
[30, 32] 0.0645
[31, 31] 0.0000
[31, 31] 0.0000
[33, 29] 0.2581
[32, 30] 0.0645
[33, 29] 0.2581
[27, 35] 1.0323
[32, 30] 0.0645
[29, 33] 0.2581
[27, 35] 1.0323
[27, 35] 1.0323
[27, 35] 1.0323
observed avg: [31.045000000000002, 30.954999999999998]
chisq test stat >= 5.2258 : 0.0289
The simulation works with fair 'coins' of any number of sides; in this case it was run with the standard two-sided coin.
Another question that may arise in the context of the sales analysis is: How close was the result? Was it maybe a close shave for the seniors?
With just 3 more sales for the juniors the result would have been:
> c(jun + 3, sen)
[1] 25 40
> chisq.test(c(jun + 3, sen))

        Chi-squared test for given probabilities

data:  c(jun + 3, sen)
X-squared = 3.4615, df = 1, p-value = 0.06281
This p-value is higher than 0.05, i.e. in this case we cannot reject the hypothesis that the population probabilities are equal, unless we assume the rather questionable significance level of α = 0.1.
Note that even 0.05 is not a very strict significance level: it corresponds to a 1-in-20 chance of rejecting H0 when it is in fact true.
In the example above, with 3 more sales for the seniors the result would be:
> c(jun, sen + 3)
[1] 22 43
> chisq.test(c(jun, sen + 3))

        Chi-squared test for given probabilities

data:  c(jun, sen + 3)
X-squared = 6.7846, df = 1, p-value = 0.009195
Here, the p-value is below α = 0.01, and therefore the statement that the population probabilities are not equal would rest on much firmer ground.
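The sensitivity of the conclusion to a handful of sales can be summarised in one Python loop, reusing the df = 1 closed form for the p-value (a cross-check sketch; the values should match the three R outputs above):

```python
import math

def gof_pvalue(observed):
    """Statistic and p-value of the equal-probabilities test, two categories."""
    expected = sum(observed) / 2
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, math.erfc(math.sqrt(stat / 2))   # df = 1

# Original counts, three more junior sales, three more senior sales
for jun, sen in [(22, 40), (25, 40), (22, 43)]:
    stat, p = gof_pvalue([jun, sen])
    print(jun, sen, round(stat, 4), round(p, 5))
```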