A question related to dependence vs. independence is whether there is an association between two ratio-scaled sample variables, such as salary and sales. We can ask: How strong is that association?
The Pearson correlation coefficient is

r_{x,y} = s_{x,y} / (s_x s_y)

with the sample covariance

s_{x,y} = 1/(N − 1) Σ (x_i − m_x)(y_i − m_y)

The sample standard deviation is s_x = √(s_{x,x}).
cor(salary, sales)
[1] 0.6196906
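As a sanity check, the same value follows directly from the definitions above, using R's cov() and sd():

cov(salary, sales) / (sd(salary) * sd(sales))
[1] 0.6196906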
The estimated correlation of 0.62 is moderate, not strong. cor.test() shows that at α = 0.05 this value is not even significantly different from zero:
cor.test(sales, salary)

        Pearson's product-moment correlation

data:  sales and salary
t = 2.2332, df = 8, p-value = 0.05601
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01629263  0.89867697
sample estimates:
      cor 
0.6196906 
The idea of the test is that we have drawn a small sample of N values from a large population. If the true correlation in the population is r = 0, then the statistic t = r / √((1 − r²) / (N − 2)) is approximately t-distributed with df = N − 2, for N ≥ 6 and normally distributed underlying variables.
The corresponding p-value in the example is 0.05601; in other words, if the actual correlation in the population were zero, there would still be a probability of 0.05601 of drawing a small sample with a correlation at least as extreme as our estimate of r = 0.62. That probability is too high to reject the null hypothesis at α = 0.05.
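The statistic can be recomputed by hand; a short sketch using the sample at hand and pt(), R's t-distribution function:

r <- cor(sales, salary)
N <- length(sales)
tstat <- r / sqrt((1 - r^2) / (N - 2))   # 2.2332, as reported by cor.test()
2 * pt(-abs(tstat), df = N - 2)          # two-sided p-value, 0.05601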
Another way of studying the association between variables is to fit a linear model.
For vectors x and y: find β_0, β_1 in y_i = β_0 + β_1 x_i + ε_i such that Σ ε_i² is minimal.
Recall the values for x and y:
salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
sales
 [1] 20 17 24 19 24 24 21 29 13  9
Fitting the model in R uses the symbol "~" in the formula:
fit <- lm(sales ~ salary)
fit

Call:
lm(formula = sales ~ salary)

Coefficients:
(Intercept)       salary  
  -3.236122     0.001151  
Printing the fitted model only shows the estimates for the intercept and the gradient, in this case labelled salary. The summary() function provides considerably more information:
summary(fit)

Call:
lm(formula = sales ~ salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1530 -4.1817 -0.6888  3.8170  7.0161 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -3.2361224 10.5187490  -0.308    0.766  
salary       0.0011509  0.0005153   2.233    0.056 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.886 on 8 degrees of freedom
Multiple R-squared:  0.384,	Adjusted R-squared:  0.307 
F-statistic: 4.987 on 1 and 8 DF,  p-value: 0.05601
The first data column contains the estimates for β_0 and β_1. The estimates divided by their standard errors result in the t-values, e.g. for salary:
0.0011509 / 0.0005153
[1] 2.233456
Pr(>|t|) is the probability of obtaining an estimate with at least that absolute t-value if the actual coefficient is zero. Since both probabilities are higher than 0.05, neither estimate is significant at α = 0.05.
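These columns can be reproduced from the fitted model; a short sketch using coef(), vcov(), and pt():

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))              # the Std. Error column
est / se                                  # the t value column
2 * pt(-abs(est / se), df.residual(fit))  # the Pr(>|t|) column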
The actual values y_i are:

sales
 [1] 20 17 24 19 24 24 21 29 13  9
To get an idea of how well the model performs for each data point, we can look at the fitted values and the residuals:
fitted(fit)
       1        2        3        4        5        6        7        8 
20.35078 22.15303 18.01338 23.61923 19.38751 22.56964 23.86897 21.98385 
       9       10 
14.02680 14.02680 
resid(fit)
         1          2          3          4          5          6          7 
-0.3507826 -5.1530316  5.9866170 -4.6192290  4.6124885  1.4303566 -2.8689659 
         8          9         10 
 7.0161451 -1.0267990 -5.0267990 
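The fitted values are simply the estimated equation applied to each data point; a one-line check:

coef(fit)[1] + coef(fit)[2] * salary   # reproduces fitted(fit)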
The residual standard error is defined as √(SSE / (N − 2)), since we estimate the two parameters β_0 and β_1, with

SSE = Σ (y_i − ŷ_i)²
sqrt(sum(resid(fit)^2)/(length(sales)-2))
[1] 4.885628
About 2/3 of the residuals lie within ±4.9 (if the residuals are approximately normal).
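For this particular sample the fraction can be checked directly; 6 of the 10 residuals fall within that range:

mean(abs(resid(fit)) <= 4.886)
[1] 0.6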
Multiple R-squared is the squared correlation coefficient; it indicates the ratio of explained variation to total variation (measured from the mean of the y_i values). A ratio close to 1 indicates a good fit.
Note that for each data point the deviation from the mean decomposes into ŷ_i − m_y and y_i − ŷ_i.
This leads to the definition of R² as the fraction of explained and total variation:

SSR = Σ (ŷ_i − m_y)²
SSE = Σ (y_i − ŷ_i)²
SST = Σ (y_i − m_y)² = SSR + SSE
R² = SSR / SST
Calculating these values confirms the model result:
ssr = sum((fitted(fit) - mean(sales))^2)
sse = sum((sales - fitted(fit))^2)
ssr
[1] 119.0451
sse
[1] 190.9549
ssr/(ssr+sse)
[1] 0.3840165
Note that in this case, with only one independent variable, this "multiple" R-squared is simply the squared correlation coefficient:

cor(sales, salary)^2
[1] 0.3840165
The adjusted R-squared additionally takes the number of estimated parameters into account.
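Concretely, the adjustment is 1 − (1 − R²)(N − 1)/(N − M − 1) for N observations and M independent variables, which reproduces the reported 0.307:

N <- 10; M <- 1
1 - (1 - 0.3840165) * (N - 1) / (N - M - 1)
[1] 0.3070186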
The F-statistic tests H0: β_i = 0 for all i > 0 against Ha: β_i ≠ 0 for at least one i > 0; in other words, we compare the small model y_i = β_0 + ε_i with the full model y_i = β_0 + β_1 x_i + ε_i.
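This model comparison can be carried out explicitly, since anova() also accepts two nested models:

fit0 <- lm(sales ~ 1)   # the small model: intercept only
anova(fit0, fit)        # reproduces F = 4.9874 and p = 0.05601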
To interpret the F-statistic and the p-value, we take a look at the ANOVA table:
anova(fit)
Analysis of Variance Table

Response: sales
          Df Sum Sq Mean Sq F value  Pr(>F)  
salary     1 119.05 119.045  4.9874 0.05601 .
Residuals  8 190.96  23.869                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

119.045 / 23.869
[1] 4.987431
The analysis of variance table provides the variation explained by the model (Mean Sq for salary) and the residual, unexplained variation (Mean Sq for Residuals). The ratio of those two is the F-value. In this case the F-value of 4.99 is too small: the associated probability is higher than α = 0.05, so we cannot reject H0.
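Equivalently, the p-value is the upper tail of the F-distribution with 1 and 8 degrees of freedom:

pf(119.045 / 23.869, df1 = 1, df2 = 8, lower.tail = FALSE)   # 0.05601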
This model does not provide a good fit: none of the estimates is significant at α = 0.05, and the R-squared and p-value of the model are not at all satisfactory.
To see why, we plot the data together with the fitted model:
png('sales-salary.png')
plot(x=salary, y=sales)
abline(fit)
dev.off()
Looking at the plot, we observe that the data points scatter widely around the fitted line; in particular, the two employees with identical salaries of 15000 have quite different sales. Salary alone does not explain the sales values well.
We have not yet used the information on how long the employees have been in the company:
years
 [1] 5 5 5 6 7 8 7 8 0 0
salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
sales
 [1] 20 17 24 19 24 24 21 29 13  9
cor(sales, years)
[1] 0.8937891
cor(salary, years)
[1] 0.8729632
Sales and years are highly correlated, so we try to fit a linear model:
summary(lm(sales ~ years))

Call:
lm(formula = sales ~ years)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8205 -2.2692 -0.5124  1.7617  4.1795 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.8479     1.8482   5.870 0.000374 ***
years         1.7945     0.3184   5.637 0.000489 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.792 on 8 degrees of freedom
Multiple R-squared:  0.7989,	Adjusted R-squared:  0.7737 
F-statistic: 31.77 on 1 and 8 DF,  p-value: 0.0004889
This model is much better than (sales ~ salary) in terms of R-squared and p-values: both estimates are highly significant, and the number of years explains a large part of the variation in sales.
To see why, we sort the data in the order of years and make another plot:
fit <- lm(sales ~ years)
ord <- order(years)
png('sales-years.png')
plot(x=years[ord], y=sales[ord])
abline(fit)
dev.off()
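As a hypothetical illustration (an employee with 10 years is a made-up case, not part of the data), the fitted model can also be used for prediction:

predict(fit, newdata = data.frame(years = 10))   # about 28.8 = 10.85 + 1.79 * 10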
So far we have only looked at pairs of data vectors; can we find more structure when we look at the data as a whole?
In other words, can we improve our model with the formula (sales ~ years + salary) ?
In this case we have M = 2 independent variables and N = 10 observations.
Generally, for M independent variables we have a matrix X with N rows and M columns:
Find β_0, β_1, ..., β_M in y_i = β_0 + β_1 X_{i1} + ... + β_M X_{iM} + ε_i such that Σ ε_i² is minimal.
summary(lm(sales ~ salary + years))

Call:
lm(formula = sales ~ salary + years)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32370 -1.48790 -0.04819  1.42972  2.52621 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.1178862  6.8732097   4.382 0.003227 ** 
salary      -0.0012532  0.0004384  -2.859 0.024384 *  
years        2.9772445  0.4739336   6.282 0.000411 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.027 on 7 degrees of freedom
Multiple R-squared:  0.9072,	Adjusted R-squared:  0.8807 
F-statistic: 34.21 on 2 and 7 DF,  p-value: 0.0002435
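The same estimates can be obtained directly from the least-squares normal equations β̂ = (XᵀX)⁻¹ Xᵀ y; a minimal sketch in matrix form (lm() itself uses a numerically more stable decomposition):

X <- cbind(1, salary, years)       # design matrix with an intercept column
solve(t(X) %*% X, t(X) %*% sales)  # reproduces the three coefficient estimates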
Further comparison reveals that the combined model improves on (sales ~ years): the R-squared rises from 0.80 to 0.91, the adjusted R-squared from 0.77 to 0.88, and all three estimates are now significant at α = 0.05. Interestingly, the coefficient for salary has become negative.