Correlation

A question related to dependence vs. independence is whether there is an association between two ratio-scaled sample variables, such as salary and sales. We can ask: How strong is that association?

The Pearson correlation coefficient is rx,y = sx,y / (sx sy), with sample covariance sx,y = 1/(N−1) Σ (xi − mx)(yi − my).

The sample standard deviation is sx = √(sx,x).

  cor(salary, sales)
[1] 0.6196906
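
This value can be reproduced from the definitions above; a minimal check, assuming salary and sales are the numeric vectors listed further below:

  # correlation as covariance divided by the product of standard deviations
  cov(salary, sales) / (sd(salary) * sd(sales))   # same value as cor(salary, sales)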

The estimated correlation of 0.62 indicates a moderate association, not a strong one.

Indeed, cor.test() shows that at α = 0.05 this value is not significantly different from zero.

  cor.test(sales, salary)

	Pearson's product-moment correlation

data:  sales and salary 
t = 2.2332, df = 8, p-value = 0.05601
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.01629263  0.89867697 
sample estimates:
      cor 
0.6196906 

The idea of the test is that we have drawn a small sample of N values from a large population. If the true correlation in the population is r = 0, then the statistic t = r / √((1 − r²) / (N − 2)) is approximately t-distributed with df = N−2, for N ≥ 6 and normally distributed underlying variables.

The corresponding p-value in the example is 0.05601; in other words, if the actual correlation in the population were zero, there would still be a rather high probability of 0.05601 of obtaining a sample correlation at least as far from zero as our estimate of r = 0.62.
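
We can reconstruct the test statistic from the reported estimate; a quick check using the values above:

  r <- 0.6196906; N <- 10
  r / sqrt((1 - r^2) / (N - 2))   # approximately 2.2332, the t-value from cor.test()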

Linear Models

Another way of studying the association between variables is to fit a linear model.

For vectors x and y: find β0 and β1 in yi = β0 + β1 xi + εi such that Σ εi² is minimal.

Recall the values for x and y:

  salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
  sales
 [1] 20 17 24 19 24 24 21 29 13  9

Fitting the model in R uses the symbol "~" in the formula:

  fit <- lm(sales ~ salary)
  fit

Call:
lm(formula = sales ~ salary)

Coefficients:
(Intercept)       salary  
  -3.236122     0.001151  
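
For a simple linear regression these estimates have a closed form, which we can verify directly; a sketch using the vectors above:

  # gradient: covariance of x and y over the variance of x
  b1 <- cov(salary, sales) / var(salary)   # approximately 0.001151
  # intercept: the regression line passes through the point of means
  b0 <- mean(sales) - b1 * mean(salary)    # approximately -3.236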

Printing the fitted model shows only the estimates for the intercept and the gradient, in this case the coefficient of salary. The summary() function provides much more information:

  summary(fit)

Call:
lm(formula = sales ~ salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1530 -4.1817 -0.6888  3.8170  7.0161 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -3.2361224 10.5187490  -0.308    0.766  
salary       0.0011509  0.0005153   2.233    0.056 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 4.886 on 8 degrees of freedom
Multiple R-squared: 0.384,	Adjusted R-squared: 0.307 
F-statistic: 4.987 on 1 and 8 DF,  p-value: 0.05601 

The first data column indicates the estimates for β0 and β1.

Dividing the estimates by their standard errors yields the t-values, e.g. for salary:

  0.0011509 / 0.0005153
[1] 2.233456

Pr(>|t|) is the probability of obtaining an estimate with at least that absolute t-value if the actual coefficient is zero. Since all probabilities are higher than 0.05, none of the estimates is significant at α = 0.05.
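
This probability can be recomputed from the t-distribution with df = 8; a quick check using the t-value calculated above:

  2 * pt(-2.233456, df = 8)   # two-sided tail probability, approximately 0.056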

The actual values yi are:

  sales
 [1] 20 17 24 19 24 24 21 29 13  9

To get an idea of how well the model performs for each data point, we can look at the fitted values and the residuals:

  fitted(fit)
       1        2        3        4        5        6        7        8 
20.35078 22.15303 18.01338 23.61923 19.38751 22.56964 23.86897 21.98385 
       9       10 
14.02680 14.02680 
  resid(fit)
         1          2          3          4          5          6          7 
-0.3507826 -5.1530316  5.9866170 -4.6192290  4.6124885  1.4303566 -2.8689659 
         8          9         10 
 7.0161451 -1.0267990 -5.0267990 

The residual standard error is defined as √(SSE/(N−2)), since we estimate the two parameters β0 and β1, with
SSE = Σ (yi − ŷi)²

    sqrt(sum(resid(fit)^2)/(length(sales)-2))
[1] 4.885628

If the residuals are approximately normally distributed, about 2/3 of them lie within ±4.9 of zero.
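
We can check this rule of thumb against the actual residuals; a small check, reusing resid(fit) from above:

  # fraction of residuals within one residual standard error of zero
  mean(abs(resid(fit)) <= 4.886)   # 0.6 here, close to 2/3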

Multiple R-squared is the squared correlation coefficient; it indicates the ratio of explained variation to total variation (measured from the mean of the yi values). A ratio close to 1 indicates a good fit.

Note that for each data point the deviation from the mean is composed of ŷi − my (explained by the model) and yi − ŷi (the residual).

This leads to the definition of R² as the fraction of explained and total variation:

SSR = Σ (ŷi − my)²
SSE = Σ (yi − ŷi)²
SST = SSR + SSE
R² = SSR / SST

Calculating these values confirms the model result:

  ssr = sum((fitted(fit) - mean(sales))^2)
  sse = sum((sales - fitted(fit))^2)
  ssr
[1] 119.0451
  sse
[1] 190.9549
  ssr/(ssr+sse)
[1] 0.3840165

Note that in this case the R-squared is not actually "multiple"; with a single independent variable it is simply the squared correlation coefficient:

    cor(sales, salary)^2
[1] 0.3840165

The Adjusted R-squared takes the number of parameters into account; for N observations and M independent variables it is defined as 1 − (1 − R²)(N − 1)/(N − M − 1).
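
A quick check of this formula with N = 10 observations and M = 1 independent variable:

  1 - (1 - 0.3840165) * (10 - 1) / (10 - 1 - 1)   # approximately 0.307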

The F-statistic tests H0: βi = 0 for all i > 0 against Ha: βi ≠ 0 for at least one i > 0.

In other words, we compare the small model Yi = β0 + εi with the full model Yi = β0 + β1 Xi + εi.
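
This comparison can be carried out explicitly in R; a sketch, where sales ~ 1 fits the intercept-only model:

  anova(lm(sales ~ 1), fit)   # reports the same F-value and p-value as the table below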

To interpret the F-statistic and the p-value, we take a look at the ANOVA table:

  anova(fit)
Analysis of Variance Table

Response: sales
          Df Sum Sq Mean Sq F value  Pr(>F)  
salary     1 119.05 119.045  4.9874 0.05601 .
Residuals  8 190.96  23.869                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
  119.045 / 23.869
[1] 4.987431

The Analysis of Variance Table provides data on the variance explained by the model, here under Mean Sq for salary, and the residual or unexplained variance, Mean Sq for Residuals. The ratio of these two is the F-value. In this case the F-value is too small: the associated probability is higher than α = 0.05, and we cannot reject H0.
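
The p-value can be recomputed from the F-distribution; note that with a single independent variable the F-value is just the square of the t-value for that coefficient (2.2332² ≈ 4.987). A quick check:

  1 - pf(4.987431, df1 = 1, df2 = 8)   # approximately 0.05601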

This model does not provide a good fit. None of the estimates is significant at α = 0.05, and neither the R-squared nor the p-value of the model is satisfactory.

To see why, we plot the data together with the fitted model:

  png('sales-salary.png')
  plot(x=salary, y=sales)
  abline(fit)
  dev.off()

Looking at the plot, we observe that the points scatter widely around the fitted line: salary alone does not explain the sales figures well.

We have not yet used the information on how long the employees have been in the company:

  years
 [1] 5 5 5 6 7 8 7 8 0 0
  salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
  sales
 [1] 20 17 24 19 24 24 21 29 13  9

  cor(sales, years)
[1] 0.8937891
  cor(salary, years)
[1] 0.8729632

Sales and years are highly correlated, so we try to fit a linear model:

  summary(lm(sales ~ years))

Call:
lm(formula = sales ~ years)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8205 -2.2692 -0.5124  1.7617  4.1795 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.8479     1.8482   5.870 0.000374 ***
years         1.7945     0.3184   5.637 0.000489 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.792 on 8 degrees of freedom
Multiple R-squared: 0.7989,	Adjusted R-squared: 0.7737 
F-statistic: 31.77 on 1 and 8 DF,  p-value: 0.0004889 

This model is much better than (sales ~ salary) in terms of R-squared and p-values. With both estimates highly significant, it is entirely acceptable: the number of years clearly explains the sales values.

To see this, we sort the data in the order of years and make another plot:

  fit <- lm(sales ~ years)
  ord <- order(years)
  png('sales-years.png')
  plot(x=years[ord], y=sales[ord])
  abline(fit)
  dev.off()

So far we have only looked at pairs of data vectors; can we find more structure when we look at the data as a whole?

In other words, can we improve our model with the formula (sales ~ years + salary) ?

In this case we have M = 2 independent variables and N = 10 observations.

Generally, for M independent variables arranged in a matrix X with N rows and M columns:

Find β0, β1, ..., βM in yi = β0 + β1 Xi1 + ... + βM XiM + εi such that Σ εi² is minimal.
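
In matrix form the least-squares solution is β = (XᵀX)⁻¹ Xᵀ y, which we can compute directly; a sketch for illustration only, since lm() solves the same problem more robustly via a QR decomposition:

  X <- cbind(1, salary, years)        # design matrix with a column of ones for the intercept
  solve(t(X) %*% X, t(X) %*% sales)   # normal equations; matches the coefficients below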

  summary(lm(sales ~ salary + years))

Call:
lm(formula = sales ~ salary + years)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32370 -1.48790 -0.04819  1.42972  2.52621 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.1178862  6.8732097   4.382 0.003227 ** 
salary      -0.0012532  0.0004384  -2.859 0.024384 *  
years        2.9772445  0.4739336   6.282 0.000411 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.027 on 7 degrees of freedom
Multiple R-squared: 0.9072,	Adjusted R-squared: 0.8807 
F-statistic: 34.21 on 2 and 7 DF,  p-value: 0.0002435 

Further comparison reveals: