Correlation

A question related to dependence vs. independence is whether there is an association between two ratio-scaled sample variables, such as salary and sales. We can ask: How strong is that association?

The Pearson correlation coefficient is rx,y = sx,y / (sx sy), with sample covariance sx,y = 1/(N−1) Σ (xi − mx)(yi − my).

The sample standard deviation is sx = √(sx,x).

  cor(salary, sales)
[1] 0.6196906
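
This value can be reproduced from the definitions above; a minimal check, assuming salary and sales are the numeric vectors listed further below:

  # correlation as covariance divided by the product of standard deviations
  cov(salary, sales) / (sd(salary) * sd(sales))   # same value as cor(salary, sales)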

The estimated correlation of 0.62 indicates a moderate association, not a strong one.

Indeed, cor.test() shows that at α = 0.05 this value is not significantly different from zero.

  cor.test(sales, salary)

	Pearson's product-moment correlation

data:  sales and salary 
t = 2.2332, df = 8, p-value = 0.05601
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.01629263  0.89867697 
sample estimates:
      cor 
0.6196906 

The idea of the test is that we have drawn a small sample of N values from a large population. If the true correlation in the population is r = 0, then the statistic t = r / √((1 − r²) / (N − 2)) is approximately t-distributed with df = N−2, for N ≥ 6 and normally distributed underlying variables.

The corresponding p-value in the example is 0.05601; in other words, if the actual correlation in the population were zero, there would still be a rather high probability of 0.05601 of obtaining a sample correlation at least as far from zero as our estimate of r = 0.62.
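
We can reconstruct the test statistic from the reported estimate; a quick check using the values above:

  r <- 0.6196906; N <- 10
  r / sqrt((1 - r^2) / (N - 2))   # approximately 2.2332, the t-value from cor.test()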

Linear Models

Another way of studying the association between variables is to fit a linear model.

For vectors x and y: find β0 and β1 in yi = β0 + β1 xi + εi such that Σ εi² is minimal.

Recall the values for x and y:

  salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
  sales
 [1] 20 17 24 19 24 24 21 29 13  9

Fitting the model in R uses the symbol "~" in the formula:

  fit <- lm(sales ~ salary)
  fit

Call:
lm(formula = sales ~ salary)

Coefficients:
(Intercept)       salary  
  -3.236122     0.001151  
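
For a simple linear regression these estimates have a closed form, which we can verify directly; a sketch using the vectors above:

  # gradient: covariance of x and y over the variance of x
  b1 <- cov(salary, sales) / var(salary)   # approximately 0.001151
  # intercept: the regression line passes through the point of means
  b0 <- mean(sales) - b1 * mean(salary)    # approximately -3.236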

Printing the fitted model shows only the estimates for the intercept and the gradient, in this case the coefficient of salary. The summary() function provides much more information:

  summary(fit)

Call:
lm(formula = sales ~ salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1530 -4.1817 -0.6888  3.8170  7.0161 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -3.2361224 10.5187490  -0.308    0.766  
salary       0.0011509  0.0005153   2.233    0.056 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 4.886 on 8 degrees of freedom
Multiple R-squared: 0.384,	Adjusted R-squared: 0.307 
F-statistic: 4.987 on 1 and 8 DF,  p-value: 0.05601 

The first data column indicates the estimates for β0 and β1.

Dividing the estimates by their standard errors yields the t-values, e.g. for salary:

  0.0011509 / 0.0005153
[1] 2.233456

Pr(>|t|) is the probability of obtaining an estimate with at least that absolute t-value if the actual coefficient is zero. Since all probabilities are higher than 0.05, none of the estimates is significant at α = 0.05.
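
This probability can be recomputed from the t-distribution with df = 8; a quick check using the t-value calculated above:

  2 * pt(-2.233456, df = 8)   # two-sided tail probability, approximately 0.056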

The actual values yi are:

  sales
 [1] 20 17 24 19 24 24 21 29 13  9

To get an idea of how well the model performs for each data point, we can look at the fitted values and the residuals:

  fitted(fit)
       1        2        3        4        5        6        7        8 
20.35078 22.15303 18.01338 23.61923 19.38751 22.56964 23.86897 21.98385 
       9       10 
14.02680 14.02680 
  resid(fit)
         1          2          3          4          5          6          7 
-0.3507826 -5.1530316  5.9866170 -4.6192290  4.6124885  1.4303566 -2.8689659 
         8          9         10 
 7.0161451 -1.0267990 -5.0267990 

The residual standard error is defined as √(SSE/(N−2)), since we estimate the two parameters β0 and β1, with
SSE = Σ (yi − ŷi)²

    sqrt(sum(resid(fit)^2)/(length(sales)-2))
[1] 4.885628

If the residuals are approximately normally distributed, about 2/3 of them lie within ±4.9 of zero.
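
We can check this rule of thumb against the actual residuals; a small check, reusing resid(fit) from above:

  # fraction of residuals within one residual standard error of zero
  mean(abs(resid(fit)) <= 4.886)   # 0.6 here, close to 2/3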

Multiple R-squared is the squared correlation coefficient; it indicates the ratio of explained variation to total variation (measured from the mean of the yi values). A ratio close to 1 indicates a good fit.

Note that for each data point the deviation from the mean is composed of ŷi − my (explained by the model) and yi − ŷi (the residual).

This leads to the definition of R² as the fraction of explained and total variation:

SSR = Σ (ŷi − my)²
SSE = Σ (yi − ŷi)²
SST = SSR + SSE
R² = SSR / SST

Calculating these values confirms the model result:

  ssr = sum((fitted(fit) - mean(sales))^2)
  sse = sum((sales - fitted(fit))^2)
  ssr
[1] 119.0451
  sse
[1] 190.9549
  ssr/(ssr+sse)
[1] 0.3840165

Note that in this case the R-squared is not actually "multiple"; with a single independent variable it is simply the squared correlation coefficient:

    cor(sales, salary)^2
[1] 0.3840165

The Adjusted R-squared takes the number of parameters into account; for N observations and M independent variables it is defined as 1 − (1 − R²)(N − 1)/(N − M − 1).
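
A quick check of this formula with N = 10 observations and M = 1 independent variable:

  1 - (1 - 0.3840165) * (10 - 1) / (10 - 1 - 1)   # approximately 0.307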

The F-statistic tests H0: βi = 0 for all i > 0 against Ha: βi ≠ 0 for at least one i > 0.

In other words, we compare the small model Yi = β0 + εi with the full model Yi = β0 + β1 Xi + εi.
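
This comparison can be carried out explicitly in R; a sketch, where sales ~ 1 fits the intercept-only model:

  anova(lm(sales ~ 1), fit)   # reports the same F-value and p-value as the table below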

To interpret the F-statistic and the p-value, we take a look at the ANOVA table:

  anova(fit)
Analysis of Variance Table

Response: sales
          Df Sum Sq Mean Sq F value  Pr(>F)  
salary     1 119.05 119.045  4.9874 0.05601 .
Residuals  8 190.96  23.869                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
  119.045 / 23.869
[1] 4.987431

The Analysis of Variance Table provides data on the variance explained by the model, here under Mean Sq for salary, and the residual or unexplained variance, Mean Sq for Residuals. The ratio of these two is the F-value. In this case the F-value is too small: the associated probability is higher than α = 0.05, and we cannot reject H0.
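
The p-value can be recomputed from the F-distribution; note that with a single independent variable the F-value is just the square of the t-value for that coefficient (2.2332² ≈ 4.987). A quick check:

  1 - pf(4.987431, df1 = 1, df2 = 8)   # approximately 0.05601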

This model does not provide a good fit. None of the estimates is significant at α = 0.05, and neither the R-squared nor the p-value of the model is satisfactory.

To see why, we plot the data together with the fitted model:

  png('sales-salary.png')
  plot(x=salary, y=sales)
  abline(fit)
  dev.off()

Looking at the plot, we observe that the points scatter widely around the fitted line: salary alone does not explain the sales figures well.

We have not yet used the information on how long the employees have been in the company:

  years
 [1] 5 5 5 6 7 8 7 8 0 0
  salary
 [1] 20495 22061 18464 23335 19658 22423 23552 21914 15000 15000
  sales
 [1] 20 17 24 19 24 24 21 29 13  9

  cor(sales, years)
[1] 0.8937891
  cor(salary, years)
[1] 0.8729632

Sales and years are highly correlated, so we try to fit a linear model:

  summary(lm(sales ~ years))

Call:
lm(formula = sales ~ years)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8205 -2.2692 -0.5124  1.7617  4.1795 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.8479     1.8482   5.870 0.000374 ***
years         1.7945     0.3184   5.637 0.000489 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.792 on 8 degrees of freedom
Multiple R-squared: 0.7989,	Adjusted R-squared: 0.7737 
F-statistic: 31.77 on 1 and 8 DF,  p-value: 0.0004889 

This model is much better than (sales ~ salary) in terms of R-squared and p-values. With both estimates highly significant, it is entirely acceptable: the number of years clearly explains the sales values.

To see this, we sort the data in the order of years and make another plot:

  fit <- lm(sales ~ years)
  ord <- order(years)
  png('sales-years.png')
  plot(x=years[ord], y=sales[ord])
  abline(fit)
  dev.off()

So far we have only looked at pairs of data vectors; can we find more structure when we look at the data as a whole?

In other words, can we improve our model with the formula (sales ~ years + salary) ?

In this case we have M = 2 independent variables and N = 10 observations.

Generally, for M independent variables arranged in a matrix X with N rows and M columns:

Find β0, β1, ..., βM in yi = β0 + β1 Xi1 + ... + βM XiM + εi such that Σ εi² is minimal.
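
In matrix form the least-squares solution is β = (XᵀX)⁻¹ Xᵀ y, which we can compute directly; a sketch for illustration only, since lm() solves the same problem more robustly via a QR decomposition:

  X <- cbind(1, salary, years)        # design matrix with a column of ones for the intercept
  solve(t(X) %*% X, t(X) %*% sales)   # normal equations; matches the coefficients below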

  summary(lm(sales ~ salary + years))

Call:
lm(formula = sales ~ salary + years)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32370 -1.48790 -0.04819  1.42972  2.52621 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.1178862  6.8732097   4.382 0.003227 ** 
salary      -0.0012532  0.0004384  -2.859 0.024384 *  
years        2.9772445  0.4739336   6.282 0.000411 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.027 on 7 degrees of freedom
Multiple R-squared: 0.9072,	Adjusted R-squared: 0.8807 
F-statistic: 34.21 on 2 and 7 DF,  p-value: 0.0002435 

Further comparison reveals: