Correlation and Linear Models

A question related to dependence vs. independence is whether there is an association between two rationally scaled sample variables, such as salary and sales.

Correlation and linear models as discussed here are only suited for rationally scaled variables. They must not be applied to other types of data, such as ordinally scaled values, e.g. review ratings from 1 to 5 points. While the computations will usually run without error, the results are meaningless because the interpretation assumes a rational scale: a rating of 4 points does not mean twice as good as 2 points.

Correlation

We can ask: How strong is the association between two variables $x$ and $y$?

The Pearson correlation coefficient is

$$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$$

with sample covariance

$$s_{x,y} = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{N-1}$$

and standard deviation

$$s_x = \sqrt{s_{x,x}}$$
In [196]:
%precision 4
%matplotlib inline
%reload_ext rpy2.ipython

Assume that we are interested in the effect of the amount of money we pay the salesforce.

  • According to the available data, do people who earn more actually sell more?

Our values for x and y are pay and sales. Those values are on a rational scale.

In [197]:
%%R
pay = c(2050, 2200, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500)
sales = c(20, 17, 24, 19, 24, 24, 21, 29, 13,  9)
In [198]:
pay = (2050, 2200, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500)
sales = (20, 17, 24, 19, 24, 24, 21, 29, 13,  9)

Since the values are rationally scaled, the Pearson correlation coefficient can be computed:

In [199]:
%%R
cor.test(pay, sales)
	Pearson's product-moment correlation

data:  pay and sales
t = 2.2547, df = 8, p-value = 0.05417
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01033849  0.89981652
sample estimates:
      cor 
0.6233454 

In [200]:
import numpy as np
from scipy import stats

np.set_printoptions(precision = 4, suppress = True)
r, p = stats.pearsonr(pay, sales)
r, p
Out[200]:
(0.6233, 0.0542)
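
As a quick cross-check, $r$ can also be recomputed directly from the definitions of the sample covariance and standard deviation given above; a minimal sketch:

In [ ]:
import numpy as np

# Pearson r from the definitions: r = s_xy / (s_x * s_y)
x, y = np.array(pay, dtype = float), np.array(sales, dtype = float)
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)   # sample covariance
s_xy / (x.std(ddof = 1) * y.std(ddof = 1))                      # should reproduce r = 0.6233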

The estimated correlation of $r$ = 0.6233 is not strong. As a very rough guideline, correlations with $|r| <$ 0.7 are not considered strong.

A correlation test is also performed, and it shows that even at the generous significance level of $\alpha$ = 0.05 the correlation is not significant.

The idea of the test is that we have drawn a small sample of $N$ values from a large population. If the true correlation in the population is zero, then the statistic

$$ t = \frac{r}{ \sqrt{ \frac{(1-r^2)}{ (N-2) } }} $$

is approximately t-distributed with $df=N-2$, for $N\ge 6$ and normally distributed underlying variables.

The p-value in the example is higher than $\alpha$ = 0.05; in other words, there is a rather high probability that with our small sample we arrive at an estimate at least as far from zero as $r$ = 0.6233 when the actual correlation in the population is zero.
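
The test statistic and p-value above can be reproduced from $r$ and $N$ with scipy; a sketch (two-sided test):

In [ ]:
import numpy as np
from scipy import stats

# t-statistic for H0: correlation = 0, and the corresponding two-sided p-value
N = len(pay)
t_val = r / np.sqrt((1 - r**2) / (N - 2))      # should reproduce t = 2.2547
2 * stats.t.sf(abs(t_val), N - 2)              # should reproduce p = 0.0542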

Linear Models

Another way of studying the association between variables is to fit a linear model. We are using the ordinary least squares method (OLS):

Find $\beta_0, \beta_1$ in $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ such that $\sum \epsilon_i^2$ is minimal
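
For a single independent variable the least squares solution has a well-known closed form, $\beta_1 = s_{x,y}/s_{x,x}$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$; a minimal sketch (the values should match the estimates reported below):

In [ ]:
import numpy as np

# Closed-form OLS estimates for the simple linear model
x, y = np.array(pay, dtype = float), np.array(sales, dtype = float)
beta1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
beta0 = y.mean() - beta1 * x.mean()
beta0, beta1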

In [201]:
%%R
summary(lm(sales ~ pay))
Call:
lm(formula = sales ~ pay)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1004 -4.1960 -0.6685  3.7853  7.0157 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -3.429239  10.504489  -0.326   0.7525  
pay          0.011604   0.005147   2.255   0.0542 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.868 on 8 degrees of freedom
Multiple R-squared:  0.3886,	Adjusted R-squared:  0.3121 
F-statistic: 5.084 on 1 and 8 DF,  p-value: 0.05417

For fitting the model with the Python module statsmodels we need to construct the design matrix, in this case a matrix containing 1's in the first column and the x values in the second column. The patsy module helps with this; note that its data parameter expects a dictionary with the values of the independent variables.

In [217]:
import statsmodels.api as m, patsy

X = patsy.dmatrix("pay", data = { "pay": pay })
res = m.OLS(sales, X).fit()
print res.params, res.pvalues, res.rsquared, res.f_pvalue
 [-3.4292  0.0116] [ 0.7525  0.0542] 0.388559492722 0.0541650896641
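
If patsy is not available, statsmodels can build the same design matrix with add_constant; a sketch:

In [ ]:
import numpy as np
import statsmodels.api as m

# Same design matrix without patsy: a column of 1's plus the pay values
X2 = m.add_constant(np.array(pay, dtype = float))
m.OLS(np.array(sales, dtype = float), X2).fit().params   # should match res.params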

The results of the linear model estimation provide

  • estimates for the parameters, in this case $\beta_0$ and $\beta_1$
  • the p-values for the parameter estimates
  • the R-squared value for the variance explained
  • the F-statistic and the corresponding p-value for testing against

    $y_i = \beta_0 + \epsilon_i$

    which means estimating all $y_i$ with their mean value $\beta_0 = \bar{y}$

Parameter Estimates

In this case the parameters are $\beta_0$ for the intercept, and $\beta_1$ for salary.

The ordinary least squares estimates for the parameters $\beta_0$ and $\beta_1$ are given in the params attribute of the result:

In [203]:
print res.params
[-3.4292  0.0116]
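
To read these coefficients: for example, an employee with a pay of 2000 would be predicted to sell about $-3.429 + 0.0116 \cdot 2000 \approx 19.8$ units.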

How are the p-values for the parameters derived? The parameter estimates divided by their standard errors result in t-values:

In [204]:
print res.bse
print res.tvalues
[ 10.5045   0.0051]
[-0.3265  2.2547]

The t-distribution provides the p-values corresponding to the t-values.

In [205]:
print res.pvalues
[ 0.7525  0.0542]
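
These values can be reproduced from the t-values and the t-distribution with $df = N - 2 = 8$; a sketch (two-sided):

In [ ]:
from scipy import stats

# two-sided p-values from the t-values, df = N - 2
2 * stats.t.sf(abs(res.tvalues), len(pay) - 2)   # should match res.pvalues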

A p-value in this case is the probability of obtaining an estimate with at least that absolute t-value if the actual coefficient is zero. Since all p-values are higher than 0.05, none of the estimates is significant at $\alpha$ = 0.05.

R-squared

The p-values for the parameters provide information on the individual parameter estimates. The R-squared value concerns the whole model. It indicates the ratio of explained variation and total variation around the mean of the $y_i$ values; a ratio close to 1 indicates a good fit.

In [206]:
print res.rsquared
0.388559492722

Note that for each data point the deviation from the mean decomposes into an explained part and a residual part: $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$

This leads to the definition of $R^2$ as the fraction of explained and total variation around the mean:

Regression sum of squares $SSR = \sum (\hat{y}_i - \bar{y})^2$

Error sum of squares $SSE = \sum (y_i - \hat{y}_i)^2$

$R^2 = \frac{SSR}{SSR + SSE}$

The $R^2$ value in this example is very low, indicating that the model does not provide a good fit.

It is instructive to look at the fitted values of the model i.e. the values $\hat{y}_i$ that the model calculates for each $x_i$ value:

In [207]:
print res.fittedvalues
[ 20.3597  22.1004  18.0389  23.609   19.4314  22.5646  23.9571  21.9843
  13.9773  13.9773]
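
With the fitted values at hand, $R^2$ can be recomputed from the two sums of squares defined above; a sketch:

In [ ]:
import numpy as np

# R-squared from the regression and error sums of squares
y = np.array(sales, dtype = float)
ssr = ((res.fittedvalues - y.mean())**2).sum()
sse = ((y - res.fittedvalues)**2).sum()
ssr / (ssr + sse)                                # should match res.rsquared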

By plotting the fitted values and the actual values against the independent variable we gain an understanding of the shortcomings of this model:

In [208]:
import matplotlib.pyplot as plt

plt.figure()
plt.plot(pay, sales, 'o')
plt.plot(pay, res.fittedvalues, '-')
plt.title("Sales by Pay, Fitted and Actual Values")
Out[208]:
<matplotlib.text.Text at 0x7f3ef9ddd590>

We observe that

  • there is not much structure in the data
  • the fitted line does not look convincing

F-statistic

The F-statistic is another important indicator for the model; it aims at testing

$H_0$: all $ \beta_{i>0} = 0 $

against

$H_a$: at least one of $ \beta_{i>0} $ is non-zero

In this example, we compare the small model $ y_i = \beta_0 + \epsilon_i $ with the full model $ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $

From the F-value we can compute a corresponding p-value, which is again (not surprisingly in this case) not significant at $ \alpha = 0.05 $.

In [209]:
print res.fvalue
print res.f_pvalue
5.08385673631
0.0541650896641
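
The F-value itself follows from $R^2$: with $p$ predictors and $N$ observations, $F = \frac{R^2 / p}{(1 - R^2)/(N - p - 1)}$. A sketch for $p = 1$:

In [ ]:
from scipy import stats

# F-statistic from R-squared, with 1 predictor and N - 2 residual degrees of freedom
F = res.rsquared / ((1 - res.rsquared) / (len(pay) - 2))
F, stats.f.sf(F, 1, len(pay) - 2)                # should match the values above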

Other Modeling Choices

Assume that we also have the data on how long the employees have been in the company:

In [210]:
years = (5, 5, 5, 6, 7, 8, 7, 8, 0, 0)
print stats.pearsonr(sales, years)
(0.89378913203421329, 0.00048887601712256981)

Sales and years are highly correlated; therefore we add the variable years to our model, so that we are now estimating

$$ y_i = \beta_0 + \beta_1 a_i + \beta_2 b_i + \epsilon_i $$

where $a_i$ and $b_i$ are the values for salary and years, respectively.

In [211]:
%%R
years <- c(5, 5, 5, 6, 7, 8, 7, 8, 0, 0)
summary(lm(sales ~ pay + years))
Call:
lm(formula = sales ~ pay + years)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32297 -1.50684 -0.05361  1.43090  2.49140 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.352541   6.945891   4.370  0.00327 ** 
pay         -0.012686   0.004433  -2.862  0.02427 *  
years        2.992403   0.478080   6.259  0.00042 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.026 on 7 degrees of freedom
Multiple R-squared:  0.9073,	Adjusted R-squared:  0.8808 
F-statistic: 34.26 on 2 and 7 DF,  p-value: 0.0002424
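
The Python cell below uses a helper function lm that is presumably defined earlier in the notebook (its definition is not shown in this section). Under that assumption, a minimal sketch of such a helper, fitting the model and printing the same summary lines, could look like this:

In [ ]:
import statsmodels.api as m

# Hypothetical helper: fit an OLS model, print a short summary, return the results
def lm(y, X):
    res = m.OLS(y, X).fit()
    print "params:", res.params
    print "p-values:", res.pvalues
    print "R-squared:", res.rsquared
    print "F-statistic p-value:", res.f_pvalue
    return res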

In [212]:
res = lm(sales, patsy.dmatrix("pay + years", data = { "pay": pay, "years": years } ))
params: [ 30.3525  -0.0127   2.9924]
p-values: [ 0.0033  0.0243  0.0004]
R-squared: 0.907312933202
F-statistic p-value: 0.000242419183969

The resulting model is quite different from the previous one:

  • The p-values indicate that all parameters are significant at $ \alpha = 0.05 $
  • The $R^2$ value is very high, indicating a good fit.
  • The F-statistic also indicates a good model.
  • However, the estimate for salary is now negative.

Remember that the correlation of sales and salary was positive, yet in the linear model a better fit in terms of least squares is achieved with a negative parameter estimate. This clearly shows that the output from linear modeling has to be interpreted with great care. Does it make sense that people will sell more if they are paid less?

Compare these results with another model using only years as the independent variable:

In [213]:
%%R
summary(lm(sales ~ years))
Call:
lm(formula = sales ~ years)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8205 -2.2692 -0.5124  1.7617  4.1795 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.8479     1.8482   5.870 0.000374 ***
years         1.7945     0.3184   5.637 0.000489 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.792 on 8 degrees of freedom
Multiple R-squared:  0.7989,	Adjusted R-squared:  0.7737 
F-statistic: 31.77 on 1 and 8 DF,  p-value: 0.0004889

In [214]:
res = lm(sales, patsy.dmatrix("years", data = { "years": years } ))
params: [ 10.8479   1.7945]
p-values: [ 0.0004  0.0005]
R-squared: 0.798859012542
F-statistic p-value: 0.000488876017123

The $R^2$ value has gone down, but both parameter estimates are highly significant, and there is a meaningful interpretation: people who have been in the salesforce longer tend to sell more. The plot of estimated and actual values shows a moderately adequate fit.

In [215]:
plt.figure()
plt.plot(years, sales, 'o')
plt.plot(years, res.fittedvalues, '-')
plt.title("Sales by Years, Fitted and Actual Values")
Out[215]:
<matplotlib.text.Text at 0x7f3ef9d1e810>

There are many choices in statistical modeling. It is up to the good judgement of the modeler to select the most useful model.