A question related to dependence vs independence is whether there is an association between two rationally scaled sample variables, such as salary and sales.
Correlation and linear models as discussed here are only suited for rationally scaled variables. They must not be applied to other types of data, such as ordinally scaled values, e.g. ratings from 1 to 5 in reviews. While the methods will run in most cases, the results are meaningless, since the interpretation based on a rational scale is wrong: 4 points in a review does not mean twice as good as 2 points.
We can ask: How strong is the association between two variables $x$ and $y$?
The Pearson correlation coefficient is
$$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$$with sample covariance
$$s_{x,y} = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{N-1}$$with sample means $\bar{x}, \bar{y}$, and standard deviation
$$s_x = \sqrt{s_{x,x}}$$
%precision 4
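These formulas can be verified directly with numpy; a quick sketch, using the pay and sales data from this section:

```python
import numpy as np

# the sample data used throughout this section
pay = np.array([2050, 2200, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500])
sales = np.array([20, 17, 24, 19, 24, 24, 21, 29, 13, 9])

N = len(pay)
# sample covariance with N-1 in the denominator
s_xy = ((pay - pay.mean()) * (sales - sales.mean())).sum() / (N - 1)
# sample standard deviations s_x = sqrt(s_xx), s_y = sqrt(s_yy)
s_x = np.sqrt(((pay - pay.mean()) ** 2).sum() / (N - 1))
s_y = np.sqrt(((sales - sales.mean()) ** 2).sum() / (N - 1))
r = s_xy / (s_x * s_y)
print(round(r, 4))   # ≈ 0.6233, same as np.corrcoef(pay, sales)[0, 1]
```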
%matplotlib inline
%reload_ext rpy2.ipython
Assume that we are interested in the effect of the amount of money we pay the salesforce.
Our values for x and y are pay and sales. Those values are on a rational scale.
%%R
pay = c(2050, 2200, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500)
sales = c(20, 17, 24, 19, 24, 24, 21, 29, 13, 9)
pay = [2050, 2200, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500]
sales = [20, 17, 24, 19, 24, 24, 21, 29, 13, 9]
Since the values are rationally scaled the Pearson correlation coefficient can be computed:
%%R
cor.test(pay, sales)
from scipy import stats
import numpy as np
np.set_printoptions(precision = 4, suppress = True)
r, p = stats.pearsonr(pay, sales)
r, p
The estimated correlation of $r$ = 0.6233 is not strong. As a very rough guideline, correlations with $|r| <$ 0.7 are not considered strong.
A correlation test is also performed, and it shows that even at the generous significance level of $\alpha$ = 0.05 the correlation is not significant.
The idea of the test is that we have drawn a small sample of $N$ values from a large population. If the true correlation in the population is zero, then the statistic
$$ t = \frac{r}{ \sqrt{ \frac{(1-r^2)}{ (N-2) } }} $$is approximately t-distributed with $df=N-2$, for $N\ge 6$ and normally distributed underlying variables.
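The test statistic and its p-value can be computed by hand from $r$ and $N$; a quick sketch, assuming the values from this example:

```python
import numpy as np
from scipy import stats

r, N = 0.6233, 10
# t statistic for testing zero correlation
t = r / np.sqrt((1 - r**2) / (N - 2))
# two-sided p-value from the t-distribution with N-2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=N - 2)
print(round(t, 4), round(p, 4))
```

The p-value comes out just above 0.05, matching the result of `cor.test` and `stats.pearsonr` above.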
The p-value in the example is higher than $\alpha$ = 0.05; in other words, there is a rather high probability that with our small sample we arrive at an estimate of $r$ = 0.6233 or higher when the actual correlation in the population is zero.
Another way of studying the association between variables is to fit a linear model. We are using the ordinary least squares method (OLS):
Find $\beta_0, \beta_1$ in $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ such that $\sum \epsilon_i^2$ is minimal
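For simple linear regression this minimization has a closed-form solution, $\beta_1 = s_{x,y} / s_{x,x}$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$; a quick check with numpy, using the data of this example:

```python
import numpy as np

pay = np.array([2050, 2200, 1850, 2330, 1970, 2240, 2360, 2190, 1500, 1500])
sales = np.array([20, 17, 24, 19, 24, 24, 21, 29, 13, 9])

# closed-form OLS estimates for simple linear regression:
# beta1 = s_xy / s_xx,  beta0 = mean(y) - beta1 * mean(x)
beta1 = np.cov(pay, sales, ddof=1)[0, 1] / np.var(pay, ddof=1)
beta0 = sales.mean() - beta1 * pay.mean()
print(beta0, beta1)
```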
%%R
summary(lm(sales ~ pay))
For fitting the model with the Python module statsmodels we need to construct the design matrix, in this case a matrix containing 1's in the first column and the x values in the second column. The patsy module helps with this; its `data` parameter expects a dictionary with the values of the independent variables.
import statsmodels.api as m, patsy
X = patsy.dmatrix("pay", data = { "pay": pay })
res = m.OLS(sales, X).fit()
print(res.params, res.pvalues, res.rsquared, res.f_pvalue)
The results of the linear model estimation provide the F-statistic and the corresponding p-value for testing against the reduced model
$$y_i = \beta_0 + \epsilon_i$$
which amounts to estimating all $y_i$ with their mean value $\beta_0 = \bar{y}$.
Parameter Estimates
In this case the parameters are $\beta_0$ for the intercept, and $\beta_1$ for salary.
The ordinary least squares estimates for the parameters $\beta_0$ and $\beta_1$ are given in the `params` attribute of the result:
print(res.params)
How are the p-values for the parameters derived? The parameter estimates divided by their standard errors result in t-values:
print(res.bse)
print(res.tvalues)
The t-distribution provides the p-values corresponding to the t-values.
print(res.pvalues)
A p-value in this case is the probability of obtaining an estimate with at least that t-value if the actual coefficient is zero. Since all p-values are higher than 0.05, none of the estimates is significant at $\alpha$ = 0.05.
R-squared
The p-values for the parameters provide information on the individual parameter estimates. The R-squared value concerns the whole model. It indicates the ratio of explained variation and total variation around the mean of the $y_i$ values; a ratio close to 1 indicates a good fit.
print(res.rsquared)
Note that for each data point the deviation from the mean decomposes as $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$.
This leads to the definition of $R^2$ as the fraction of explained variation in the total variation around the mean:
Regression sum of squares $SSR = \Sigma (\hat{y}_i - \bar{y})^2$
Error sum of squares $SSE = \Sigma (y_i - \hat{y}_i)^2$
$R^2 = \frac{SSR}{SSR + SSE}$
The $R^2$ value in this example is very low, indicating that the model does not provide a good fit.
It is instructive to look at the fitted values of the model i.e. the values $\hat{y}_i$ that the model calculates for each $x_i$ value:
print(res.fittedvalues)
By plotting the fitted values and the actual values against the independent variable we gain an understanding of the shortcomings of this model:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(pay, sales, 'o')
plt.plot(pay, res.fittedvalues, '-')
plt.title("Sales by Pay, Fitted and Actual Values")
We observe that the fitted line rises with pay, but the actual values scatter widely around it; the linear model in pay captures only a small part of the variation, in line with the low $R^2$ value.
F-statistic
The F-statistic is another important indicator for the model; it aims at testing
$H_0$: all $ \beta_{i>0} = 0 $
against
$H_a$: at least one of $ \beta_{i>0} $ is non-zero
In this example, we compare the small model $ y_i = \beta_0 + \epsilon_i $ with the full model $ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $
From the F-value we can compute a corresponding p-value, which is again (not surprisingly in this case) not significant at $ \alpha = 0.05 $.
print(res.fvalue)
print(res.f_pvalue)
Other Modeling Choices
Assume that we also have the data on how long the employees have been in the company:
years = [5, 5, 5, 6, 7, 8, 7, 8, 0, 0]
print(stats.pearsonr(sales, years))
Sales and years are highly correlated, therefore we add the variable years to our model, so we are now estimating
$$ y_i = \beta_0 + \beta_1 a_i + \beta_2 b_i + \epsilon_i $$where $a_i$ and $b_i$ are the values for salary and years, respectively.
%%R
years <- c(5, 5, 5, 6, 7, 8, 7, 8, 0, 0)
summary(lm(sales ~ pay + years))
X = patsy.dmatrix("pay + years", data = { "pay": pay, "years": years })
res = m.OLS(sales, X).fit()
print(res.params, res.pvalues, res.rsquared, res.f_pvalue)
The resulting model is quite different from the previous one:
Remember that the correlation of sales and salary was positive, yet in this linear model a better fit in terms of least squares is achieved with a negative parameter estimate for pay. This clearly shows that the output of linear modeling has to be interpreted with great care: does it really make sense that people sell more if they are paid less?
Compare these results with another model using only years as the independent variable:
%%R
summary(lm(sales ~ years))
X = patsy.dmatrix("years", data = { "years": years })
res = m.OLS(sales, X).fit()
print(res.params, res.pvalues, res.rsquared, res.f_pvalue)
The $R^2$ value has gone down, but both parameter estimates are highly significant, and there is a meaningful interpretation: people who are in the salesforce for a longer period of time tend to sell more. The plot of estimated and actual values shows a moderately adequate fit.
plt.figure()
plt.plot(years, sales, 'o')
plt.plot(years, res.fittedvalues, '-')
plt.title("Sales by Years, Fitted and Actual Values")
There are many choices in statistical modeling. It is up to the good judgement of the modeler to select the most useful model.