When a cause-and-effect relationship between observed variables is suspected, the linear model is a simple yet powerful means of studying that relationship. We will approach this subject using an example.
This dataset can be found on the UCI Machine Learning site:
https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Using the numpy function genfromtxt() we can easily import the data once we have downloaded the file into the current directory. The file itself does not contain headers, but the UCI site describes the columns as
Let us study a possible relationship between the weight of cars and their fuel efficiency. As a first step it is always useful to plot the data.
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('auto-mpg.data', usecols=(0, 4))  # columns: mpg, weight
y = data[:, 0]  # mpg
x = data[:, 1]  # weight
plt.plot(x, y, 'o')
plt.show()
As suspected, heavier cars have lower values of mpg. The relationship is apparent from the plot, but how can we quantify this idea? Statistics provides commonly used tools for this, such as the correlation coefficient.
The Pearson correlation coefficient is
$$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$$
with sample covariance
$$s_{x,y} = \frac{\sum (x_i-\bar{x}) (y_i - \bar{y})}{N-1}$$
The sample standard deviation is
$$s_x = \sqrt{s_{x,x}}$$
Just for once, to hone our programming skills and motivate these concepts a little further, let us compute these values ourselves first, and then compare our results with those from a Python package.
def scov(x, y):
    return sum((x - np.mean(x)) * (y - np.mean(y))) / (len(x) - 1)

def sdev(x):
    return np.sqrt(scov(x, x))

def pcor(x, y):
    return scov(x, y) / (sdev(x) * sdev(y))
print(pcor(x, y))
That value looks reasonable. It is negative, since increasing weight results in lower mpg, and it is rather close to -1, meaning that the relationship is strong, as can be expected from the plot. Now let us check our result against the SciPy implementation:
from scipy.stats import pearsonr
print(pearsonr(x, y))
Here we also get the two-tailed p-value, which is very low, indicating that such a strong correlation is extremely unlikely to arise by chance. What do we mean by chance? Let us do a little experiment:
x_, y_ = x[:10], y[:10]
print(pearsonr(x_, y_))
Here we took only the first 10 values of x and y and again computed the correlation. Note that this value is now even a little lower, i.e. it indicates an even stronger negative correlation; however, the p-value is no longer extremely low as before. Clearly, with decreasing sample size the chance of getting very different results increases:
x_, y_ = x[:7], y[:7]
print(pearsonr(x_, y_))
Now we arrive at a result that would only be considered significant at the level of $\alpha = 0.05$ but not at $\alpha = 0.02$.
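The effect of sample size on the stability of the correlation coefficient can be made visible with a small simulation. The sketch below uses synthetic data (a hypothetical stand-in for the car data, with a similar negative linear trend plus noise) and measures how much r varies over many random subsamples of a given size:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in for the car data: a strong negative linear
# relationship plus noise (the real x and y arrays behave similarly).
rng = np.random.default_rng(0)
x = rng.uniform(1500, 5000, 400)
y = 46 - 0.0077 * x + rng.normal(0, 3, 400)

def r_spread(n, reps=1000):
    """Standard deviation of r over many random subsamples of size n."""
    rs = []
    for _ in range(reps):
        idx = rng.choice(len(x), size=n, replace=False)
        rs.append(pearsonr(x[idx], y[idx])[0])
    return np.std(rs)

print(r_spread(7))    # small samples: r varies considerably
print(r_spread(100))  # larger samples: r is much more stable
```

The spread for samples of size 7 is several times larger than for samples of size 100, which is exactly the "chance" the p-value quantifies.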
In the most basic case the variable y depends on only one variable x. For each $x_i, y_i$ in a set of observations we assume
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
We are looking for parameters $\beta_0$ and $\beta_1$ such that the sum of squared errors $\sum \epsilon_i^2$ is minimal.
In other words, we are looking for a straight line through our data that captures the idea of that linear relationship:
b0 = 46.32    # intercept
b1 = -0.0077  # slope
plt.plot(x, y, 'o')
plt.plot([1500, 5000], [b0 + b1*1500, b0 + b1*5000], linewidth=3)
plt.show()
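Where do these beta values come from? For the simple regression they have a closed form built from the quantities we already defined. A minimal sketch, shown on synthetic data so it runs on its own (applied to the real x and y arrays it gives values close to those used above):

```python
import numpy as np

# Closed-form least-squares solution for the simple regression:
#     b1 = s_xy / s_xx    (sample covariance over sample variance of x)
#     b0 = mean(y) - b1 * mean(x)
rng = np.random.default_rng(1)
x = rng.uniform(1500, 5000, 200)
y = 46.32 - 0.0077 * x + rng.normal(0, 2, 200)

b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0 = np.mean(y) - b1 * np.mean(x)
print(b0, b1)

# np.polyfit with degree 1 solves the same least-squares problem;
# it returns the coefficients as [b1, b0].
print(np.polyfit(x, y, 1))
```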
Just by looking at the plot we can already see a basic limitation of the linear model for prediction, and a little computation confirms this:
print(y[np.argmax(x)])     # mpg of the heaviest car
print(b0 + b1*np.max(x))   # model prediction for the heaviest car
print(b0 + b1*6100)        # prediction beyond the observed range
For values of x above 5000 the straight line goes towards zero. The heaviest car still has an mpg value of 13, but our model predicts 6.7. In other words, the model only works moderately well within certain limits. As we approach the minimum and maximum x-values the corresponding y-values from the model tend to become less useful, and after a certain point they lose all plausibility.
Nevertheless, when we are aware of these limitations the linear model is a very useful tool. While the computation of the correlation coefficient was still easy to do with our programming skills, the derivation of the beta values is somewhat more elaborate, and we will trust the statsmodels package with this task.
This package requires an additional column of ones in the input data for the constant; otherwise the intercept $\beta_0$ will not be estimated.
import statsmodels.api as sm
f = sm.OLS(y, sm.add_constant(x)).fit()
print(f.summary())
The output may appear a little overwhelming, but we can just pick what we want:
def mysum(f):
    print(f.summary().tables[1])
    print('R-squared:', round(f.rsquared, 3))
mysum(f)
The essential results of the fit are:
Both p-values are highly significant. The R-squared value represents how well the model works in predicting or fitting the data (to be precise, how much of the variation around the mean is explained by the model). This goodness of fit refers to the model as a whole, while the p-values are interpreted with respect to the individual beta values.
As a crude rule of thumb, a linear model with R-squared below 0.7 is usually not accepted as a good fit. Note that in the case of the simple regression, R-squared is identical to the square of the correlation coefficient:
print(pearsonr(x,y)[0]**2)
The logical extension of the simple regression model is the introduction of additional variables which are suspected of having an effect on the dependent variable.
With a package like statsmodels it is very easy to study more elaborate linear models, such as
$$ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i $$
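The same least-squares idea carries over directly to matrix form: stack a constant column and the predictors into a matrix X and minimise $\sum \epsilon_i^2 = \|y - X\beta\|^2$. A sketch with np.linalg.lstsq on synthetic data with two predictors (the variable names are only illustrative):

```python
import numpy as np

# Two synthetic predictors with known true coefficients, so we can
# see that the least-squares solution recovers them.
rng = np.random.default_rng(3)
x1 = rng.uniform(1500, 5000, 300)             # e.g. weight
x2 = rng.integers(70, 83, 300).astype(float)  # e.g. model year
y = 30 - 0.006 * x1 + 0.7 * x2 + rng.normal(0, 2, 300)

X = np.column_stack([np.ones_like(x1), x1, x2])  # constant column first
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates of the true values [30, -0.006, 0.7]
```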
Let us assume that we are interested in the effect of the weight and the model year on the fuel efficiency; it seems reasonable to expect newer cars to be a little more fuel-efficient.
When dealing with multiple variables we want named columns. The Pandas package provides this and many other convenient features, but unfortunately it also does many things differently from Numpy, which can be quite confusing. This includes indexing, where the iloc notation is closest to Numpy's behaviour.
import pandas as pd
df = pd.read_csv('auto-mpg.data', delimiter=r'\s+', na_values='?',
                 usecols=[0, 3, 4, 6, 7],
                 names=['mpg', 'hp', 'weight', 'year', 'ori'])
print(df)
We now have several independent variables, and we can use e.g. weight and model year for the estimation of the beta values which works in exactly the same way:
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year']])).fit()
mysum(f)
The addition of the model year has increased the R-squared value to about 0.8, which is usually taken to indicate a moderately good fit. The individual p-values are all still highly significant.
The beta value for the second independent variable (the model year) is positive, indicating that a higher, i.e. later, model year is indeed associated with a higher mpg value. In other words, according to this dataset newer car models are more fuel-efficient.
Adding more variables to the model provides additional insights, but unfortunately even this curated dataset is not perfect: it contains missing values. This situation is commonly encountered in data processing and must be addressed.
In our dataset the horsepower is missing for 6 observations.
print(df['hp'].isnull().sum())
There are several approaches to dealing with missing values:
We will take the third approach. NaN stands for "not a number" in Numpy; the dataframe method dropna() removes rows with missing values.
df = df.dropna()
print(df.shape)
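For comparison, imputation is another common way to handle missing values. A minimal sketch on a hypothetical toy frame (not used in the analysis here), filling gaps with the column mean via fillna:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing horsepower value; fillna replaces it,
# here with the mean of the remaining values in that column.
demo = pd.DataFrame({'mpg': [18.0, 15.0, 16.0],
                     'hp':  [130.0, np.nan, 150.0]})
demo['hp'] = demo['hp'].fillna(demo['hp'].mean())
print(demo['hp'].tolist())  # [130.0, 140.0, 150.0]
```

Imputation keeps all observations but can distort the variance of the imputed column, which is why we prefer dropping the few affected rows here.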
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year', 'hp']])).fit()
mysum(f)
Note that the number of observations in the summary has gone down to 392. This is the original 398 minus the 6 with missing horsepower values.
Note also that the addition of the horsepower did not improve our model. The R-squared value is unchanged, and the p-value for hp shows that the horsepower is not significant. It should not be included in the model.
Looking at the description of the dataset again we find that column number 8 contains an origin code. By comparing model names and origin codes we can establish that code 1 stands for the US, 2 for Europe, and 3 for Asia.
If we want to study the effect of the origin on the fuel efficiency, it would be tempting to simply include that column in the import and fit the linear model as before.
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year', 'ori']])).fit()
mysum(f)
This particular order of the codes actually has a matching order of fuel efficiency: US < Europe < Asia. However, the codes do not need to be assigned in this particular fashion. A more robust solution is the introduction of dummy variables, each indicating the presence or absence of a certain feature.
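What get_dummies does can be seen on a hypothetical toy column: each distinct code becomes its own 0/1 indicator column.

```python
import pandas as pd

# Each value of 'ori' turns into an indicator column named ori_<code>;
# dtype=int requests numeric 0/1 columns rather than booleans.
toy = pd.DataFrame({'ori': [1, 2, 3, 1]})
d = pd.get_dummies(toy, columns=['ori'], dtype=int)
print(d)
#    ori_1  ori_2  ori_3
# 0      1      0      0
# 1      0      1      0
# 2      0      0      1
# 3      1      0      0
```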
df = pd.get_dummies(df, columns=['ori'], dtype=int)  # dtype=int keeps the indicators numeric
print(df)
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year', 'ori_1', 'ori_2', 'ori_3']])).fit()
mysum(f)
The dummy variables allow us to see the effect of the origin more clearly: all coefficients are negative, which can be interpreted as a sort of penalty a car gets for coming from that particular region:
The summary above is often too detailed; the individual elements of the result can be accessed as well, see the documentation for details:
print(f.params)
print(f.pvalues)
print(f.rsquared)
EXERCISES:
Find more data files on the UCI site for linear modelling. There are literally hundreds of files waiting to be analysed. Dig into that treasure of data!