The Linear Model

When a cause-and-effect relationship between observed variables is suspected, the linear model is a simple yet powerful means to study that relationship. We will approach this subject using an example.

The Auto MPG dataset

This dataset can be found on the UCI Machine Learning site:

https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

Using the numpy function genfromtxt() we can easily import the data once we have downloaded the file into the current directory. The file itself does not contain headers, but the UCI site describes the columns as follows:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

Let us study a possible relationship between the weight of cars and their fuel efficiency. As a first step it is always useful to plot the data.

In [642]:
import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('auto-mpg.data', usecols=(0,4))
y = data[:,0]
x = data[:,1]
plt.plot(x, y, 'o')
Out[642]:
[<matplotlib.lines.Line2D at 0x7f00587c92b0>]

As suspected, heavier cars have lower mpg values. The relationship is apparent from the plot, but how can we quantify this idea? Statistics provides commonly used tools for this, such as the correlation.

Correlation

The Pearson correlation coefficient is

$$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$$

with sample covariance

$$s_{x,y} = \frac{\sum (x_i-\bar{x}) (y_i - \bar{y})}{N-1}$$

The sample standard deviation is

$$s_x = \sqrt{s_{x,x}}$$

Just for once, to hone our programming skills and motivate these concepts a little further, let us compute these values ourselves first, and then compare our results with those from a Python package.

In [643]:
def scov(x, y):
    # sample covariance of x and y
    return sum((x - np.mean(x)) * (y - np.mean(y))) / (len(x) - 1)

def sdev(x):
    # sample standard deviation: square root of the covariance of x with itself
    return np.sqrt(scov(x, x))

def pcor(x, y):
    # Pearson correlation coefficient
    return scov(x, y) / (sdev(x) * sdev(y))

print(pcor(x, y))
-0.8317409332443353

That value looks reasonable. It is negative, since increasing weight results in lower mpg. It is also rather close to -1, meaning that the relationship is strong, as can be expected from the plot. Now let us check our result against the SciPy implementation:

In [644]:
from scipy.stats import pearsonr

print(pearsonr(x, y))
(-0.8317409332443348, 2.9727995640500577e-103)

Here we also get the two-tailed p-value, which is very low, indicating that the probability of arriving at a correlation this strong purely by chance is very low. What do we mean by chance? Let us do a little experiment:

In [645]:
x_, y_ = x[:10], y[:10]
print(pearsonr(x_, y_))
(-0.8504312206742171, 0.0018204771832296383)

Here we took only the first 10 values of x and y and again computed the correlation. This value is even a little lower, i.e. it indicates a stronger negative correlation; however, the p-value is no longer nearly as low as before. Clearly, with decreasing sample size the chance of getting very different results increases:

In [646]:
x_, y_ = x[:7], y[:7]
print(pearsonr(x_, y_))
(-0.7928645992717445, 0.03342831535770518)

Now we arrive at a result that would only be considered significant at the level of $\alpha = 0.05$ but not at $\alpha = 0.02$.
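
To make the notion of chance more concrete, we can also destroy the real relationship by shuffling the mpg values and recomputing the correlation. A minimal sketch (the result varies from run to run, but the correlation will typically be close to zero and the p-value far from significant):

y_shuffled = np.random.permutation(y)   # break the pairing between weight and mpg
print(pearsonr(x, y_shuffled))          # typically a correlation near 0 with a large p-value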

Simple Regression

In the most basic case the variable y depends on only one variable x. For each observation $(x_i, y_i)$ we assume

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

We are looking for parameters $\beta_0$ and $\beta_1$ such that the sum of squared errors $\sum \epsilon_i^2$ is minimal.

In other words, we are looking for a straight line through the data that captures this linear relationship. We fit such a line for several reasons:

  • to study the effect of the independent variables, e.g. by looking at their p-values
  • to predict additional data points, i.e. values of y corresponding to x-values that are not part of our dataset
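
For this simple one-variable case the least-squares solution can actually be written in closed form using the quantities defined above:

$$\hat{\beta}_1 = \frac{s_{x,y}}{s_{x,x}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

A quick sketch using the scov() helper from above should reproduce, up to rounding, the values used in the next cell:

b1 = scov(x, y) / scov(x, x)       # closed-form least-squares slope
b0 = np.mean(y) - b1 * np.mean(x)  # intercept
print(b0, b1)                      # expect roughly 46.3 and -0.0077
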
In [647]:
b0 = 46.32      # intercept of the regression line (cf. the statsmodels fit below)
b1 = -0.0077    # slope
plt.plot(x, y, 'o')
plt.plot([1500, 5000], [b0+b1*1500, b0+b1*5000], linewidth=3)
plt.plot()
Out[647]:
[]

Just by looking at the plot we can already see a basic limitation of the linear model for prediction, and a little computation confirms this:

In [648]:
print(y[np.argmax(x)])      # actual mpg of the heaviest car
print(b0+b1*np.max(x))      # model prediction for the heaviest car
print(b0+b1*6100)           # model prediction for a hypothetical 6100 lb car
13.0
6.741999999999997
-0.6499999999999986

For values of x above 5000 the straight line heads towards zero. The heaviest car still has an mpg value of 13, but our model predicts 6.7. In other words, the model only works reasonably well within certain limits. As we approach the minimum and maximum x-values the corresponding y-values from the model tend to become less useful, and beyond a certain point they lose all plausibility.
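In fact, the fitted line crosses zero at roughly $x = 46.32/0.0077 \approx 6000$, i.e. the model predicts zero miles per gallon for a car weighing about 6000 lbs, which is clearly not meaningful.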

Nevertheless, when we are aware of these limitations the linear model is a very useful tool. While the computation of the correlation coefficient was easy to do with our programming skills, the estimation of the beta values becomes more elaborate once several variables are involved, and we will trust the statsmodels package with this task.

This package expects an additional column of ones in the input data for the constant term; otherwise the model is fitted without an intercept and $\beta_0$ will not appear in the results.

In [649]:
import statsmodels.api as sm
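
As a quick check of what sm.add_constant() (used in the next cell) does, the following sketch (output omitted) shows that it simply prepends a column of ones to the data:

print(sm.add_constant(x)[:3])   # first three rows: a column of ones next to the weight values
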
In [650]:
f = sm.OLS(y, sm.add_constant(x)).fit()
print(f.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.692
Model:                            OLS   Adj. R-squared:                  0.691
Method:                 Least Squares   F-statistic:                     888.9
Date:                Sun, 19 Apr 2020   Prob (F-statistic):          2.97e-103
Time:                        13:48:26   Log-Likelihood:                -1148.4
No. Observations:                 398   AIC:                             2301.
Df Residuals:                     396   BIC:                             2309.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         46.3174      0.795     58.243      0.000      44.754      47.881
x1            -0.0077      0.000    -29.814      0.000      -0.008      -0.007
==============================================================================
Omnibus:                       40.423   Durbin-Watson:                   0.797
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               56.695
Skew:                           0.713   Prob(JB):                     4.89e-13
Kurtosis:                       4.176   Cond. No.                     1.13e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The output may appear a little overwhelming, but we can just pick what we want:

In [651]:
def mysum(f):
    # print only the coefficient table and the R-squared value
    print(f.summary().tables[1])
    print('R-squared:', round(f.rsquared, 3))
    
mysum(f)
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         46.3174      0.795     58.243      0.000      44.754      47.881
x1            -0.0077      0.000    -29.814      0.000      -0.008      -0.007
==============================================================================
R-squared: 0.692

The essential results of the fit are:

  • the column coef contains the beta values
  • the column with the header P>|t| contains the p-values for the individual beta values
  • the R-squared value is about 0.7

Both p-values are highly significant. The R-squared value represents how well the model fits or predicts the data; to be precise, it is the fraction of the variation around the mean that is explained by the model. This goodness-of-fit measure refers to the model as a whole, while the p-values are interpreted with respect to the individual beta values.
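
Written out, R-squared compares the squared residuals of the model with the total variation around the mean:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $\hat{y}_i$ are the fitted values.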

As a crude rule of thumb, a linear model with an R-squared below 0.7 is usually not accepted as a good fit. Note that in the case of simple regression R-squared is identical to the squared correlation coefficient:

In [652]:
print(pearsonr(x,y)[0]**2)
0.6917929800341571

Multiple Regression

The logical extension of the simple regression model is the introduction of additional variables which are suspected of having an effect on the dependent variable.

With a package like statsmodels it is very easy to study more elaborate linear models, such as

$$ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i $$
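
For reference, with the observations collected into a design matrix $X$ (including the column of ones for the constant) the model reads $y = X\beta + \epsilon$, and the least-squares estimate is

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

which is essentially what statsmodels computes for us.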

Let us assume that we are interested in the effect of the weight and the model year on the fuel consumption; it seems reasonable to expect newer cars to be a little more fuel-efficient.

When dealing with multiple variables we want named columns. The Pandas package provides this and many other nice features, but unfortunately it also does many things differently from Numpy, which can be quite confusing. This includes indexing, which stays closest to Numpy when the iloc notation is used (a short example follows the next cell).

In [653]:
import pandas as pd
df = pd.read_csv('auto-mpg.data', delimiter='\\s+', na_values='?',  # '?' marks missing values
                 usecols = [0,3,4,6,7],                             # mpg, horsepower, weight, year, origin
                 names = ['mpg', 'hp', 'weight', 'year', 'ori'])
print(df)
      mpg     hp  weight  year  ori
0    18.0  130.0  3504.0    70    1
1    15.0  165.0  3693.0    70    1
2    18.0  150.0  3436.0    70    1
3    16.0  150.0  3433.0    70    1
4    17.0  140.0  3449.0    70    1
..    ...    ...     ...   ...  ...
393  27.0   86.0  2790.0    82    1
394  44.0   52.0  2130.0    82    2
395  32.0   84.0  2295.0    82    1
396  28.0   79.0  2625.0    82    1
397  31.0   82.0  2720.0    82    1

[398 rows x 5 columns]
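
As a brief illustration of the iloc notation mentioned above (a sketch with arbitrarily chosen indices; output omitted):

print(df.iloc[0, 2])     # single element: first row, third column (weight)
print(df.iloc[:3, :2])   # first three rows of the first two columns (mpg and hp)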

We now have several independent variables, and we can use e.g. weight and model year for the estimation of the beta values, which works in exactly the same way:

In [654]:
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year']])).fit()
mysum(f)
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -14.1980      3.968     -3.578      0.000     -21.998      -6.398
weight        -0.0067      0.000    -31.161      0.000      -0.007      -0.006
year           0.7566      0.049     15.447      0.000       0.660       0.853
==============================================================================
R-squared: 0.808

The addition of the model year has increased the R-squared value to about 0.8, which is usually taken to indicate a moderately good fit. The individual p-values are all still highly significant.

The beta value for the second independent variable (our model year) is positive, indicating that a higher, i.e. later, model year is indeed associated with a higher mpg value. In other words, according to this dataset newer car models are more fuel-efficient.
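For example, holding the weight fixed, these coefficients predict that a 1982 model gets about $12 \times 0.76 \approx 9$ mpg more than a comparable 1970 model.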

Missing Values

Adding more variables to the model provides additional insights, but unfortunately even this curated dataset is not perfect: it contains missing values. This situation is commonly encountered in data processing and must be addressed.

In our dataset the horsepower is missing for 6 observations.

In [655]:
print(sum(df['hp'].isnull()))
6

There are several approaches to dealing with missing values:

  • let the statistics package deal with it; risky
  • replace with some other value, such as the mean of the remaining values in the column (see the sketch after this list)
  • drop the observations containing missing values
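
For completeness, the second approach (mean imputation) could look like the following sketch; we will not use it here:

df_imputed = df.copy()                                       # keep the original dataframe untouched
df_imputed['hp'] = df_imputed['hp'].fillna(df['hp'].mean())  # replace missing horsepower by the column mean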

We will take the third approach. NaN ('not a number') is used by Numpy to mark missing values; the Pandas function dropna() removes rows containing them from a dataframe.

In [656]:
df = df.dropna()    # drop the 6 rows with missing horsepower
print(df.shape)
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year', 'hp']])).fit()
mysum(f)
(392, 5)
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -13.7194      4.182     -3.281      0.001     -21.941      -5.498
weight        -0.0064      0.000    -15.768      0.000      -0.007      -0.006
year           0.7487      0.052     14.365      0.000       0.646       0.851
hp            -0.0050      0.009     -0.530      0.597      -0.024       0.014
==============================================================================
R-squared: 0.808

Note that the number of observations has gone down to 392: the original 398 minus the 6 with missing horsepower values.

Note also that the addition of the horsepower did not improve our model. The R-squared value is unchanged, and the p-value for hp shows that the horsepower is not significant. It should not be included in the model.

Dummy Variables

Looking at the description of the dataset again we find that column number 8 contains an origin code. By comparing car names and origin codes we can establish that

  • 1 means US
  • 2 means Europe
  • 3 means Asia

If we want to study the effect of the origin on the fuel efficiency it would be tempting to simply include that column in the import and fit the linear model as before.

In [657]:
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year', 'ori']])).fit()
mysum(f)
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -18.0459      4.001     -4.510      0.000     -25.913     -10.179
weight        -0.0060      0.000    -23.588      0.000      -0.006      -0.005
year           0.7571      0.048     15.668      0.000       0.662       0.852
ori            1.1504      0.259      4.439      0.000       0.641       1.660
==============================================================================
R-squared: 0.817

This particular assignment of the codes happens to match the order of fuel efficiency: US < Europe < Asia. However, the codes did not have to be assigned in this fashion, and treating them as a single numeric variable imposes such an ordering (and equal spacing) on the model. A more robust solution is the introduction of dummy variables, each indicating the presence or absence of a certain feature.

In [658]:
df = pd.get_dummies(df, columns=['ori'])
print(df)
      mpg     hp  weight  year  ori_1  ori_2  ori_3
0    18.0  130.0  3504.0    70      1      0      0
1    15.0  165.0  3693.0    70      1      0      0
2    18.0  150.0  3436.0    70      1      0      0
3    16.0  150.0  3433.0    70      1      0      0
4    17.0  140.0  3449.0    70      1      0      0
..    ...    ...     ...   ...    ...    ...    ...
393  27.0   86.0  2790.0    82      1      0      0
394  44.0   52.0  2130.0    82      0      1      0
395  32.0   84.0  2295.0    82      1      0      0
396  28.0   79.0  2625.0    82      1      0      0
397  31.0   82.0  2720.0    82      1      0      0

[392 rows x 7 columns]
In [659]:
f = sm.OLS(df['mpg'], sm.add_constant(df[['weight', 'year', 'ori_1', 'ori_2', 'ori_3']])).fit()
mysum(f)
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -12.6825      2.974     -4.265      0.000     -18.529      -6.836
weight        -0.0059      0.000    -22.647      0.000      -0.006      -0.005
year           0.7698      0.049     15.818      0.000       0.674       0.866
ori_1         -5.6244      1.073     -5.244      0.000      -7.733      -3.516
ori_2         -3.6481      0.991     -3.681      0.000      -5.597      -1.700
ori_3         -3.4099      1.048     -3.253      0.001      -5.471      -1.349
==============================================================================
R-squared: 0.819

The dummy variables allow us to see the effect of the origin more clearly: all coefficients are negative, which can be interpreted as a sort of penalty a car gets for coming from that particular region:

  • US means a heavy penalty in terms of fuel efficiency
  • on the other hand, we now see that cars from Europe and Asia are practically identical in terms of fuel efficiency
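
Putting numbers to this: according to these coefficients a US car is predicted to get about $5.62 - 3.65 \approx 2$ mpg less than a comparable European car, while the difference between Europe (-3.65) and Asia (-3.41) is only about 0.2 mpg.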

The summary above is often too detailed; the individual elements of the result can also be accessed directly, see the documentation for details:

https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html

In [660]:
print(f.params)
const    -12.682498
weight    -0.005887
year       0.769849
ori_1     -5.624446
ori_2     -3.648140
ori_3     -3.409912
dtype: float64
In [661]:
print(f.pvalues)
const     2.518023e-05
weight    6.515808e-73
year      8.022354e-44
ori_1     2.589445e-07
ori_2     2.649985e-04
ori_3     1.242211e-03
dtype: float64
In [662]:
print(f.rsquared)
0.8190347755921505

EXERCISES:

Find more data files on the UCI site for linear modelling. There are literally hundreds of files waiting to be analysed. Dig into that treasure of data!