From Perceptrons to Neural Nets

Johann Mitloehner, March 2017

This text provides a short introduction to machine learning using neural nets with some emphasis on data acquisition, preparation, and exploration via reports and plotting for the benefit of data science students. Currently it is still work in progress; if you find any errors please let me know.

Python is used in the examples, and various features specific to this language are pointed out for readers who are unfamiliar with Python but have some background in other languages such as C or Java.

The Iris Data Set

The UCI ML repository [5] contains a small data set on three species of Iris plants under the URL

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Since we only need to download the file once, we use the web browser to download it interactively and save it to the current directory under the filename iris.data. Then use your favorite text editor to inspect the contents of the file.
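
If you prefer to script the download, the following is a minimal sketch using Python 2's standard urllib module; the URL and filename are the same as above.

In [ ]:
# optional: download the file programmatically instead of via the browser
import urllib
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
urllib.urlretrieve(url, "iris.data")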

$\bigstar$ When processing a data file from any source, always check the contents of the file, either interactively in a text editor or spreadsheet, or within your program. Never rely on the data to correspond to an advertised format.

The data file is not yet in a format suitable for processing: first we need to strip whitespace and split each line into comma-separated fields. The code below reads the data from the file iris.data in the current directory and puts it into a variable irisdata.

Sadly, the file on the UCI site contains some empty lines at the end. The condition if line.strip() omits them.

Note that indexing starts at 0 in Python, just as in C and Java; irisdata[0] denotes the first item.

In [82]:
# Iris data set from UCI ML repository
import numpy as np
irisdata = np.array([ line.strip().split(",") 
                     for line in open("iris.data") if line.strip() ])
print len(irisdata)
print irisdata[0]
print irisdata[50]
print irisdata[100]
150
['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
['7.0' '3.2' '4.7' '1.4' 'Iris-versicolor']
['6.3' '3.3' '6.0' '2.5' 'Iris-virginica']

There are 50 observations for each species, making a total of 150. We learn from the data description on the UCI website that columns 0 and 2 contain the sepal and petal length of each observation, measured in cm. Columns 1 and 3 contain the corresponding widths which we will ignore here.

Sepals are the leaf-like outer parts of many flowers; they are usually green and serve as support and protection for the petals. The characteristic lengths of sepals and petals can be used for species identification.
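
Before building any model it is worth a quick look at the value ranges of the two columns we will use; the sketch below reads them directly from irisdata (sepal and petal are just throwaway names).

In [ ]:
# quick exploration: value ranges of sepal (column 0) and petal (column 2) length
sepal = irisdata[:, 0].astype(float)
petal = irisdata[:, 2].astype(float)
print "sepal length min/max:", sepal.min(), sepal.max()
print "petal length min/max:", petal.min(), petal.max()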

We want to generate two data sets, so we define a function to avoid code duplication:

  • sep encodes observations of Iris setosa with code 1 vs 0 for the other species
  • nonsep encodes Iris virginica vs others

We will use Numpy arrays to hold the data; this provides several advantages:

  • we can use the handy Numpy utility functions
  • we get superior performance compared to plain Python lists
  • if a suitable linear algebra library is present in the Python installation then all cores in a multi-core system are used for matrix-vector multiplications, resulting in an additional performance gain.
In [83]:
# select sepal and petal length from Iris data for given species 
np.random.seed(4217)
def select(data, species1, species2):
    mask = np.array([ data[i][4] in (species1, species2) 
                     for i in range(len(data)) ])
    X = np.array([ [ 1.0, float(row[0]), float(row[2]) ] for row in data[mask] ])
    y = [ int(row[4] == species1) for row in data[mask] ]
    return X, y

# setosa is linearly separable from the others
sep = select(irisdata, 'Iris-setosa', 'Iris-versicolor')

# virginica and versicolor are not linearly separable 
nonsep = select(irisdata, 'Iris-versicolor', 'Iris-virginica')

Note that Python functions can return more than one result; the select function returns matrix X and vector y in a structure called a tuple.
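
For readers coming from C or Java, a minimal sketch of this feature with a made-up function minmax: the function simply returns several values separated by commas, and the caller unpacks them in a single assignment.

In [ ]:
# returning and unpacking a tuple
def minmax(values):
    return min(values), max(values)

lo, hi = minmax([3, 1, 4, 1, 5])
print lo, hi   # prints: 1 5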

Matrix X contains rows of observations for sepal and petal lengths as floating point numbers; for each observation the constant 1.0 is added as the first element, followed by the observed values. The added constant simplifies notation and code in the examples below.

Vector y contains the species group for each observation, coded as 0 and 1; the expression int(row[4] == species1) in the select() function produces this coding.

We start with the first set; since the structure sep is a two-element tuple we assign X and y simultaneously.

X[:3] denotes the first three lines of matrix X. Conveniently, X[-3:] denotes the last three lines.

We can see that each line of X indeed contains one observation; each element of y contains the corresponding species code as 1 or 0. Our data starts with the setosa observations, followed by the others.

$\bigstar$ It is always a good idea to print some samples of the data once it has been prepared for processing, to see whether everything looks the way we intended. In real-world applications a lot more has to be done at this stage to ensure proper data preparation.

In [84]:
X, y = sep
print X[:3]
print y[:3]
print X[-3:]
print y[-3:]
print len(X), len(y)
[[ 1.   5.1  1.4]
 [ 1.   4.9  1.4]
 [ 1.   4.7  1.3]]
[1, 1, 1]
[[ 1.   6.2  4.3]
 [ 1.   5.1  3. ]
 [ 1.   5.7  4.1]]
[0, 0, 0]
100 100

An Example of a Linearly Separable Problem

The concept of linear separability is best introduced with an example in two dimensions, such as our matrix X with values for sepal and petal lengths.

To show that the observations of the setosa species of Iris, as identified by their sepal and petal lengths, are linearly separable from the other species, we use matplotlib to make a scatter plot of the sepal and petal lengths for the two classes. In this case we know that the first 50 data points are for setosa; X[:50,1] denotes column 1 (which is the second column!) of the first 50 rows, and X[50:,1] denotes rows from 50 to last.

We define a function since we will use the code again later for the non-separable problem.

$\bigstar$ Always define a function for code that is used more than once, even if that code is just a few statements.

In [85]:
# plot the two species to show linear separation
%matplotlib inline
import matplotlib.pyplot as plt

def irisplot(X, y):
    for i in range(len(X)): plt.scatter(X[i,1], X[i,2], marker='.x'[y[i]])
    #plt.scatter(X[:50,1], X[:50,2], marker='.')
    #plt.scatter(X[50:,1], X[50:,2], marker='x')
    plt.xlabel('sepal')
    plt.ylabel('petal')
    plt.show()
    
irisplot(X, y)

This shows the classes to be linearly separable: we can draw a straight line to separate the two classes. In fact, we can draw many such lines. The same concept applies to problems with more than two dimensions, with lines being replaced by hyperplanes.
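
As a quick check we can verify the separation numerically; the horizontal line petal = 2.5 is just one arbitrary choice of separator for this sketch.

In [ ]:
# every setosa (y == 1) lies below petal = 2.5, every versicolor (y == 0) above it
print all((X[i, 2] < 2.5) == (y[i] == 1) for i in range(len(X)))   # True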

Predicting with the Perceptron

For linearly separable problems the Perceptron [1] can learn to identify the class for each observation $\bf x$ by learning a weight vector $\bf w$. By convention vectors are denoted in bold face, while individual elements such as $x_0$ are denoted in normal mathematical font which is usually italics.

We prepared the data so that $x_0=1$ for all observations, $x_1$ is the sepal length in the current observation, and $x_2$ is the petal length. The weights $w_j$ will result from the learning rule and can take on any value.

Once the weights are learned, the computation of the output is a simple dot product followed by a non-linearity:

$$ o = \begin{cases} 1 & \text{if}\ {\bf x} \cdot {\bf w} \gt 0.5 \\ 0 & \text{otherwise} \end{cases} $$

The dot product of vectors ${\bf x}$ and ${\bf w}$ is

$$ {\bf x} \cdot {\bf w} = \sum_j x_j w_j $$

With Numpy's dot() function the performance of these computations is similar to programming the same algorithm in a statically compiled language such as C, or a Just-In-Time compiled language such as Java.

In [86]:
import numpy as np
W = [ [0.1, 0.2, 0.3], [0.4, 0.2, 0.1] ]
x = [ 2, 3, 5 ]
print np.dot(W, x)
[ 2.3  1.9]

The perceptron can also be formulated as predicting classes encoded as -1 and 1, and using 0.0 rather than 0.5 for the non-linearity threshold. The version used here is chosen for consistency with the other approaches discussed later.

The first element $w_0$ of the weight vector is often termed the bias and denoted separately. However, incorporating the bias into the weight vector makes the notation more concise.
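
A minimal sketch of why folding in the bias works, with made-up numbers b1, w1, x1: prepending the constant 1 to the observation and the bias to the weight vector reproduces the usual bias term exactly.

In [ ]:
# bias as a separate term vs. folded into the weight vector
b1 = 0.01
w1 = [0.07, 0.05]
x1 = [5.1, 1.4]
print np.dot(x1, w1) + b1             # 0.437
print np.dot([1.0] + x1, [b1] + w1)   # same value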

The predict() function is used to generate a prediction on the observation $\bf x$ using the weights $\bf w$:

In [87]:
# predict class (0 or 1) of observation x from weight vector w 
def predict(x, w):
    return int(np.dot(x, w) > 0.5 )

Given a weight vector the Perceptron assigns a class to each observation. With the hand-picked weights below, X[0] (a setosa) is assigned 0 and X[50] (a versicolor) is assigned 1; note that this is the reverse of our coding, where setosa is 1. The learning rule in the next section finds weights that match the coding:

In [88]:
w = [0.01, 0.07, 0.05]

print X[0]
print predict(X[0], w)

print X[50]
print predict(X[50], w)
[ 1.   5.1  1.4]
0
[ 1.   7.   4.7]
1

The Perceptron Learning Rule

The Perceptron starts with weights all zero. During learning the observations are used to provide updates for the weights. In the simple implementation below learning stops after a specified number of epochs.

During each epoch all observations are processed sequentially; for each observation the learning rule provides a small correction for $w_j$ based on the error of target value $y$ versus output $o$:

$$ \Delta w_j = \eta ~ (y - o) ~ x_j $$

In vector notation this becomes

$$ {\Delta\bf w} = \eta ~ (y - o) ~ {\bf x} $$

These values are used to update the weights:

$$ {\bf w} := {\bf w} + {\bf\Delta w} $$

This learning rule is implemented in the code below. Because of the automatic element-wise operations of Numpy arrays we can update the complete weight vector in one statement; looping over the individual elements is not necessary. At the end of each epoch we print the percentage of misclassified observations.

In [89]:
# Perceptron
def percep(dataset, epochs, eta):
    X, y = dataset
    r, c = len(X), len(X[0])
    w = np.zeros(c)
    for ep in range(epochs):
        # for each sample: calculate error, update weights
        for i in range(r):
            w += eta * (y[i] - predict(X[i], w)) * X[i]
        # number of misclassified
        print round(100.0 * sum([ (y[i] != predict(X[i], w)) for i in range(r) ])/len(y)), 
    print ""
    return w

The learning rate $\eta$ has to be chosen wisely, depending on the problem.

In the example below the perceptron converges quickly and separates the two classes with 0 errors.

In [90]:
# convergence after a few epochs
percep(sep, 30, 0.02)
50.0 50.0 50.0 50.0 50.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
Out[90]:
array([ 0.08 ,  0.194, -0.224])

Unfortunately, the domain of linearly separable problems is rather limited, and many practical applications cannot be solved by the perceptron, such as separating the Iris species virginica from versicolor:

In [91]:
X, y = nonsep
print X[:3]
print y[:3]
print X[-3:]
print y[-3:]
print len(X), len(y)
[[ 1.   7.   4.7]
 [ 1.   6.4  4.5]
 [ 1.   6.9  4.9]]
[1, 1, 1]
[[ 1.   6.5  5.2]
 [ 1.   6.2  5.4]
 [ 1.   5.9  5.1]]
[0, 0, 0]
100 100

A scatter plot shows that we cannot draw a straight line to separate the classes:

In [92]:
# these two classes are not linearly separable
irisplot(X, y)

In this case the perceptron will not converge.

In [93]:
# no convergence
percep(nonsep, 500, 0.0001)
50.0 55.0 56.0 56.0 56.0 56.0 56.0 56.0 56.0 56.0 56.0 56.0 56.0 56.0 54.0 54.0 51.0 54.0 55.0 51.0 55.0 51.0 55.0 51.0 52.0 51.0 51.0 51.0 51.0 51.0 52.0 51.0 52.0 51.0 52.0 51.0 52.0 52.0 52.0 52.0 51.0 52.0 52.0 51.0 52.0 52.0 52.0 51.0 52.0 52.0 52.0 52.0 51.0 51.0 51.0 51.0 51.0 52.0 51.0 51.0 51.0 51.0 51.0 51.0 51.0 51.0 51.0 51.0 51.0 50.0 50.0 50.0 50.0 50.0 50.0 49.0 50.0 48.0 50.0 48.0 49.0 48.0 48.0 48.0 47.0 48.0 48.0 47.0 47.0 48.0 46.0 46.0 46.0 46.0 46.0 46.0 46.0 47.0 44.0 46.0 46.0 46.0 46.0 45.0 44.0 44.0 44.0 44.0 45.0 45.0 45.0 45.0 45.0 44.0 44.0 44.0 44.0 44.0 44.0 44.0 44.0 44.0 44.0 44.0 43.0 44.0 44.0 43.0 43.0 44.0 44.0 43.0 42.0 43.0 43.0 42.0 42.0 43.0 43.0 42.0 41.0 41.0 41.0 41.0 43.0 41.0 40.0 39.0 39.0 39.0 39.0 42.0 41.0 40.0 39.0 39.0 41.0 40.0 40.0 39.0 39.0 39.0 40.0 40.0 39.0 39.0 38.0 39.0 39.0 38.0 39.0 38.0 38.0 38.0 38.0 38.0 38.0 38.0 38.0 38.0 38.0 37.0 38.0 38.0 37.0 38.0 38.0 37.0 37.0 38.0 36.0 37.0 35.0 36.0 37.0 35.0 36.0 36.0 34.0 35.0 36.0 36.0 34.0 34.0 36.0 36.0 34.0 34.0 36.0 34.0 34.0 34.0 34.0 36.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 35.0 35.0 35.0 35.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 34.0 33.0 34.0 34.0 34.0 33.0 34.0 34.0 34.0 33.0 34.0 34.0 34.0 33.0 34.0 34.0 33.0 34.0 34.0 34.0 33.0 34.0 34.0 32.0 33.0 33.0 33.0 33.0 33.0 32.0 33.0 33.0 33.0 33.0 32.0 33.0 33.0 33.0 33.0 31.0 31.0 33.0 33.0 33.0 32.0 31.0 33.0 33.0 33.0 31.0 31.0 31.0 32.0 31.0 31.0 31.0 32.0 31.0 31.0 31.0 31.0 31.0 31.0 31.0 31.0 30.0 31.0 31.0 30.0 31.0 31.0 30.0 31.0 31.0 30.0 31.0 30.0 31.0 31.0 30.0 31.0 31.0 30.0 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 29.0 31.0 29.0 30.0 29.0 30.0 29.0 30.0 28.0 29.0 28.0 29.0 29.0 29.0 29.0 28.0 29.0 27.0 29.0 29.0 28.0 29.0 27.0 29.0 27.0 29.0 28.0 28.0 29.0 27.0 29.0 27.0 28.0 28.0 27.0 29.0 27.0 27.0 28.0 27.0 28.0 27.0 28.0 28.0 27.0 27.0 27.0 27.0 28.0 27.0 28.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 26.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 26.0 27.0 27.0 26.0 27.0 27.0 26.0 27.0 26.0 26.0 27.0 26.0 26.0 27.0 26.0 25.0 27.0 26.0 25.0 27.0 26.0 25.0 27.0 26.0 25.0 27.0 26.0 25.0 27.0 26.0 25.0 26.0 26.0 25.0 26.0 26.0 25.0 26.0 26.0 25.0 25.0 26.0 25.0 25.0 26.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 25.0 
Out[93]:
array([ 0.1868 ,  0.16909, -0.15967])

The Adaptive Linear Neuron

After the Perceptron was introduced, a modified algorithm was formulated that uses an explicit optimisation of a cost function: the Adaptive Linear Neuron [2]. For prediction the non-linearity is applied to the activations in the same manner as with the perceptron, but for the purpose of learning the activation vector ${\bf a}$ is used, i.e. the vector of inner products of the weight vector ${\bf w}$ with the feature vectors ${\bf x}$.

A cost function is specified that measures the quality of the solution by taking a sum of squares, similar to the sum of squared errors in OLS linear regression:

$$ J({\bf w}) = \frac{1}{2} \sum_i (y^{(i)} - a^{(i)})^2 $$

where $y^{(i)}$ denotes the actual class of observation i, and $a^{(i)}$ denotes the activation based on the i-th observation. The cost function measures the errors of the $a^{(i)}$ versus the $y^{(i)}$.

This function is minimised with gradient descent, i.e. the weights are updated in the direction that decreases the value of the cost function. This can be done easily since the cost function has a simple partial derivative with respect to $w_j$, which turns out to be

$$ \frac{\partial J}{\partial w_j} = \sum_i (y^{(i)} - a^{(i)}) (-x^{(i)}_j) $$

The weight update is based on the combined error of ${\it all}$ observations, not applied after each individual observation as with the Perceptron:

$$ \Delta w_j = \eta \sum_i (y^{(i)} - a^{(i)}) x^{(i)}_j $$

Since ${\bf X}$ is the matrix containing all feature vectors as rows, we can compute the activation $a_i$ as

$$ a_i = {\bf X}_i \cdot {\bf w}$$

where ${\bf X}_i$ denotes line $i$ of matrix ${\bf X}$. This means that we can compute the complete activation vector as

$$ {\bf a} = {\bf X w}$$

The term ${\bf X w}$ denotes the matrix-vector product of matrix $\bf X$ with $r$ rows and $c$ columns, and vector $\bf w$ whose length has to be equal to $c$. The result is a vector of length $r$, the activations for all observations.
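
A quick shape check with the nonsep data makes this concrete; Xn and wn are throwaway names for this sketch.

In [ ]:
# a (100, 3) matrix times a length-3 vector gives a length-100 vector
Xn = nonsep[0]
wn = np.zeros(3)
print Xn.shape, np.dot(Xn, wn).shape   # (100, 3) (100,)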

Therefore, the complete vector of errors is

$$ {\bf y} - {\bf X} {\bf w} $$

which we need to multiply with the $x_j$ value for each observation and sum over $i$; this is achieved by another matrix-vector product:

$$ {\bf X}^T ({\bf y} - {\bf X} {\bf w}) $$

where ${\bf X}^T$ denotes the transpose of ${\bf X}$. The complete weight update is

$$ {\bf \Delta w} = \eta {\bf X}^T ({\bf y} - {\bf X} {\bf w}) $$

Thanks to Numpy this neat notation translates directly into Python code. The predict() function remains unchanged.

In [94]:
# adaptive linear neuron
def adaline(dataset, epochs, eta):
    X, y = dataset
    r, c = len(X), len(X[0])
    w = np.zeros(c)
    for ep in range(epochs):
        # error = target - activation 
        err = y - np.dot(X, w) 
        # batch update based on all samples
        w += eta * np.dot(X.T, err)
        # predict same as with perceptron
        print '%.0f' % (100.0 * sum([ predict(X[i], w) != y[i] for i in range(r) ])/float(len(y))), 
    print ""
    return w

adaline(nonsep, 500, 0.00001)
50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51 54 55 55 56 56 56 56 58 58 58 58 59 60 60 60 61 61 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 62 62 62 62 62 62 62 62 62 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 58 58 58 58 58 58 58 57 57 57 57 57 57 57 57 57 57 57 57 57 
Out[94]:
array([ 0.02994556,  0.09131567, -0.0298662 ])

The adaline() algorithm only converges with a suitable choice of learning rate:

In [95]:
# eta too large: gradient descent overshoots minimum and diverges
adaline(nonsep, 200, 0.0002)
58 56 72 64 64 64 64 63 63 63 61 61 61 61 60 61 61 61 61 60 59 59 59 58 57 57 57 56 55 54 53 53 52 52 51 50 49 48 48 48 47 45 45 44 42 42 41 39 39 39 39 39 39 39 39 38 38 37 37 36 35 34 34 34 34 33 33 32 32 32 32 33 33 33 32 32 31 31 30 30 30 29 29 28 28 27 27 27 27 27 27 27 27 27 26 26 26 26 26 25 25 25 25 25 25 25 25 25 25 24 24 24 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 22 16 16 16 16 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 14 14 14 14 14 14 14 14 14 14 14 14 
Out[95]:
array([ 0.16143922,  0.33614873, -0.36645765])
In [96]:
# check convergence for linearly separable problem
adaline(sep, 100, 0.0001)
50 50 50 50 50 50 50 50 50 50 50 50 50 49 49 49 48 46 44 44 42 42 38 38 33 32 27 27 24 23 19 14 13 13 13 13 12 7 6 6 6 2 2 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
Out[96]:
array([ 0.07280359,  0.21961878, -0.28327799])

Stochastic Gradient Descent

In recent years a number of improvements were made to the machine learning approaches discussed here, which led to renewed interest in the field under the term Deep Learning. One of these improvements is the realisation that it is not necessary to run through all observations before an update is made; a small number of observations is enough to provide a useful update for the gradient search. This technique is called stochastic gradient descent. It provides a huge speedup to the learning process and usually results in similar convergence towards the optimal solution.

In [97]:
# adaptive linear neuron with stochastic gradient descent
def adalinesgd(dataset, epochs=20, eta=0.0001):
    X, y = dataset
    r, c = len(X), len(X[0])
    w = np.zeros(c)
    for ep in range(epochs):
        # shuffle sample 
        for i in np.random.permutation(r):
            # update weights after each sample
            err = y[i] - np.dot(X[i], w) 
            w += eta * X[i] * err 
        # predict same as with perceptron
        print '%.1f' % (100.0 * sum([ predict(X[i], w) != y[i] for i in range(r) ]) / float(len(y))) ,
    print ""
    return w

# linearly separable problem
adalinesgd(sep, 100, 0.0001)
50.0 50.0 50.0 50.0 50.0 50.0 50.0 50.0 50.0 50.0 50.0 50.0 50.0 49.0 49.0 49.0 48.0 47.0 45.0 44.0 44.0 42.0 41.0 38.0 37.0 32.0 29.0 27.0 26.0 25.0 23.0 19.0 13.0 13.0 13.0 13.0 12.0 8.0 6.0 6.0 4.0 2.0 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
Out[97]:
array([ 0.07266318,  0.21914333, -0.28269132])
In [98]:
# non-separable problem
adalinesgd(nonsep, 500, 0.0001)
50.0 50.0 51.0 58.0 59.0 62.0 62.0 63.0 62.0 64.0 62.0 62.0 64.0 62.0 62.0 62.0 62.0 62.0 62.0 62.0 62.0 64.0 61.0 65.0 61.0 61.0 61.0 62.0 62.0 61.0 60.0 61.0 61.0 61.0 62.0 61.0 61.0 61.0 60.0 60.0 59.0 59.0 59.0 59.0 59.0 59.0 58.0 59.0 57.0 58.0 57.0 55.0 55.0 55.0 56.0 56.0 57.0 54.0 53.0 56.0 56.0 52.0 53.0 52.0 53.0 52.0 51.0 54.0 53.0 51.0 51.0 47.0 50.0 50.0 49.0 49.0 50.0 48.0 48.0 48.0 47.0 47.0 47.0 47.0 46.0 43.0 45.0 45.0 43.0 42.0 43.0 41.0 45.0 42.0 42.0 42.0 40.0 39.0 42.0 42.0 39.0 39.0 39.0 40.0 37.0 37.0 37.0 40.0 37.0 38.0 39.0 39.0 37.0 39.0 38.0 38.0 38.0 37.0 38.0 37.0 37.0 36.0 36.0 36.0 35.0 36.0 35.0 34.0 35.0 35.0 34.0 34.0 34.0 33.0 33.0 33.0 32.0 34.0 32.0 32.0 34.0 32.0 33.0 31.0 32.0 32.0 32.0 32.0 32.0 32.0 32.0 31.0 29.0 30.0 30.0 31.0 29.0 30.0 29.0 29.0 30.0 29.0 30.0 31.0 31.0 29.0 29.0 29.0 29.0 31.0 29.0 27.0 28.0 27.0 28.0 27.0 27.0 26.0 27.0 28.0 27.0 27.0 26.0 27.0 26.0 26.0 26.0 26.0 26.0 25.0 26.0 26.0 26.0 26.0 26.0 25.0 25.0 25.0 25.0 25.0 25.0 24.0 25.0 24.0 24.0 26.0 24.0 24.0 25.0 24.0 24.0 25.0 24.0 26.0 25.0 26.0 24.0 24.0 24.0 24.0 24.0 24.0 24.0 24.0 25.0 25.0 25.0 24.0 24.0 23.0 23.0 24.0 23.0 23.0 22.0 23.0 23.0 23.0 23.0 22.0 23.0 22.0 22.0 22.0 22.0 23.0 17.0 17.0 23.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0 25.0 16.0 16.0 24.0 22.0 22.0 15.0 22.0 22.0 15.0 16.0 16.0 14.0 16.0 15.0 21.0 16.0 16.0 16.0 13.0 16.0 16.0 16.0 16.0 14.0 13.0 15.0 16.0 16.0 16.0 15.0 14.0 13.0 16.0 13.0 13.0 15.0 13.0 16.0 15.0 15.0 13.0 14.0 16.0 16.0 16.0 15.0 15.0 15.0 15.0 15.0 15.0 16.0 15.0 15.0 15.0 15.0 15.0 14.0 14.0 15.0 14.0 13.0 14.0 14.0 13.0 13.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 14.0 15.0 14.0 15.0 15.0 13.0 14.0 15.0 15.0 15.0 15.0 14.0 13.0 14.0 15.0 14.0 14.0 14.0 15.0 15.0 15.0 14.0 15.0 15.0 14.0 15.0 15.0 14.0 14.0 14.0 14.0 14.0 13.0 14.0 15.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0 13.0 13.0 14.0 13.0 14.0 14.0 13.0 14.0 13.0 13.0 14.0 14.0 14.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0 13.0 14.0 14.0 13.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0 13.0 14.0 14.0 12.0 14.0 14.0 14.0 14.0 13.0 14.0 13.0 13.0 14.0 13.0 13.0 14.0 14.0 14.0 14.0 13.0 14.0 14.0 14.0 14.0 13.0 13.0 14.0 14.0 13.0 13.0 14.0 12.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0 13.0 14.0 13.0 13.0 14.0 13.0 14.0 13.0 14.0 14.0 14.0 13.0 13.0 13.0 13.0 13.0 12.0 14.0 12.0 11.0 11.0 13.0 14.0 11.0 14.0 14.0 13.0 11.0 14.0 12.0 12.0 14.0 11.0 13.0 13.0 14.0 11.0 11.0 
Out[98]:
array([ 0.19186579,  0.38515032, -0.43342527])

Logistic Regression Classifier

A single-layer net with a sigmoid output function is a logistic regression classifier. The sigmoid function is

$$ f(z) = \frac{1}{1 + e^{-z}} $$

As error function we take the sum of squares over all observations for target $y_i$ minus output $o_i$:

$$ E = \frac{1}{2} \sum_{i=1}^N (y_i - o_i)^2 ~~~~ \text{with} ~~~ o_i = f(x_i\cdot w) $$

Since $f'(z) = f(z) (1 - f(z))$ the partial derivative with respect to the weight $w_j$ is

$$\frac{\partial E}{\partial w_j} = -\sum_{i=1}^N f(x_i\cdot w) (1 - f(x_i\cdot w)) (y_i - o_i) x_{i,j} $$

and the learning rule

$$ \Delta w_j = -\eta \frac{\partial E}{\partial w_j} $$

We compute the sum over all observations as a matrix product with the transpose $X^T$, which results in concise and efficient code:

In [99]:
def sigout(x, w):
    return 1.0 / (1.0 + np.exp(-1.0 * np.dot(x, w)))

def signet(dataset, epochs, eta=0.001):
    X, y = dataset
    r, c = len(X), len(X[0])
    w = np.zeros(c)
    for ep in range(epochs):
        out = sigout(X, w)
        dsigout = out * (1 - out)
        w -= eta * np.dot(X.T, dsigout * (out - y))
        print '%.0f' % (np.sum(np.round(out) != y)) ,
    print ""
    return w

print 'sep:'
signet(sep, 100)
print 'nonsep:'
signet(nonsep, 200)
sep:
50 50 50 50 50 50 50 50 50 50 50 50 50 48 46 43 40 28 17 12 6 3 2 2 2 2 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
nonsep:
50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 49 49 49 49 49 49 49 48 48 48 48 48 48 48 48 47 47 47 47 46 46 45 45 45 43 43 42 41 40 39 39 38 38 37 37 37 37 37 36 35 35 34 34 33 33 32 32 32 32 32 31 30 30 30 29 29 27 27 27 27 27 26 24 24 24 24 24 24 23 22 22 22 22 22 22 22 22 20 20 20 20 20 20 20 20 20 19 19 18 18 18 18 18 18 18 18 18 18 18 18 18 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 16 15 15 15 15 15 15 15 15 15 15 14 14 14 14 14 14 14 12 12 12 12 12 12 12 12 12 
Out[99]:
array([ 0.1819682 ,  0.39337021, -0.56350111])

MNIST dataset

The training part of the MNIST dataset contains 50000 28x28 images of hand-written digits encoded as gray values from 0 to 1. The pickle file contains 28x28 = 784 floating point numbers for each image, and the corresponding list of integer labels. There are 10000 additional images for validation, and another 10000 for testing.

A two-layer neural net can reach any error rate > 0 on the training data with enough hidden units. The practical interest lies in the generalization, i.e. did the net learn 'by rote' or did it actually capture the concept, in this case the shape of digits. Therefore, a separate dataset is used for validation. However, since the parameters of the method (such as the number of hidden units and the learning rate) are themselves set according to the results on the validation set, another dataset is used for the final test. Obviously this only makes sense when there are no more parameter changes after the testing phase. In this text we will not fiddle with parameters; we train once and stop after evaluating on the validation set.

In [100]:
import cPickle, gzip
f = gzip.open('mnist.pkl.gz', 'rb')
# the MNIST set contains training, testing, AND validation data
mtrain, mvalid, mtest = cPickle.load(f)
f.close()

img, lab = mtrain
print len(img)
# print first image data > 0
print [ round(x,2) for x in img[0] if x > 0 ], lab[0]
50000
[0.01, 0.07, 0.07, 0.07, 0.49, 0.53, 0.68, 0.1, 0.65, 1.0, 0.96, 0.5, 0.12, 0.14, 0.37, 0.6, 0.66, 0.99, 0.99, 0.99, 0.99, 0.99, 0.88, 0.67, 0.99, 0.95, 0.76, 0.25, 0.19, 0.93, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99, 0.98, 0.36, 0.32, 0.32, 0.22, 0.15, 0.07, 0.86, 0.99, 0.99, 0.99, 0.99, 0.99, 0.77, 0.71, 0.96, 0.94, 0.31, 0.61, 0.42, 0.99, 0.99, 0.8, 0.04, 0.17, 0.6, 0.05, 0.0, 0.6, 0.99, 0.35, 0.54, 0.99, 0.74, 0.01, 0.04, 0.74, 0.99, 0.27, 0.14, 0.94, 0.88, 0.63, 0.42, 0.0, 0.32, 0.94, 0.99, 0.99, 0.46, 0.1, 0.18, 0.73, 0.99, 0.99, 0.59, 0.11, 0.06, 0.36, 0.98, 0.99, 0.73, 0.97, 0.99, 0.97, 0.25, 0.18, 0.51, 0.71, 0.99, 0.99, 0.81, 0.01, 0.15, 0.58, 0.89, 0.99, 0.99, 0.99, 0.98, 0.71, 0.09, 0.45, 0.86, 0.99, 0.99, 0.99, 0.99, 0.79, 0.3, 0.09, 0.26, 0.83, 0.99, 0.99, 0.99, 0.99, 0.77, 0.32, 0.01, 0.07, 0.67, 0.86, 0.99, 0.99, 0.99, 0.99, 0.76, 0.31, 0.04, 0.21, 0.67, 0.88, 0.99, 0.99, 0.99, 0.99, 0.95, 0.52, 0.04, 0.53, 0.99, 0.99, 0.99, 0.83, 0.53, 0.52, 0.06] 5
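
The loader above uses Python 2's cPickle module. If you are on Python 3, the same file can be read with the standard pickle module; the encoding='latin1' argument is needed because the file was pickled under Python 2. A minimal sketch:

In [ ]:
# Python 3 variant of the loader above
import pickle, gzip
with gzip.open('mnist.pkl.gz', 'rb') as f:
    mtrain, mvalid, mtest = pickle.load(f, encoding='latin1')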

The dataset is not sorted by label:

In [101]:
print lab[:100]
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 9 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]

To apply our previously defined functions we want only two classes: we pick cases with label 0 or 1 from the dataset.

In order to verify that the data indeed encodes images of handwritten digits we use a Python idiom to print images using ASCII characters by showing values greater than 0.5 as asterisks:

In [102]:
# pick cases with label 0 or 1 
ix = [ i for i in range(len(lab)) if lab[i] in (0, 1) ]
X = np.array(img[ix])
y = np.array(lab[ix])
onezero = (X, y)

print "rows & cols in train set:", len(X), len(X[0]), "   first label:", y[0]

# define a function to avoid code duplication
def asciiplot(x):
    print "\n".join( [ " ".join([ ".*"[int(x[j + i*28] > 0.5)] for j in range(28) ]) for i in range(28) ] )

asciiplot(X[0])
rows & cols in train set: 10610 784    first label: 0
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . * * * . . . . . . . . .
. . . . . . . . . . . . . . . * * * * * . . . . . . . .
. . . . . . . . . . . . . . * * * * * * . . . . . . . .
. . . . . . . . . . . . . * * * * * . * * . . . . . . .
. . . . . . . . . . . * * * * * * * . * * * . . . . . .
. . . . . . . . . . . * * * * . * * . . * * . . . . . .
. . . . . . . . . . * * * * . . . . . . * * . . . . . .
. . . . . . . . . * * * * . . . . . . . * * * . . . . .
. . . . . . . . * * * . . . . . . . . . * * * . . . . .
. . . . . . . . * * . . . . . . . . . . * * * . . . . .
. . . . . . . * * * . . . . . . . . . . * * * . . . . .
. . . . . . . * * . . . . . . . . . . . * * * . . . . .
. . . . . . . * * . . . . . . . . . . * * * . . . . . .
. . . . . . . * * . . . . . . . . . * * * . . . . . . .
. . . . . . . * * . . . . . . . . * * * . . . . . . . .
. . . . . . . * * . . . . . . . * * * . . . . . . . . .
. . . . . . . * * * . . . * * * * * . . . . . . . . . .
. . . . . . . * * * * * * * * * * . . . . . . . . . . .
. . . . . . . * * * * * * * * . . . . . . . . . . . . .
. . . . . . . . . * * * * . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .

The images have been centered, so the hand-written lines for zeros and ones occupy different areas with not too much overlap.

In [103]:
asciiplot(X[1])
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . * * . . . . . . .
. . . . . . . . . . . . . . . . . . * * * . . . . . . .
. . . . . . . . . . . . . . . . . . * * * . . . . . . .
. . . . . . . . . . . . . . . . . * * * . . . . . . . .
. . . . . . . . . . . . . . . . * * * . . . . . . . . .
. . . . . . . . . . . . . . . * * * * . . . . . . . . .
. . . . . . . . . . . . . . . * * * . . . . . . . . . .
. . . . . . . . . . . . . . * * * . . . . . . . . . . .
. . . . . . . . . . . . . . * * * . . . . . . . . . . .
. . . . . . . . . . . . . * * * . . . . . . . . . . . .
. . . . . . . . . . . . * * * * . . . . . . . . . . . .
. . . . . . . . . . . * * * * . . . . . . . . . . . . .
. . . . . . . . . . . * * * * . . . . . . . . . . . . .
. . . . . . . . . . * * * * . . . . . . . . . . . . . .
. . . . . . . . . . * * * . . . . . . . . . . . . . . .
. . . . . . . . . * * * * . . . . . . . . . . . . . . .
. . . . . . . . . * * * * . . . . . . . . . . . . . . .
. . . . . . . . . * * * . . . . . . . . . . . . . . . .
. . . . . . . . . * * * . . . . . . . . . . . . . . . .
. . . . . . . . . * * * . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .

The adaptive linear neuron learns these patterns very quickly and achieves about 1% error rate:

In [104]:
print len(onezero[0])
w = adalinesgd(onezero, 10, 0.0001)
10610
1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 

In a similar way to the ASCII plots above we display the weights. Clearly there are areas of negative weights that detect zeros and positive weight areas that detect ones; the remaining areas show very small weights and are effectively ignored.

In [105]:
lw = 0.001
print len([ x for x in w if abs(x) > lw ]), "weights with abs(w) >", lw
wd = np.array([ 'o' for i in range(len(w)) ])
wd[ (w < 0) & (abs(w) > lw) ] = '-'
wd[ (w > 0) & (abs(w) > lw) ] = '+'
print "\n".join( [ " ".join([ wd[j + i*28] for j in range(28) ]) for i in range(28) ] )
382 weights with abs(w) > 0.001
o o o o o o o o o o o o o o o o o o o o o o o o o o o o
o o o o o o o o o o o o o o o o o o o o o o o o o o o o
o o o o o o o o o o o o o + + + + o o o o o o o o o o o
o o o o o o o o o o o + + + + + + + + + + o o o o o o o
o o o o o o o o o + + + + + + o + + + o + + + + o o o o
o o o o o o o o + + + + + + + + + + + + + + + + + o o o
o o o o o o o o + + + + + + + + + + o + + + + + + o o o
o o o o o o o o o - - o + o + + + + + + + + o - o o o o
o o o o o o o o o - - o + + + + + + + + o - - - - o o o
o o o o o o - o o - - + + + + o - + + + - - - - - o o o
o o o o o o - - - - - + + o + + + o - o - - - - - o o o
o o o o o o - - - - o + o - + + o - - - - - - - - o o o
o o o o o - - - - - o - - - + + - - - - - - - - - o o o
o o o o o - - - - - - - - - + + - - - - - - - - - o o o
o o o o o - - - - - - - - + + + - - - - - - - - - o o o
o o o o - - - - - - - - - + + + - - - - - - - - - o o o
o o o o - - - - - - - - + + + - - - - - - - - - - o o o
o o o o - - - - - - - + + + + - - - - - - - - - - o o o
o o o o - - - - - o + + + + + + + - - - - - - - - o o o
o o o o - - - - o + + + o - + + + + o - - - - - o o o o
o o o o - - o + + + + + o - + + + + + + o - - o + o o o
o o o o o o + + + + + + o + + + + + + + + + o o + o o o
o o o o + + + + + + + + - - + + + + + + + + o o o o o o
o o o o + + + + + o - - + - - o + + + + + + o o o o o o
o o o o o + + + + + + + + + + + + + + + + + o o o o o o
o o o o o o o o + o o o + + + + o o o o o o o o o o o o
o o o o o o o o o o o o o o o o o o o o o o o o o o o o
o o o o o o o o o o o o o o o o o o o o o o o o o o o o

Using Python's matplotlib we can create a color plot showing the weights as hot and cold areas:

In [106]:
import matplotlib as mpl
import matplotlib.cm as cm

def plotw(w):
    wm = [ [ w[i*28+j] for j in range(28) ] for i in range(28) ]
    fig, ax = plt.subplots(figsize=(2, 2))
    im = ax.imshow(wm, vmin=-0.001, vmax=0.001 )
    fig.colorbar(im)
    plt.show()
    
plotw(w)

Linear Softmax Classifier

Neural nets with only one output value are too limited for practical applications. A common technique to predict one of several classes for an observation is the softmax classifier.

The net produces output scores for each class using the dot product on inputs and weights. The resulting scores $f$ are interpreted as unnormalized log probabilities.

The cross-entropy loss function for observation $i$ is

$$ L_i = - \log (\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}) $$

Since $y_i$ holds the correct class, the expression inside the log is the (normalized) probability of the correct class. When the correct class probability goes towards 1 the loss goes towards 0, since $\log(1) = 0$.

Define $p$ as the vector of normalized probabilities with

$$ p_k = \frac{e^{f_k}}{\sum_j e^{f_j}} $$

Then we have

$$ \frac{\partial L_i}{\partial f_k} = p_k - 1(y_i = k) $$

where $1(y_i = k)$ is 1 if $y_i = k$ and 0 otherwise.
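
A small numeric example may help; the scores below are made up for a single observation with three classes, and the correct class is 0.

In [ ]:
f = np.array([2.0, 1.0, 0.1])     # unnormalized scores
p = np.exp(f) / np.sum(np.exp(f)) # normalized probabilities
print np.round(p, 3)              # roughly [ 0.659  0.242  0.099 ]
print round(-np.log(p[0]), 3)     # cross-entropy loss, about 0.417
dL = p.copy()
dL[0] -= 1                        # gradient p_k - 1(y_i = k)
print np.round(dL, 3)             # roughly [-0.341  0.242  0.099 ]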

This leads to very concise code in the implementation. First we define a function for the normalized probabilities:

In [107]:
def softscor(X, W):
    scor = np.dot(X, W)
    escor = np.exp(scor)
    return escor / np.sum(escor, axis=1, keepdims=True)

For the class prediction we use the argmax() function which returns the index of a vector's maximum:

In [108]:
def softpred(x):
    return [ np.argmax(pr) for pr in x ]

We print the error rate with more precision since the 0/1 problem is very easy.

In [109]:
# % false classifications
def erate(scor, y):
    return '%.1f' % ((100.0 * sum(softpred(scor) != y )) / len(y))

In the code below we determine the number of classes from the parameter y by converting the vector elements into a set and taking the size of that set, in other words, the number of different values in y.

Also, note the expression for the gradient which implements $p_k - 1(y_i = k)$.

In [110]:
def linsoftmax(dataset, stepsize=0.5, steps=30):
    X, y = dataset
    D = len(X[0])
    # determine number of classes from set of target values
    K = len(set(y)) 
    W = 0.01 * np.random.randn(D,K)
    for step in range(steps):
        probs = softscor(X, W)
        if (step % 10) == 0: print erate(probs, y),  
        # gradient
        dscor = probs
        dscor[range(len(y)), y] -= 1
        dscor /= len(y)
        dW = np.dot(X.T, dscor)
        # regularization
        dW += 0.001 * W
        # parameter update in the negative gradient direction to decrease loss
        W += -stepsize * dW
    print '\n% error on training set:', erate(softscor(X, W), y)
    return W

W = linsoftmax(onezero)
print W.shape
38.7 0.4 0.3 
% error on training set: 0.3
(784, 2)

The error on the 0/1 training set is very low. However, as the code above can deal with any number of outputs we can now proceed to the full MNIST challenge:

In [111]:
print 'Full MNIST:'
X, y = mtrain
print 'len(y):', len(y)
print 'first 30 y: ', y[:30]

print 'training linsoftmax:'
Wlin = linsoftmax(mtrain, steps=100)
print Wlin.shape
Full MNIST:
len(y): 50000
first 30 y:  [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7]
training linsoftmax:
89.2 16.4 14.2 13.3 12.6 12.1 11.7 11.5 11.2 11.0 
% error on training set: 10.8
(784, 10)

As expected the error on the full digit classification remains rather high. We introduce a different architecture to tackle the problem.

Two-Layer Softmax Classifier

With a hidden layer the network has the theoretical power to approximate any function to within any error > 0, given enough units in the hidden layer. The learning rule uses back-propagation to pass the errors back through the net, starting at the output layer.

This implementation uses ReLU units to compute the activations for the hidden layer. The Rectified Linear Unit provides a non-linearity:

$$ r(x) = \begin{cases} x & \text{if}\ x \gt 0 \\ 0 & \text{otherwise} \end{cases} $$

ReLU units are frequently used instead of sigmoid units in multi-layer architectures as they avoid the vanishing gradient problem in many applications.
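
In Numpy the ReLU is a one-liner, np.maximum(0, z); this is exactly how the hidden activations are computed in the code below (the values in z are made up).

In [ ]:
# the ReLU non-linearity clips negative values to zero
z = np.array([-2.0, -0.1, 0.0, 0.5, 3.0])
print np.maximum(0, z)   # [ 0.   0.   0.   0.5  3. ]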

In [112]:
def netpred(X, W, W2, mode="relu"):
    hid = np.maximum(0, np.dot(X, W)) 
    escor = np.exp(np.dot(hid, W2))   
    return escor / np.sum(escor, axis=1, keepdims=True), hid

# introduce hidden layer, SGD
def netsoftmax(dataset, h, stepsize=0.5, steps=50, mode="relu"):
    X_, y_ = dataset
    reg = 0.001
    D = len(X_[0])
    K = len(set(y_)) # number of classes
    W = 0.1 * np.random.randn(D,h)
    W2 = 0.1 * np.random.randn(h, K)
    for step in range(steps):
        # SGD
        ix = np.random.choice(len(y_), min(200, len(y_)), replace=False)
        X, y = X_[ix], y_[ix]
        probs, hid = netpred(X, W, W2, mode)
        if (step % 100) == 0: print erate(probs, y) , 
        # gradient
        dscor = probs
        dscor[range(len(y)), y] -= 1
        dscor /= len(y)
        # backprop to hidden layer
        dW2 = np.dot(hid.T, dscor)
        dhid = np.dot(dscor, W2.T)
        dhid[hid <= 0] = 0
        dW = np.dot(X.T, dhid)
        dW += reg * W
        dW2 += reg * W2
        # parameter update in the negative gradient direction to decrease loss
        W += -stepsize * dW
        W2 += -stepsize * dW2
    # print % false classifications on complete X
    probs, hid = netpred(X_, W, W2)
    print '\nerror on training set: ', erate(probs, y_)
    return W, W2

print 'training netsoftmax with ReLU units in layer 1:'
W, W2 = netsoftmax(mtrain, 1000, steps=500, stepsize=0.5)
training netsoftmax with ReLU units in layer 1:
96.0 9.0 4.5 5.5 3.0 
error on training set:  3.1

The error rate on the training set is not representative of the real-world performance of the net. We need to compute the error rate on data the net has not been trained on.

For this reason the MNIST data is split into several sets. To evaluate the performance we use the weights learned from the training set and measure the error rate on the validation set.

In [113]:
print('Validating full MNIST..')
X, y = mvalid

probs = softscor(X, Wlin)
print 'linsoftmax error on validation set: ', erate(probs, y)

probs, hid = netpred(X, W, W2)
print 'netsoftmax error on validation set:  ', erate(probs, y)
Validating full MNIST..
linsoftmax error on validation set:  9.6
netsoftmax error on validation set:   3.6

Training set, Test set, Validation set

In our example we proceeded from the training set to the validation set and stopped there; we did not adjust any parameters, such as the number of hidden units, to improve the performance on the validation set. If we do that we need to introduce an additional step:

  1. the net is trained on the training set
  2. the performance is measured on the validation set
  3. steps 1 and 2 are repeated and parameters such as the net architecture are adjusted to achieve better performance; therefore, the performance on the validation set is optimized towards the particular content of that set. To get a better estimate of the real-world performance of the net another step is added:
  4. the final evaluation is done on the test set; this is the error rate that is reported.

Note that this process is only meaningful if there are no more adjustments after the evaluation on the test set.
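
For completeness, the final evaluation would simply reuse the functions defined above on the test part of the MNIST data; a minimal sketch, not executed here since we stopped at the validation set:

In [ ]:
# final, one-off evaluation on the held-out test set
X, y = mtest
probs, hid = netpred(X, W, W2)
print 'netsoftmax error on test set:', erate(probs, y)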

References

[1] F. Rosenblatt. The perceptron, a perceiving and recognizing automaton (Project Para). Report 85-460-1, Cornell Aeronautical Laboratory, 1957.

[2] B. Widrow et al. Adaptive “Adaline” neuron using chemical “memistors”. Technical Report 1553-2, Stanford Electronics Laboratories, 1960.

[3] S. Raschka. Single-Layer Neural Networks and Gradient Descent. http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html

[4] Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/neural-networks-case-study/

[5] UCI Machine Learning Repository, Iris Data Set. https://archive.ics.uci.edu/ml/datasets/Iris