Single-Layer Connectionist Models

The Perceptron

In 1958 Frank Rosenblatt presented the perceptron algorithm, which was designed to be implemented in hardware and was hailed as the beginning of a new era of intelligent machines. Sadly, it soon turned out that the approach was severely limited in the types of problems it could learn. However, it provided the foundations for later developments in neural net models, and it still serves as an excellent example of a simple connectionist approach.

Decision Rule

The perceptron uses a weight vector ${\bf w}$ and a threshold $b$ to make a 0/1 decision on a numeric input vector ${\bf x}$.

This can be interpreted as whether or not an observation belongs to a certain class. The output $o$ is computed as

$$ o = \begin{cases} 1, & \text{if}\ \sum x_i w_i > b \\ 0, & \text{otherwise} \end{cases} $$
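
In NumPy the decision rule is a single comparison. Here is a minimal sketch with made-up weights, threshold, and input (the numbers are purely illustrative):

In [ ]:
import numpy as np

# illustrative weights, threshold, and a single observation (not from any dataset)
w = np.array([0.2, -0.1, 0.4, 0.3])
b = 0.5
x = np.array([5.1, 3.5, 1.4, 0.2])

o = int(np.dot(x, w) > b)   # 1 if the weighted sum exceeds the threshold, 0 otherwise
print(o)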

Learning Rule

The perceptron starts with random weights. It uses supervised learning to update the weights after each observation according to a simple rule:

$$ {\bf w} \leftarrow {\bf w} + l (y - o) {\bf x} $$

where $y$ is the correct class of observation ${\bf x}$.

The observations are presented and the learning rule is applied until

  • all errors $(y-o)$ are zero, meaning that there are no more updates, or
  • a given number of steps has been reached.
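
To make the rule concrete, here is a single update step on one made-up observation (again, the numbers are purely illustrative):

In [ ]:
import numpy as np

# the true class is 1, but with zero weights the prediction is 0
w = np.zeros(4)
x = np.array([5.1, 3.5, 1.4, 0.2])
b, l, y_i = 0.5, 0.02, 1

o = float(np.dot(x, w) > b)    # prediction with the current weights: 0.0
w = w + l * (y_i - o) * x      # y - o = 1, so the weights move towards x by l*x
print(w)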

Given a suitable choice of

  • matrix ${\bf X}$ with one observation per row
  • vector ${\bf y}$ of correct classes
  • a very small learning rate $l$
  • a threshold $b$, e.g. $b = 0.5$

the perceptron weights will converge to a steady state if the problem is linearly separable; this is best shown in an example.

The Iris dataset

The implementation of the perceptron algorithm is straightforward.

We will use the Iris dataset:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Put this file into the current directory of the notebook. On a notebook server you first have to download the file to your computer and then upload it to the server (using the Upload button on the directory tab in the browser, i.e. the one you see when you connect to the notebook server).
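
If you prefer to fetch the file from code, something along these lines should also work (a sketch, assuming the machine running the notebook has direct internet access):

In [ ]:
import urllib.request

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
urllib.request.urlretrieve(url, 'iris.data')   # saves the file next to the notebook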

In [67]:
import numpy as np

data = np.genfromtxt('iris.data', delimiter=',', usecols=(0,1,2,3))
print(data[:5,])
print(data.shape)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
(150, 4)

We construct the matrix X and the vector y using our knowledge of the data file:

  • the observations come in groups of 50
  • we only use the first two classes
  • in the vector y we encode the first class as 0 and the second class as 1
In [68]:
X = data[:100,]
y = np.zeros(100)
y[50:] = 1

The first two classes are linearly separable.

This concept is illustrated in the plot below:

In [69]:
import matplotlib.pyplot as plt

plt.plot(X[:50,0], X[:50,1], 'o', c='y')
plt.plot(X[50:,0], X[50:,1], 'o', c='k')
plt.plot([4.5, 6.5], [2.1, 4.6], linewidth=1, c='grey')
plt.show()

We can find a straight line that completely separates the yellow and black dots.

Since this may not be immediately obvious, we have added such a line.

The plot above only shows columns 0 and 1, but in this dataset the two classes are linearly separable in all combinations of columns:

In [70]:
c = [[[0,1],[0,2],[0,3]],[[1,2],[1,3],[2,3]]]
fig, axs = plt.subplots(2,3,figsize=(12,10))
for i in range(2):
    for j in range(3):
        k,l = c[i][j][0], c[i][j][1]
        axs[i][j].plot(X[:50,k], X[:50,l], 'o', c='y')
        axs[i][j].plot(X[50:,k], X[50:,l], 'o', c='b')
plt.show()

This situation leads us to believe that the perceptron will learn to separate the first two classes in the dataset perfectly. Let us code!

Implementing the Perceptron

We start by setting the random seed; we may want to make changes later, and we want to see the effects of those changes, not the effects of the randomness in the initial weights.

In [71]:
np.random.seed(seed=3)

The perceptron algorithm is rather easy to implement. For the perceptron decision rule we define a pretty-print function, using Python's lambda notation:

f = lambda x,y: x+y

simply means that f is a function of x and y, and the result is the expression x+y; for example, f(2, 3) evaluates to 5.

In [72]:
pretty = lambda X,w,b: ''.join([ '01'[c*1] for c in np.dot(X,w)>b ])

def percep(X, y, b, l, epochs=5):
    w = np.random.rand(4)
    for j in range(epochs):
        for i in range(X.shape[0]):
            o = float(sum(X[i,] * w) > b)
            w += l * (y[i] - o) * X[i,]
        if j % (epochs/5) == 0:
            print(w)
            print(pretty(X,w,b))
    return w

b = 0.5
w = percep(X, y, b, 0.02)
[-0.1092021   0.21014782  0.15290474  0.49682761]
0000111000110001111101111010010111000000100110101011111111111111111111111111111111111111111111111111
[-0.0692021   0.20214782  0.21890474  0.52082761]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
[-0.1712021   0.13214782  0.19090474  0.51682761]
0000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111
[-0.1712021   0.13214782  0.19090474  0.51682761]
0000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111
[-0.1712021   0.13214782  0.19090474  0.51682761]
0000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111
  • The process of going through all observations once is commonly called an epoch.
  • To monitor progress the print statements are executed five times during the run.
    • In case of a high number of epochs we do not want many outputs cluttering our display
    • Therefore we use the modulo operator
  • We see that the perceptron finds suitable weights very quickly.
  • After that the weights no longer change, as $(y-o)$ is always zero.

The decision rule itself is so simple it does not even need a defined function.

np.dot() computes the matrix-vector product i.e. applies the decision rule to all observations:

In [73]:
np.dot(X,w)>b
Out[73]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

As mentioned above the perceptron does not fare so well with problems that are not linearly separable, such as classes 2 and 3 in the Iris dataset:

In [74]:
X = data[50:,]

Note that we do not need to change our y vector; the two classes now in X simply keep the same 0/1 encoding.

Plotting the situation for the first two columns immediately shows the difference here:

In [75]:
plt.plot(X[:50,0], X[:50,1], 'o', c='y')
plt.plot(X[50:,0], X[50:,1], 'o', c='k')
plt.show()

We cannot find a straight line to separate the two classes.

Correspondingly, the perceptron cannot find a perfect solution:

In [76]:
w = percep(X, y, b, 0.01, epochs=5000)
[ 0.17894695  0.56829309 -0.35341469  0.06524288]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
[-1.24805305 -1.39170691  1.80658531  2.42524288]
0000000000000000101010100001000001100000000000000011111111111111111111111111111111111111111111111111
[-1.58505305 -1.57970691  2.04158531  3.10324288]
0000000000000000001010100000000001100000000000000011111111111111111111111111111111111111111111111111
[-1.71305305 -1.78370691  2.19958531  3.39424288]
0000000000000000000010100000000001000000000000000011111111111111111111111111111111111111111111111111
[-1.74705305 -1.84070691  2.35158531  3.60124288]
0000000000000000101010100001000001100000000000000011111111111111111111111111111111111111111111111111
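
We can also count how many observations the final weights misclassify, for example:

In [ ]:
# number of observations (out of 100) that the final weights get wrong
errors = np.sum((np.dot(X, w) > b) != y)
print(errors)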

The Single-Layer Neural Net

While the perceptron used a rather ad-hoc learning rule, the artificial neural net introduces a number of improvements:

  • a differentiable activation function instead of a step function
  • an explicit and differentiable error function as measure of cost
  • transforming learning into an optimization problem

The single-layer neural net is very similar to the perceptron as far as the decision rule is concerned:

$$ o = f(\sum w_i x_i) $$

For $f$ the traditional choice used to be the sigmoid function:

In [77]:
f = lambda x: 1 / (1 + np.exp(-x))

This function is bounded by [0,1] and has a nice derivative:

$$ f'(x) = f(x)(1 - f(x)) $$
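
We can verify this identity numerically by comparing a finite-difference approximation with $f(x)(1-f(x))$, using the f defined above, for example:

In [ ]:
xs = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric  = (f(xs + eps) - f(xs - eps)) / (2 * eps)   # central finite difference
analytic = f(xs) * (1 - f(xs))
print(numeric)
print(analytic)   # the two should agree to several decimal places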

Let us plot the sigmoid function:

In [78]:
xs = np.linspace(-5, 5, 30)
plt.plot(xs, f(xs))
plt.show()
  • np.linspace() gives us a range over the x-values
  • Note how the curve flattens as we approach -5 and +5

Let's first go back to the easy version of our learning problem, the linearly separable case, and initialize the weights randomly:

In [79]:
np.random.seed(1)
X = data[:100,]
w = np.random.rand(4) - 0.5
w
Out[79]:
array([-0.082978  ,  0.22032449, -0.49988563, -0.19766743])

Computing the weighted sums $\sum w_i x_i$ for all observations shows that we are still far from the desired 0/1 vector:

In [80]:
np.dot(X, w) 
Out[80]:
array([-0.39142541, -0.48499206, -0.374343  , -0.48805477, -0.36109516,
       -0.51768818, -0.3917356 , -0.45514862, -0.46553551, -0.49318143,
       -0.42224247, -0.48854159, -0.45692752, -0.26547283, -0.23937063,
       -0.33244221, -0.31773393, -0.41119215, -0.54484729, -0.39508337,
       -0.58831695, -0.43688256, -0.12794971, -0.64475622, -0.63850727,
       -0.59326698, -0.54467067, -0.44971177, -0.42175566, -0.52430868,
       -0.55463893, -0.52787331, -0.29775033, -0.27038946, -0.49318143,
       -0.34924783, -0.37462805, -0.49318143, -0.3935145 , -0.46344642,
       -0.35290579, -0.57580618, -0.3494496 , -0.56217171, -0.61480436,
       -0.496461  , -0.42530519, -0.41603376, -0.41394467, -0.42719251,
       -2.50200442, -2.37200725, -2.63548294, -2.2061428 , -2.5184234 ,
       -2.36251896, -2.46142086, -1.72510338, -2.46515527, -2.06289778,
       -1.92150811, -2.22461746, -2.21036401, -2.49342158, -1.88229165,
       -2.34917779, -2.34968975, -2.13359473, -2.57573614, -2.08085365,
       -2.53978416, -2.14576735, -2.71789084, -2.47592054, -2.29859398,
       -2.36291244, -2.62352719, -2.73044184, -2.4049134 , -1.84739801,
       -2.04459974, -1.97484443, -2.07315109, -2.76867641, -2.33309415,
       -2.31451789, -2.51891022, -2.47247944, -2.11020201, -2.1620779 ,
       -2.32023295, -2.42140057, -2.1451721 , -1.75543363, -2.22628792,
       -2.14872163, -2.19052082, -2.28199838, -1.58946759, -2.16256471])

Our cost function is the sum of squares of the errors over all observations:

$$ \begin{align} h & = \sum w_i x_i \\ o & = f(h) \\ E({\bf w}) & = \frac{1}{2} \sum_n^N (o_n - y_n)^2 \end{align} $$
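
For the random weights above we can already evaluate this cost directly, for example:

In [ ]:
# cost for the current (random) weights on the first 100 observations
E = 0.5 * np.sum((f(np.dot(X, w)) - y)**2)
print(E)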

It can be shown that the weight update for $w_i$ follows from the derivative of the error function:

$$ \Delta w_i = - l (o - y) f'(h) x_i $$
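
This is just the chain rule applied to a single observation:

$$ \frac{\partial E}{\partial w_i} = (o - y)\,\frac{\partial o}{\partial h}\,\frac{\partial h}{\partial w_i} = (o - y)\,f'(h)\,x_i $$

and stepping against the gradient with learning rate $l$ gives the $\Delta w_i$ above.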

When we sum over all observations (or a subset) we get our learning rule:

$$ w_i \leftarrow w_i + \sum_n \Delta w_i $$

Implementation

All the above leads to rather neat code:

In [81]:
def nn1(X, y, l=0.01, epochs=50):
    w = np.random.rand(X.shape[1]) - 0.5
    for ep in range(epochs):
        h = np.dot(X, w)
        o = f(h)
        w += np.dot(-l * (o - y) * f(h) * (1 - f(h)), X)
        if ep % (epochs/5) == 0: print(w, sum((o-y)**2))
    return w, sum((o-y)**2)

w, e = nn1(data[:100,], y)
[-0.32910266 -0.39680449 -0.29589892 -0.14884718] 49.12110221705671
[ 0.22659626 -0.39578413  0.61916581  0.18876119] 13.838564532162886
[-0.07911453 -0.84146369  1.00662965  0.37287464] 2.4222261442560837
[-0.11563792 -0.94091562  1.15123897  0.43818264] 1.5969713632623173
[-0.13810477 -1.01008924  1.25713563  0.48606875] 1.2019073092214625

Since our output function can never produce exactly zero or one, the sum of squared errors remains above zero.

But how did we do it?

  • we compute the inner product h in one go for all observations using np.dot()
  • our function f() also expands automatically to a vector of inputs

But the real beauty is in the next statement!

Let's do this slowly, and use only the first five observations i.e. X[:5,] and y[:5]:

In [82]:
h = np.dot(X[:5,], w)
h
Out[82]:
array([-2.52261096, -1.96231081, -2.27654276, -1.88879586, -2.61311481])
In [83]:
o = f(h)
o
Out[83]:
array([0.07428819, 0.12321718, 0.0930844 , 0.13138183, 0.06829913])

Now let's see what the first part of the update line computes:

In [84]:
-l * (o - y[:5]) * f(h) * (1 - f(h))
Out[84]:
array([-0.01532628, -0.0399352 , -0.02357447, -0.04498013, -0.01303851])

This is the left part of the learning rule, i.e. the factor $-l\,(o - y)\,f'(h)$ computed for each of the five observations; the same factor appears in the update for every weight $w_i$.

By using the np.dot() function we do

  • a vector-matrix multiplication i.e. an inner product on the left part and each column of X,
  • effectively implementing the sum over the observations for each $w_i$ in one fell swoop!
In [85]:
np.dot(-l * (o - y[:5]) * f(h) * (1 - f(h)), X[:5,])
Out[85]:
array([-0.65674767, -0.43526293, -0.19373699, -0.02737092])
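
As a final check we can threshold the sigmoid outputs at 0.5 and compare the resulting 0/1 predictions with y, for example:

In [ ]:
# fraction of the 100 observations classified correctly by the trained weights
pred = (f(np.dot(X, w)) > 0.5).astype(int)
print(np.mean(pred == y))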

EXERCISES:

  • apply both the perceptron and the neural net code to other datasets from the UCI site
  • change some of the parameters and observe what happens
  • use the notebook as a notebook i.e. actually add some markdown cells with notes to your code
  • really try to completely understand the implementation above