In 1958 Frank Rosenblatt presented the perceptron algorithm, which was designed to be implemented in hardware and was hailed as the beginning of a new era of intelligent machines. Sadly, it soon turned out that the approach was severely limited in the type of problems it could learn. However, it provided the foundations for later developments in neural net models, and it still serves as an excellent example of a simple connectionist approach.
The perceptron uses a weight vector ${\bf w}$ and a threshold $b$ to make a 0/1 decision on a numeric input vector ${\bf x}$.
This can be interpreted as whether or not an observation belongs to a certain class. The output $o$ is computed as
$$ o = \begin{cases} 1, & \text{if}\ \sum x_i w_i > b \\ 0, & \text{otherwise} \end{cases} $$
The perceptron starts with random weights. It uses supervised learning to update the weights after each observation according to a simple rule:
$$ {\bf w} \leftarrow {\bf w} + l (y - o) {\bf x} $$
where $y$ is the correct class of observation ${\bf x}$ and $l$ is the learning rate.
The observations are presented and the learning rule is applied until the weights no longer change (or until a maximum number of epochs is reached). Given a suitable choice of the learning rate $l$, the perceptron is guaranteed to converge on a linearly separable problem.
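To make the update rule concrete, here is a minimal sketch of a single learning step; the observation, weights, threshold and learning rate below are made-up values for illustration only:
import numpy as np
x = np.array([1.0, 2.0])       # illustrative observation
w = np.array([0.1, -0.2])      # illustrative initial weights
b, l, y = 0.5, 0.1, 1          # threshold, learning rate, correct class (all made up)
o = float(np.sum(x * w) > b)   # decision rule: the weighted sum is -0.3, not above b, so o = 0
w = w + l * (y - o) * x        # learning rule: w becomes [0.2, 0.0]
print(o, w)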
The implementation of the perceptron algorithm is straightforward.
We will use the Iris dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Put this file into the current directory of the notebook. On a notebook server you first have to download the file to your computer and then upload it to the server (using the Upload button on the directory tab in the browser -- the one you see when you connect to the notebook server).
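Alternatively, if the notebook server has internet access, a small sketch using Python's standard library can download the file directly (this is just a convenience, not required):
import urllib.request
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
urllib.request.urlretrieve(url, 'iris.data')   # save as iris.data in the current directory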
import numpy as np
data = np.genfromtxt('iris.data', delimiter=',', usecols=(0,1,2,3))
print(data[:5,])
print(data.shape)
We construct the X matrix and y vector using our knowledge of the data file:
X = data[:100,]
y = np.zeros(100)
y[50:] = 1
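As a quick sanity check (an optional sketch of ours, not needed for the rest of the notebook) we can read the label column as strings and confirm that the first 100 rows contain exactly the first two classes:
labels = np.genfromtxt('iris.data', delimiter=',', usecols=(4,), dtype=str)
print(np.unique(labels[:100]))   # expect 'Iris-setosa' and 'Iris-versicolor'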
The first two classes are linearly separable.
This is illustrated in the plot below:
import matplotlib.pyplot as plt
plt.plot(X[:50,0], X[:50,1], 'o', c='y')
plt.plot(X[50:,0], X[50:,1], 'o', c='k')
plt.plot([4.5, 6.5], [2.1, 4.6], linewidth=1, c='grey')
plt.show()
We can find a straight line that completely separates the yellow and black dots.
Since this may not be immediately obvious, we have added such a line to the plot.
The plot above only shows columns 0 and 1, but in this dataset the two classes are linearly separable for every combination of columns:
c = [[[0,1],[0,2],[0,3]],[[1,2],[1,3],[2,3]]]
fig, axs = plt.subplots(2, 3, figsize=(12,10))
for i in range(2):
    for j in range(3):
        k, l = c[i][j][0], c[i][j][1]
        axs[i][j].plot(X[:50,k], X[:50,l], 'o', c='y')
        axs[i][j].plot(X[50:,k], X[50:,l], 'o', c='b')
This situation leads us to believe that the perceptron will learn to separate the first two classes in the dataset perfectly. Let us code!
We start by setting the random seed; we may want to make changes later, and we want to see the effects of those changes, not the effects of the randomness in the initial weights.
np.random.seed(seed=3)
The perceptron algorithm is rather easy to implement. For displaying the result of the decision rule we define a pretty-print function, using Python's lambda notation. A lambda expression such as
f = lambda x,y: x+y
simply defines f as a function of x and y whose result is the value of the expression x+y.
pretty = lambda X,w,b: ''.join([ '01'[c*1] for c in np.dot(X,w)>b ])
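To see what the indexing trick in the pretty function does, here is a tiny illustration with made-up numbers: comparing against the threshold yields booleans, multiplying by 1 turns them into 0/1 integers, and those index into the string '01'.
flags = np.array([0.3, 0.7, 0.9]) > 0.5       # [False, True, True]
print(''.join([ '01'[c*1] for c in flags ]))   # '011'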
def percep(X, y, b, l, epochs=5):
    w = np.random.rand(4)                  # random initial weights
    for j in range(epochs):
        for i in range(X.shape[0]):
            o = float(sum(X[i,] * w) > b)  # decision rule for observation i
            w += l * (y[i] - o) * X[i,]    # perceptron learning rule
        if j % (epochs/5) == 0:            # report five times during training
            print(w)
            print(pretty(X, w, b))
    return w
b = 0.5
w = percep(X, y, b, 0.02)
The decision rule itself is so simple it does not even need a defined function.
np.dot() computes the matrix-vector product, i.e. it applies the decision rule to all observations at once:
np.dot(X,w)>b
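Comparing these predictions with y gives a simple accuracy check; this is a small addition of ours, not part of the recipe above:
pred = (np.dot(X, w) > b).astype(float)
print(np.mean(pred == y))   # fraction of correctly classified observations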
As mentioned above, the perceptron does not fare so well with problems that are not linearly separable, such as classes 2 and 3 in the Iris dataset:
X = data[50:,]
Note that we do not need to change our y vector; the two classes now in X keep the same 0/1 encoding.
Plotting the situation for the first two columns immediately shows the difference here:
plt.plot(X[:50,0], X[:50,1], 'o', c='y')
plt.plot(X[50:,0], X[50:,1], 'o', c='k')
plt.show()
We cannot find a straight line to separate the two classes.
Correspondingly, the perceptron cannot find a perfect solution:
w = percep(X, y, b, 0.01, epochs=5000)
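To quantify how far from perfect the result is, a short sketch (again our own addition) counts the remaining misclassifications:
errors = np.sum((np.dot(X, w) > b) != y)
print(errors, 'misclassified observations out of', len(y))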
While the perceptron used a rather ad hoc learning rule, the artificial neural net introduces a number of improvements, most notably a differentiable activation function and a weight update derived from a cost function by gradient descent.
The single-layer neural net is very similar to the perceptron as far as the decision rule is concerned:
$$ o = f(\sum w_i x_i) $$
For $f$ the traditional choice is the sigmoid function:
f = lambda x: 1 / (1 + np.exp(-x))
This function is bounded between 0 and 1, and it has a nice derivative:
$$ f'(x) = f(x)(1 - f(x)) $$
Let us plot the sigmoid function:
xs = np.linspace(-5, 5, 30)
plt.plot(xs, f(xs))
plt.show()
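We can also verify the derivative formula numerically; a minimal sketch comparing a central finite difference with f(x)(1 - f(x)) at an arbitrary point:
x0, eps = 0.7, 1e-6
numeric  = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)   # finite-difference estimate of f'(x0)
analytic = f(x0) * (1 - f(x0))                       # the claimed derivative
print(numeric, analytic)                             # the two values should agree closely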
Let's first go back to the easy version of our learning problem, the linearly separable case, and initialize the weights randomly:
np.random.seed(1)
X = data[:100,]
w = np.random.rand(4) - 0.5
w
Applying the decision rule shows that we are far from the desired 0/1 vector:
np.dot(X, w)
Our cost function is the sum of squares of the errors over all observations:
$$ \begin{align} h & = \sum w_i x_i \\ o & = f(h) \\ E({\bf w}) & = \frac{1}{2} \sum_{n=1}^N (o_n - y_n)^2 \end{align} $$
Without going into the details here, it can be shown (by applying the chain rule to $E$) that the weight update for $w_i$ can be found via the derivative of the error function:
$$ \Delta w_i = - l (o - y) f'(h) x_i $$
When we sum over all observations (or a subset) we get our learning rule:
$$ w_i \leftarrow w_i + \sum_n \Delta w_i $$
All the above leads to rather neat code:
def nn1(X, y, l=0.01, epochs=50):
    w = np.random.rand(X.shape[1]) - 0.5                 # small random initial weights
    for ep in range(epochs):
        h = np.dot(X, w)                                 # weighted sums for all observations
        o = f(h)                                         # sigmoid outputs
        w += np.dot(-l * (o - y) * f(h) * (1 - f(h)), X) # gradient-based weight update
        if ep % (epochs/5) == 0: print(w, sum((o-y)**2)) # report five times during training
    return w, sum((o-y)**2)
w, e = nn1(data[:100,], y)
Since our output function can never produce exactly zero or one, the sum of squared errors remains above zero.
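We can still turn the continuous outputs into 0/1 predictions and check the classification; a small sketch, with the 0.5 cut-off being our own choice:
pred = (f(np.dot(X, w)) > 0.5).astype(float)
print(np.mean(pred == y))   # fraction of observations classified correctly after thresholding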
But how does the code do this? The real beauty is in the weight-update statement.
Let's go through this slowly, using only the first five observations, i.e. X[:5,] and y[:5]:
h = np.dot(X[:5,], w)
h
o = f(h)
o
Now let's see what the first part of the update line computes (the learning rate l is not defined outside nn1, so we first set it to the default value used there):
l = 0.01
-l * (o - y[:5]) * f(h) * (1 - f(h))
This is the $-l\,(o - y)\,f'(h)$ part of the learning rule, evaluated for each of the five observations.
Multiplying by the observations and summing with np.dot() then gives the complete weight update vector:
np.dot(-l * (o - y[:5]) * f(h) * (1 - f(h)), X[:5,])
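As a sanity check, the same update can be computed with an explicit loop over the five observations; this sketch should reproduce the np.dot() result above:
delta = np.zeros(4)
for n in range(5):
    delta += -l * (o[n] - y[n]) * f(h[n]) * (1 - f(h[n])) * X[n,]
print(delta)   # the same weight update vector as above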
EXERCISES: