The idea of connectionist computing was inspired by the neuron in the human brain and started with the Perceptron in the late 1950s. After initial high hopes, progress in the field slowed down for some time. In recent years a number of discoveries led to renewed interest in artificial neural nets and to the field of deep learning, which quickly achieved impressive results in many practical applications.
The perceptron is still a good model for the general idea of connectionist computing. In its simplest form the perceptron makes a yes/no decision based on a number of numeric inputs: these values are multiplied by weights and summed, and the output is determined by whether the sum exceeds a defined threshold $t$. The answer can be taken as whether or not the input belongs to a certain class.
$$ \begin{equation} o = \begin{cases} 1, & \text{if}\ \sum x_i w_i > t \\ 0, & \text{otherwise} \end{cases} \end{equation} $$In vector notation we can write this more concisely; the inner product (or dot product) combines the element-wise multiplication and the summation.
$$ o = {\bf x} \cdot {\bf w} > t $$The weights are initialized randomly and then updated after each observation according to a learning rule. For certain types of problems the perceptron weights will converge, given suitable parameters.
To train the perceptron the output $o$ is computed for each observation individually; it is compared with the proper answer $y$ to derive an update for the weights:
$$ {\bf \Delta w} = l (y - o) {\bf x} $$If the learning rate $l$ is chosen with care then the perceptron will converge to a final weight vector ${\bf w}$, provided that the problem is linearly separable, i.e. we can draw a line (or, in higher dimensions, a hyperplane) through the input space that partitions the observations ${\bf X}$ into the two classes. This is best shown with an example.
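As a first quick illustration, here is a toy sketch of one decision and one weight update on made-up numbers (all values below are arbitrary), before we turn to a real dataset:
import numpy as np
x = np.array([1.0, 2.0, 0.5])    # a made-up input vector
w = np.array([0.2, -0.1, 0.4])   # made-up initial weights
t, l, y = 0.5, 0.1, 1.0          # threshold, learning rate, desired answer
o = float(np.dot(x, w) > t)      # weighted sum is 0.2, not above the threshold, so o = 0
w = w + l * (y - o) * x          # the update pushes the weights towards the desired answer
print(o, w)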
The Iris dataset is often used to illustrate machine learning methods, and with good reason. It is a freely available dataset of moderate size that contains observations on three classes, species of Iris flowers.
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Download the file iris.data and put it into the same directory as the notebook.
We will use the Python package NumPy to import the data and the package matplotlib for plotting.
import numpy as np
data = np.genfromtxt('iris.data', delimiter=',', usecols=(0,1,2,3))
print(data[:5,])
print(data.shape)
There are 150 observations with 4 values each. The last column in the data file contains the name of the species; we will encode that separately.
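The species names could be read and encoded along the following lines (a sketch; the species strings are those used in the UCI file). Below, however, we will simply construct the class vector by hand from the known ordering of the rows.
species = np.genfromtxt('iris.data', delimiter=',', usecols=(4,), dtype=str)
# map the three species names to integer class labels 0, 1, 2
names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
labels = np.array([names.index(s) for s in species])
print(labels[:5], labels[145:])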
To illustrate the perceptron learning rule we will first focus on the species of Iris that are linearly separable, in this case the first 100 observations.
X = data[:100,]
We construct the proper classes from the knowledge that in this dataset the first 50 observations belong to class 0, and the next 50 to class 1.
y = np.zeros(100)
y[50:] = 1
To illustrate the idea of the data being linearly separable we plot these observations.
import matplotlib.pyplot as plt
plt.plot(X[:50,0], X[:50,1], 'o', c='b')
plt.plot(X[50:,0], X[50:,1], 'o', c='r')
plt.show()
plt.plot(X[:50,0], X[:50,1], 'o', c='b')
plt.plot(X[50:,0], X[50:,1], 'o', c='r')
plt.plot([4.5, 6.5], [2.1, 4.6], linewidth=1, c='grey')
plt.plot([4.6, 6.5], [2.1, 4.7], linewidth=1, c='grey')
plt.show()
Now we initialize the weights randomly.
np.random.seed(seed=1)
w = np.random.rand(4)
We set the threshold to 0.5, but other values such as 0 or 1 would also work, since with corresponding weights we can push the sum towards any given threshold.
t = 0.5
We are now ready to apply the perceptron decision rule to the first observation i.e. the first row in matrix ${\bf X}$.
To get a numeric value from the output we cast the boolean to float.
o = float(np.dot(X[0,], w) > t)
print(o)
print(y[0])
Since the correct answer for the first observation should be 0 we have an error.
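We can apply the learning rule once by hand to this observation, using the same learning rate of 0.02 that we will also use below:
l = 0.02
w = w + l * (y[0] - o) * X[0,]   # y[0] is 0 and o is 1, so a fraction of the input is subtracted
print(w)
print(np.dot(X[0,], w))          # the weighted sum has moved down, towards the threshold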
The learning rule describes the steps to train the perceptron; since these will be repeated many times we define a function.
def percep(X, y, t, steps, l):
    w = np.random.rand(4)                    # random initial weights
    for j in range(steps):
        for i in range(X.shape[0]):
            o = float(sum(X[i,] * w) > t)    # perceptron decision for observation i
            w += l * (y[i] - o) * X[i,]      # update the weights according to the learning rule
    return w
w = percep(X, y, t, 10, 0.02)
print(w)
The perceptron has now been trained, and we can apply the weights to the input data.
print((np.dot(X, w)>t)*1.0)
The two classes are identified perfectly. Sadly, the perceptron cannot perform in this manner if the problem is not linearly separable.
While the first two classes in the dataset are linearly separable, the last two are not. We can see this immediately by plotting; here we only plot the first and second column, but other combinations result in similar pictures.
We put the last two classes into our X matrix of observations. The y vector of correct classes does not need to be changed; we simply encode the new pair of classes in the same way, by 0 and 1.
X = data[50:,]
plt.plot(X[:50,0], X[:50,1], 'o', c='b')
plt.plot(X[50:,0], X[50:,1], 'o', c='r')
plt.show()
When we try to apply the perceptron algorithm to this task it cannot succeed completely, even with a very high number of epochs and a very small learning rate.
w = percep(X, y, t, 1000, 0.001)
print(w)
print((np.dot(X, w)>t)*1.0)
Since many practical problems fall into the latter category, a more sophisticated approach is needed. However, it builds on the ideas expressed in the perceptron.
The simplest case of an artificial neural net is very close to the perceptron.
The artificial neural net is typically illustrated with an image like the following:
This looks nice, but it is not particularly helpful. It is actually more practical to use mathematical notation.
For the computation on a single observation the net multiplies each input with its weight, sums the products, and applies an activation function $f$; the difference between this output and the target $t_n$ is the error $\epsilon_n$.
We have $N$ observations in the dataset which means we have $N$ values for $\epsilon$:
$$ \epsilon_n = f(x_{n,1} w_1 + x_{n,2} w_2 + x_{n,3} w_3) - t_n $$Note that $t_n$ here denotes the target value for observation $n$, not the threshold $t$ used in the perceptron.
Instead of using a specific given learning rule as in the perceptron we transform the learning task into an optimization problem:
We define a cost function $E({\bf \epsilon})$ that assigns a single cost value to the list of errors ${\bf \epsilon}$.
This allows us to choose among different optimization methods for finding the weights $w_i$ that minimize $E$.
The sigmoid function is a common choice for the activation function.
We use the Python lambda syntax for the function definition here; it brings the code closer to the mathematical notation.
f = lambda x: 1. / (1. + np.exp(-x))
xs = np.arange(-5, +5, .01)
plt.plot(xs, f(xs), lw=1)
plt.show()
We choose a cost function that sums the squares of the errors over all observations:
$$ E = \sum_n^N \frac{1}{2} (a_n - t_n)^2 $$This particular function has some desirable properties: errors of opposite sign cannot cancel each other out, larger errors are penalized more strongly, and the function is differentiable.
The last property allows us to find the gradient of the error function.
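As a quick illustration (a sketch, reusing the sigmoid f defined above and the 0/1 targets y for the first 100 observations), the cost for a fresh set of random weights:
w_init = np.random.rand(4) - 0.5       # small random weights
a = f(np.dot(data[:100,], w_init))     # activations for the first 100 observations
print(np.sum(0.5 * (a - y)**2))        # the cost E for these weights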
The derivative of a function $f$ describes how much $f(x)$ changes in response to a small change in $x$.
In the plot below the function and its derivative are:
$$ \begin{align} g(x) & = 2 x^4 + x^3 - 3 x^2 + 1 \\ g'(x) & = 8 x^3 + 3 x^2 - 6 x \end{align} $$Remember the derivative of polynomials: $\frac{d}{dx} x^n = n x^{n-1}$.
g = lambda x: 2*x**4 + x**3 - 3*x**2 + 1
g_ = lambda x: 8 * x**3 + 3 * x**2 - 6 * x
xs = np.arange(-1.6, +1.4, 0.01)
ys = g(xs)
plt.plot(xs, ys, lw=1)
plt.plot(xs, g_(xs), '--')
x0 = -1.5
y0 = g(x0)
dx = 0.05
dy = g_(x0)
plt.arrow(x0, y0, dx, dy*dx, linewidth=5)
plt.text(-1.5, 1.3, '$x_0$')
x1 = 1.2
plt.arrow(x1, g(x1), -dx, -dx*g_(x1), linewidth=5)
plt.text(1.25, 2.8, '$x_1$')
plt.ylim(-2, 5)
plt.show()
The arrow at $x_0$ points in the direction of descent, i.e. towards lower values of $g$.
By taking small steps in the direction of the arrow we can approach a minimum.
The plot also illustrates a basic problem of this approach: depending on the starting point, gradient descent may end up in a local minimum rather than the global one, as the arrows at $x_0$ and $x_1$ suggest.
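A minimal sketch of this stepping procedure, reusing the derivative g_ defined above (the step size and the number of steps are arbitrary choices):
def descend(x, steps=200, l=0.01):
    # repeatedly take a small step against the derivative
    for _ in range(steps):
        x -= l * g_(x)
    return x
print(descend(-1.5), descend(1.2))     # the two starting points end up in different minima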
The gradient of a function is the collection of all its partial derivatives into a vector.
The gradient of $f(x,y)$ at point $x_0,y_0$ is
$$ \nabla f(x_0, y_0) = \left[ \begin{array}{c} \frac{\delta f}{\delta x} (x_0,y_0) \\ \frac{\delta f}{\delta y} (x_0,y_0) \\ \end{array} \right] $$When we compute the gradient at a given point we get a vector as a result.
This vector points in the direction of steepest ascent (or of steepest descent if multiplied by $-1$).
For our error function with two weights $v,w$ as parameters:
$$ \nabla E(v_0, w_0) = \left[ \begin{array}{c} \frac{\delta E}{\delta v} (v_0,w_0) \\ \frac{\delta E}{\delta w} (v_0,w_0) \\ \end{array} \right] $$To illustrate this in a plot we assume the following very simple error function:
$$ E(v,w) = v^2 + w ^2 $$The gradient is
$$ \nabla E = \left[ \begin{array}{c} \frac{\delta E}{\delta v} \\ \frac{\delta E}{\delta w} \\ \end{array} \right] = \left[ \begin{array}{c} 2 v \\ 2 w \\ \end{array} \right] $$
E = lambda v,w: v**2 + w**2
E_ = lambda v,w: [2*v, 2*w]
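As a quick numerical sanity check (a sketch), the partial derivatives can be approximated by finite differences:
eps = 1e-6
v0, w0 = 1.5, -0.5
print(E_(v0, w0))                                  # analytic gradient: [3.0, -1.0]
print([(E(v0 + eps, w0) - E(v0, w0)) / eps,        # finite-difference approximations
       (E(v0, w0 + eps) - E(v0, w0)) / eps])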
from mpl_toolkits.mplot3d import Axes3D
vs = np.linspace(-3, 3, 30)
ws = np.linspace(-3, 3, 30)
Xg, Yg = np.meshgrid(vs, ws)
Zg = [ [ E(v, w) for w in ws ] for v in vs ]
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(Xg, Yg, Zg, 50, cmap='binary')
for g1, g2 in [[2,-2],[2,0],[2,2],[0,2],[-2,2]]:
    # draw the gradient direction at a few sample points
    ax.quiver(g1, g2, 0, E_(g1,g2)[0], E_(g1,g2)[1], 0,
              color='red', lw=0.5, length=1, normalize=True)
ax.view_init(50, 50)
plt.xlabel('v')
plt.ylabel('w')
plt.show()
Our cost function is
$$ E = \sum_n^N \frac{1}{2} (a_n - t_n)^2 $$Since the derivative of a sum is the sum of the derivatives we drop the sum for now and come back to it when we calculate weight updates from several observations.
This leaves us with
$$ E = \frac{1}{2} (a - t)^2 $$We want to find the derivative with respect to weight $w_i$:
$$ \frac{\delta E}{\delta w_i} = \frac{\delta}{\delta w_i} \frac{1}{2} (a - t)^2 $$Let
$$ \begin{align} h & = \sum w_i x_i \\ a & = f(h) \end{align} $$To get the derivative of a composite function $F(x) = f(g(x))$ we use the chain rule $F'(x) = f'(g(x)) g'(x)$
$$ \begin{align} \frac{\delta E}{\delta w_i} & = (a - t) \frac{\delta a}{\delta w_i} \\ & = (a - t) f'(h) \frac{\delta}{\delta w_i} \sum_j w_j x_j \end{align} $$Note that in the sum all terms $ j \ne i $ become constants with respect to $w_i$ and do not contribute to the derivative:
$$ \frac{\delta}{\delta w_i} \sum_j w_j x_j = \frac{\delta}{\delta w_i} [ w_1 x_1 + ... + w_i x_i + ... w_n x_n ] = x_i $$This gives us the result:
$$ \frac{\delta E}{\delta w_i} = (a - t) f'(h) x_i $$For a specific activation function $f$ such as the sigmoid we can substitute $f'(h)$ with the derivative:
$$ \frac{\delta E}{\delta w_i} = (a - t) f(h) (1 - f(h)) x_i $$Having found the gradient, i.e. the direction in weight space in which the error increases, we multiply it by $-1$ and by a small learning rate $l$ to compute the weight updates:
$$ \Delta w_i = -l \frac{\delta E}{\delta w_i} = - l (a - t) f'(h) x_i $$Now we can re-introduce the outer sum over the observations ${\bf x}_n$ that we dropped earlier.
In our implementation we perform a step-wise gradient descent in the direction of lower error by summing over $N$ input observations:
$$ w_i \leftarrow w_i + \sum_n \Delta w_{i,n} $$where $\Delta w_{i,n}$ is the update computed from observation $n$. The dot product provides a neat way to write, and (in Python) to compute, the weighted sum of the inputs:
$$ h = \sum_i w_i x_i = {\bf w} \cdot {\bf x}$$where ${\bf w}$ is the weight vector and ${\bf x}$ is the input vector.
Since we have a number of inputs we put them in a matrix ${\bf X}$ such that each line becomes an input vector.
Then we can write matrix-vector multiplication, omitting the dot:
$$ {\bf h} = {\bf X w} $$The dot() function in the Python package numpy performs exactly that computation.
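A quick check that the matrix-vector product gives the same numbers as the explicit row-by-row sums (a sketch with an arbitrary weight vector):
wv = np.random.rand(4)                                     # an arbitrary weight vector
h_dot = np.dot(data[:100,], wv)                            # matrix-vector product
h_sum = np.array([sum(row * wv) for row in data[:100,]])   # explicit sums, one row at a time
print(np.allclose(h_dot, h_sum))                           # True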
Another useful numpy feature is the automatic element-wise computation for vectors:
def f(x):
    return 1. / (1. + np.exp(-x))

def nn1(X, y, l=0.01, epochs=100):
    np.random.seed(1)
    w = np.random.rand(X.shape[1]) - 0.5
    for ep in range(epochs):
        h = np.dot(X, w)                                    # weighted input sums for all observations
        a = f(h)                                            # activations
        e = a - y                                           # errors
        w += np.dot(-l * (a - y) * f(h) * (1 - f(h)), X)    # gradient descent step for the weights
        if ep % (epochs/10) == 0: print(w, sum(e**2))
    return w, sum(e**2)
nn1(data[:100,], y)
We can see that with a proper choice of learning rate the sum of squared errors goes towards zero; however, since the output function is not a 0/1 step function as in the case of the perceptron, we never reach a sum of squares of exactly zero.
It is interesting to take a look at the search space of our optimization problem. We will produce a plot that shows the value of the cost function, i.e. the sum of squared errors, for a large number of parameter (weight) combinations.
Since it is difficult to produce a surface plot for more than two parameters we limit this task to the first two input columns of the dataset:
nn1(data[:100,:2], y)
With fewer input data values the algorithm performs somewhat worse, which is to be expected.
We can easily compute the errors for given weights:
$$ \begin{align} a_n & = f(\sum_i w_i x_{n,i}) \\ E & = \sum_n (a_n - t_n)^2 \end{align} $$The following code plots $E$ for a suitable range of values for $w_1$ and $w_2$:
from mpl_toolkits.mplot3d import Axes3D
def plotE(X, y, g, view=None):
    w1 = np.linspace(g[0], g[1], 30)
    w2 = np.linspace(g[2], g[3], 30)
    Xg, Yg = np.meshgrid(w1, w2)
    n = w1.shape[0]
    E = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            h = np.dot(X, [Xg[i,j], Yg[i,j]])
            E[i,j] = sum((f(h) - y)**2)
    fig = plt.figure()
    ax = plt.axes(projection='3d')
    if view: ax.view_init(view[0], view[1])
    ax.plot_wireframe(Xg, Yg, E, rstride=1, cstride=1)
    plt.show()
plotE(data[:100,:2], y, [0, 3, -4, -1])
This surface shows a valley that is slowly flattening out.
A different view shows this more clearly.
plotE(data[:100,:2], y, [1, 10, -12, -2], [0,0])
This leads us to the idea of going to much higher number of epochs in order to arrive at a better solution:
w, se = nn1(data[:100,:2], y, epochs=20000)
During the first few hundred epochs we see a large change in the error sum, but after that progress slows down considerably.
The problem is the gradient of the activation function.
A problem with the sigmoid function is the vanishing gradient: the derivative of the sigmoid goes towards zero quickly as we move outside roughly [-5, 5]. This becomes obvious when we plot the function over a broader range:
import numpy as np
xs = np.arange(-10, +10, .01)
ys = 1. / (1. + np.exp(-xs))
plt.plot(xs, ys, lw=1)
plt.show()
Beyond [-5, 5] the plot is practically flat; the derivative below shows the actual value of $f'$.
f = lambda x: 1. / (1. + np.exp(-x))
xs = np.arange(-10, +10, .01)
plt.plot(xs, f(xs)*(1 - f(xs)), lw=1)
plt.show()
Let's look at the average input sums for the output node:
np.mean(np.dot(data[:50,:2], w) )
np.mean(np.dot(data[50:100,:2], w))
We find that these average input sums already lie outside the [-5, +5] range, beyond which the derivative of the sigmoid function is practically zero.
What does this mean for our weight updates?
h = np.dot(data[:50,:2], w)
a = f(h)
l = 0.005
np.dot(-l * (a - y[:50]) * f(h) * (1 - f(h)), data[:50,:2])
The weight updates have become very small.
One obvious solution seems to be to simply increase the learning rate:
nn1(data[:100,:2], y, l=0.02)
This looks good, let's increase it further:
nn1(data[:100,:2], y, l=0.05)
We have overshot the target. The learning rate is a fickle parameter that needs to be chosen carefully.
Some gradient descent approaches change the learning rate during the training phase.
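A minimal sketch of one such variant (an assumption for illustration, not a method used above): the learning rate is multiplied by a decay factor after each epoch, so that early epochs take larger steps and later epochs only fine-tune.
def nn1_decay(X, y, l=0.05, decay=0.999, epochs=1000):
    # same network as nn1, but the learning rate shrinks a little after every epoch
    np.random.seed(1)
    w = np.random.rand(X.shape[1]) - 0.5
    for ep in range(epochs):
        a = f(np.dot(X, w))
        w += np.dot(-l * (a - y) * a * (1 - a), X)   # a * (1 - a) is the sigmoid derivative f'(h)
        l *= decay
    return w, sum((a - y)**2)
print(nn1_decay(data[:100,:2], y))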
EXERCISES: