In this section we will implement a two-layer feed-forward network in pure Python and then compare it to a version in the Keras framework.
For a data file we will again use our trusted Iris data, but in order to see that our implementation works for any number of output nodes we add another dimension to the y vector: the values in the first column are 1 if the observation belongs to the first class, and so on.
import numpy as np
np.random.seed(1)
data = np.genfromtxt('iris.data', delimiter=',', usecols=(0,1,2,3))
X = data[:100,]
y = np.zeros((len(X),2))
y[:50,0] = 1
y[50:,1] = 1
print(y[:3,])
print(y[-3:,])
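Incidentally, such a one-hot matrix can also be built by indexing an identity matrix; a minimal sketch with hypothetical integer labels (not the Iris file, so it stands alone):

```python
import numpy as np

# hypothetical integer class labels: 0 for the first class, 1 for the second
labels = np.array([0, 0, 1, 1])

# row l of the 2x2 identity matrix is exactly the one-hot vector for class l,
# so fancy indexing gives one such row per observation
y_onehot = np.eye(2)[labels]
print(y_onehot)
```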
We will not go into the derivation of the gradient for the two-layer net but simply state that, with index $i$ for the input layer, $j$ for the hidden layer, $k$ for the output layer, $y_k$ the target output, and $l$ the learning rate, we have
\begin{align} z_j & = \sum_i x_i w_{ij} \\ a_j & = f_j(z_j) \\ z_k & = \sum_j a_j w_{jk} \\ a_k & = f_k(z_k) \end{align}
\begin{align} \delta_k & = (a_k-y_k) f'_k(z_k) \\ \delta_j & = f'_j(z_j) \sum_k \delta_k w_{jk} \\ w_{jk} & \leftarrow w_{jk} - l \delta_k a_j \\ w_{ij} & \leftarrow w_{ij} - l \delta_j a_i \end{align}
This translates beautifully to the following code:
f = lambda x: 1. / (1. + np.exp(-x))
f_ = lambda x: f(x) * (1 - f(x))
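A quick sanity check on `f_`: compare it against a central finite difference of `f` (the `eps` value here is our own choice):

```python
import numpy as np

f = lambda x: 1. / (1. + np.exp(-x))
f_ = lambda x: f(x) * (1 - f(x))

# central finite difference as an independent estimate of the derivative
x = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(np.max(np.abs(numeric - f_(x))))  # very close to zero
```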
def nn2(X, y, l=0.01, epochs=100, h=3):
    if y.ndim == 1: y = np.reshape(np.ravel(y), (len(y), 1))
    Wij = np.random.rand(h, X.shape[1]) - 0.5
    Wjk = np.random.rand(y.shape[1], h) - 0.5
    for ep in range(epochs):
        zj = np.dot(Wij, X.T)
        aj = f(zj)
        zk = np.dot(Wjk, aj)
        ak = f(zk)
        dk = (ak - y.T) * f_(zk)
        dj = f_(zj) * np.dot(dk.T, Wjk).T
        Wjk += np.dot(-l * dk, aj.T)
        Wij += np.dot(-l * dj, X)
        if ep % (epochs // 10) == 0:
            print(np.sum(abs(ak - y.T), axis=1) / len(X))
    shapes('X,y,Wij,Wjk,zj,aj,zk,dk,dj', (X, y, Wij, Wjk, zj, aj, zk, dk, dj))
    return Wij, Wjk
The shapes function shows us the shapes of the arrays involved, and even peeks into some values.
def shapes(names, values):
    names = names.split(',')
    for i in range(len(names)):
        print(names[i] + ':', values[i].shape, np.ravel(values[i])[:5])
Wij, Wjk = nn2(X, y)
Let us go through this step by step:
After initializing the weights we calculate the sum of inputs for the hidden layer.
Wij is 3x4 and X is 100x4, so for the dot product we must transpose X; this gives us the observations as columns instead of rows. Now we can use dot() to compute the weighted input sums, again with columns for observations:
zj = np.dot(Wij, X.T)
zj.shape
Applying the activation function is simple; it broadcasts automatically over all values:
aj = f(zj)
aj.shape
We have the same structure here as for X, with columns for observations. Therefore, the next steps are nothing new, except that the output layer now produces one row per output node, again with one column per observation:
zk = np.dot(Wjk, aj)
ak = f(zk)
ak.shape
Now for the backpropagation of errors: we need to compute the values for the learning rule, starting with dk:
dk = (ak - y.T) * f_(zk)
dk.shape
Nothing tricky there. The next one is a little more involved: when we multiply with dk we want the sum over k, so we need to do a little transposing:
dk.T.shape
Wjk.shape
np.dot(dk.T, Wjk).shape
np.dot(dk.T, Wjk).T.shape
The sum over k has been implemented by using the dot() product.
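To convince yourself of this, compare the dot product against an explicit triple loop on small random arrays (the sizes here are arbitrary):

```python
import numpy as np
np.random.seed(1)

dk = np.random.rand(2, 5)    # 2 output nodes, 5 observations
Wjk = np.random.rand(2, 3)   # weights from 3 hidden nodes to 2 output nodes

# explicit sum over k for each hidden node j and observation n
manual = np.zeros((3, 5))
for j in range(3):
    for n in range(5):
        for k in range(2):
            manual[j, n] += dk[k, n] * Wjk[k, j]

print(np.allclose(manual, np.dot(dk.T, Wjk).T))  # True
```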
Now we can element-wise multiply with f_(zj), since
f_(zj).shape
dj = f_(zj) * np.dot(dk.T, Wjk).T
dj.shape
Something similar happens with the weight update: again the dot() product gives us a sum, this time over all observations (or over a minibatch).
aj.shape
With the transpose of aj we can use the dot() product to sum over the observations:
l = 0.01
np.dot(-l * dk, aj.T).shape
And the weight update for Wij is even simpler, since both dj and X are already in the proper shape for the dot() product:
np.dot(-l * dj, X).shape
If you are not sure about this implementation, just try each step for yourself, add some more printout, and really get into the details. NumPy is a very powerful package, but like all powerful tools it needs some practice for good effect.
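As a starting point for such experiments, here is a minimal sketch of a prediction function for the trained net; the name `predict` is our own, and for a self-contained example we use tiny hand-picked weights rather than the trained ones:

```python
import numpy as np

f = lambda x: 1. / (1. + np.exp(-x))

def predict(X, Wij, Wjk):
    # same forward pass as in nn2, observations as columns
    aj = f(np.dot(Wij, X.T))
    ak = f(np.dot(Wjk, aj))
    # the output node with the highest activation indicates the class
    return np.argmax(ak, axis=0)

# hand-picked weights: 2 inputs, 2 hidden nodes, 2 output nodes
Wij = np.array([[1., 0.], [0., 1.]])
Wjk = np.array([[1., -1.], [-1., 1.]])
X = np.array([[5., 0.], [0., 5.]])
print(predict(X, Wij, Wjk))  # -> [0 1]
```

With the weights returned by nn2 and the two-column y, `np.mean(predict(X, Wij, Wjk) == np.argmax(y, axis=1))` gives the training accuracy.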
Just to make sure that our code really works with any number of output nodes, including a single one, we apply it to the original one-dimensional y vector:
y = y[:,0]
nn2(X,y)
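With a single output node there is no argmax to take; instead the sigmoid output is thresholded at 0.5. A self-contained sketch with hypothetical hand-picked weights:

```python
import numpy as np

f = lambda x: 1. / (1. + np.exp(-x))

# hypothetical trained weights for a 2-feature, 2-hidden, 1-output net
Wij = np.array([[ 2., -2.],
                [-2.,  2.]])
Wjk = np.array([[ 4., -4.]])

X = np.array([[ 1., -1.],    # should map to class 1
              [-1.,  1.]])   # should map to class 0
ak = f(np.dot(Wjk, f(np.dot(Wij, X.T))))

# with one output node we threshold at 0.5 instead of taking an argmax
pred = (ak > 0.5).astype(int).ravel()
print(pred)  # -> [1 0]
```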
The above is a very nice programming exercise; however, in practical applications we will want to use a package like Keras for any neural architecture that is a little more complex, for several reasons, not least performance.
Performance is of particular interest for deep learning applications, which tend to be very costly in terms of computing power. How much faster can we get? As a ballpark figure, for a problem with many thousands of observations a mid-range GPU easily yields a speedup of five times out of the box.
For the code below you may have to install the package keras first, using pip3 or conda in the usual manner.
The model is sequential in the sense that we stack layers, each of which has one input and one output; within this structure we can still use recurrent components such as LSTM layers.
The basic layer is Dense which means that each unit is connected to every unit in the next layer.
As this is the first layer after the input layer the units parameter is the number of units in the hidden layer.
We have a choice of activation functions; among the more common are sigmoid, tanh, relu, and softmax.
The next layer is our output layer. It has to conform to the number of output values; since we reduced y to a single column above, that means a single unit here.
The compile() function creates the model with the computation graph in the format required by the backend. There is a configuration file in $HOME/.keras where you can set the backend; TensorFlow, Theano, and CNTK should be supported.
The accuracy metric in this case means that the sigmoid output is thresholded at 0.5 and compared with the target class.
The batch size is the size of the minibatch; usually something like 64 or 32, but for this very small problem a smaller value works better.
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(units=16, input_dim=X.shape[1], activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X, y, epochs=10, batch_size=5)
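Once fitted, the model can be queried with evaluate() and predict(); a sketch on small synthetic stand-in data (so it runs without the Iris file):

```python
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

# tiny synthetic stand-in for X and y, so the sketch is self-contained
np.random.seed(1)
X = np.random.rand(20, 4)
y = (X[:, 0] > 0.5).astype(float)

model = Sequential()
model.add(Dense(units=16, input_dim=X.shape[1], activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=5, verbose=0)

# evaluate() returns the loss and each metric; predict() the raw sigmoid outputs
loss, acc = model.evaluate(X, y, verbose=0)
probs = model.predict(X, verbose=0)
print(loss, acc, probs.shape)
```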
The beauty of this approach is that now we can easily add more layers to our net.
The only thing we have to specify is the number of units in the new layer; input and output shapes follow automatically.
We also use the validation split parameter to get a more realistic idea of the performance.
model = Sequential()
model.add(Dense(units=16, input_dim=X.shape[1], activation="sigmoid"))
model.add(Dense(8))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X, y, epochs=10, batch_size=5, validation_split=0.1)
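The fit() call returns a History object whose .history dict holds one list per metric and epoch, which is handy for spotting over- and underfitting; a self-contained sketch on the same kind of synthetic stand-in data:

```python
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

np.random.seed(1)
X = np.random.rand(20, 4)           # synthetic stand-in data
y = (X[:, 0] > 0.5).astype(float)

model = Sequential()
model.add(Dense(units=16, input_dim=X.shape[1], activation="sigmoid"))
model.add(Dense(8))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

# .history maps each metric name to a list with one entry per epoch
history = model.fit(X, y, epochs=10, batch_size=5,
                    validation_split=0.1, verbose=0)
print(sorted(history.history.keys()))
print(len(history.history['val_loss']))   # one entry per epoch
```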
EXERCISES:
The last point requires not only a compatible card but also a fair amount of installation and configuration; expect to invest a few hours before everything works.
However, once everything is set up your Keras code will run without any change on the GPU.
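Assuming a TensorFlow 2.x backend, a quick way to check whether a GPU is visible (an empty list means Keras will fall back to the CPU):

```python
import tensorflow as tf

# lists all GPUs TensorFlow can see; empty on a CPU-only installation
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
```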