Neural Nets in pure Python and Keras

In this section we will implement a two-layer feed-forward network in pure Python and then compare it to a version in the Keras framework.

For a data file we will again use our trusted Iris data, but in order to see that our implementation works for any number of output nodes we add another dimension to the y vector: the values in the first column are 1 if the observation belongs to the first class, and so on.

In [93]:
import numpy as np
np.random.seed(1)

data = np.genfromtxt('iris.data', delimiter=',', usecols=(0,1,2,3))
X = data[:100,] 
y = np.zeros((len(X),2))
y[:50,0] = 1
y[50:,1] = 1
print(y[:3,])
print(y[-3:,])
[[1. 0.]
 [1. 0.]
 [1. 0.]]
[[0. 1.]
 [0. 1.]
 [0. 1.]]

We will not go into the derivation of the gradient for the two-layer net but simply state that, with index $i$ for the input layer, $j$ for the hidden layer, $k$ for the output layer, $y_k$ the target output, $f_j, f_k$ the activation functions (with derivatives $f'_j, f'_k$), and $l$ the learning rate, we have

\begin{align}
z_j & = \sum_i x_i w_{ij} \\
a_j & = f_j(z_j) \\
z_k & = \sum_j a_j w_{jk} \\
a_k & = f_k(z_k)
\end{align}

\begin{align}
\delta_k & = (a_k - y_k) \, f'_k(z_k) \\
\delta_j & = f'_j(z_j) \sum_k \delta_k w_{jk} \\
w_{jk} & \leftarrow w_{jk} - l \, \delta_k a_j \\
w_{ij} & \leftarrow w_{ij} - l \, \delta_j a_i
\end{align}

This translates beautifully to the following code:

In [94]:
f = lambda x: 1. / (1. + np.exp(-x))   # logistic sigmoid
f_ = lambda x: f(x) * (1 - f(x))       # its derivative

def nn2(X, y, l=0.01, epochs=100, h=3):
    # make sure y is 2-dimensional, even for a single output node
    if y.ndim == 1: y = np.reshape(np.ravel(y), (len(y), 1))
    # weights input->hidden and hidden->output, initialized in [-0.5, 0.5)
    Wij = np.random.rand(h, X.shape[1]) - 0.5
    Wjk = np.random.rand(y.shape[1], h) - 0.5
    for ep in range(epochs):
        # forward pass
        zj = np.dot(Wij, X.T)
        aj = f(zj)
        zk = np.dot(Wjk, aj)
        ak = f(zk)
        # backpropagation: deltas for output and hidden layer
        dk = (ak - y.T) * f_(zk)
        dj = f_(zj) * np.dot(dk.T, Wjk).T
        # gradient descent weight updates over the whole batch
        Wjk += np.dot(-l * dk, aj.T)
        Wij += np.dot(-l * dj, X)
        # report the mean absolute error per output node ten times during training
        if ep % (epochs/10) == 0:
            print(np.sum(abs(ak - y.T), axis=1) / len(X))
    shapes('X,y,Wij,Wjk,zj,aj,zk,dk,dj', (X, y, Wij, Wjk, zj, aj, zk, dk, dj))
    return Wij, Wjk

The shapes function shows us the shapes of the arrays involved and also peeks at the first few values.

In [95]:
def shapes(names, values):
    # print each array's name, its shape, and the first five values
    names = names.split(',')
    for name, value in zip(names, values):
        print(name + ':', value.shape, np.ravel(value)[:5])
In [96]:
Wij, Wjk = nn2(X, y)
[0.51196745 0.5063713 ]
[0.49514838 0.49941321]
[0.48286028 0.49321001]
[0.46600635 0.47944759]
[0.43942377 0.45398   ]
[0.38967362 0.4062853 ]
[0.3267855 0.3450476]
[0.26991601 0.28626563]
[0.22614773 0.23952279]
[0.19422023 0.20494774]
X: (100, 4) [5.1 3.5 1.4 0.2 4.9]
y: (100, 2) [1. 0. 1. 0. 1.]
Wij: (3, 4) [ 0.17391637  0.72534166 -1.07654751 -0.45805708  0.26711571]
Wjk: (2, 3) [ 1.04348379  1.1333032  -1.93411107 -1.20465651 -0.87925378]
zj: (3, 100) [1.81629464 1.42125002 1.63783855 1.33453408 1.87110746]
aj: (3, 100) [0.86012092 0.80553431 0.83724061 0.79158964 0.86658637]
zk: (2, 100) [1.69790601 1.51901351 1.61755983 1.44993611 1.71367089]
dk: (2, 100) [-0.02023905 -0.02646475 -0.02286753 -0.0292441  -0.01975396]
dj: (3, 100) [-0.00572791 -0.00968012 -0.00730032 -0.01120838 -0.00537294]

Let us go through this step by step:

After initializing the weights we calculate the weighted sum of inputs to the hidden layer

  • for all units
  • for all observations

Wij is 3x4 and X is 100x4, so for the dot product we must transpose X. This gives us the observations as columns instead of rows. Now we can use dot() to compute these weighted sums, again with columns for observations:

In [97]:
zj = np.dot(Wij, X.T)
zj.shape
Out[97]:
(3, 100)

The activation is simple: the function is applied element-wise, so it expands automatically to all values:

In [98]:
aj = f(zj)
aj.shape
Out[98]:
(3, 100)

We have the same structure here as before, with columns for observations. Therefore, the next steps are nothing new; the output layer simply gives us one row per output unit and one column per observation:

In [113]:
zk = np.dot(Wjk, aj)
ak = f(zk)
ak.shape
Out[113]:
(2, 100)

Now for the backpropagation of errors: we need to compute the values for the learning rule, starting with dk:

In [100]:
dk = (ak - y.T) * f_(zk)
dk.shape
Out[100]:
(2, 100)

Nothing tricky there. The next one is a little more involved: when we multiply by dk we want the sum over k, so we need to do a little transposing:

In [101]:
dk.T.shape
Out[101]:
(100, 2)
In [102]:
Wjk.shape
Out[102]:
(2, 3)
In [103]:
np.dot(dk.T, Wjk).shape
Out[103]:
(100, 3)
In [104]:
np.dot(dk.T, Wjk).T.shape
Out[104]:
(3, 100)

The sum over k has been implemented by using the dot() product.
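
To convince yourself that the dot() product really performs this sum over k, here is a small check that was not part of the original notebook; it compares the transposed dot product with an explicitly spelled-out sum (via np.einsum) over the arrays already in scope.

In [ ]:
# einsum labels: dk is (k, n), Wjk is (k, j), result is (j, n) with the sum taken over k
explicit = np.einsum('kn,kj->jn', dk, Wjk)
print(np.allclose(np.dot(dk.T, Wjk).T, explicit))   # should print True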

Now we can element-wise multiply with f_(zj), since

In [105]:
f_(zj).shape
Out[105]:
(3, 100)
In [106]:
dj = f_(zj) * np.dot(dk.T, Wjk).T
dj.shape
Out[106]:
(3, 100)

Something similar happens with the weight update: here the dot() product gives us the sum over all observations (or over a minibatch).

In [107]:
aj.shape
Out[107]:
(3, 100)

With the transpose of aj we can use the dot() product to sum over the observations:

In [108]:
l = 0.01
np.dot(-l * dk, aj.T).shape
Out[108]:
(2, 3)

And the weight update for Wij is even simpler, since both dj and X are already in the proper shape for the dot() product:

In [109]:
np.dot(-l * dj, X).shape
Out[109]:
(3, 4)

If you are not sure about this implementation, just try each step for yourself, add some more printout, and really get into the details. NumPy is a very powerful package, but like all powerful tools it needs some practice to use effectively.
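
One natural experiment along those lines, sketched here although it is not part of the original notebook, is to use the returned weight matrices for prediction: repeat the forward pass and pick the output unit with the highest activation.

In [ ]:
# forward pass with the trained weights (Wij, Wjk, X, y from the cells above are still in scope)
aj_pred = f(np.dot(Wij, X.T))        # hidden activations, shape (h, N)
ak_pred = f(np.dot(Wjk, aj_pred))    # output activations, shape (outputs, N)
pred = np.argmax(ak_pred, axis=0)    # predicted class per observation
true = np.argmax(y, axis=1)          # target class per observation
print('training accuracy:', np.mean(pred == true))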

Just to make sure that our code actually works with any number of output nodes, we apply it to the original one-column y vector:

In [110]:
y = y[:,0]
nn2(X,y)
[0.48916721]
[0.46808014]
[0.45205962]
[0.43523897]
[0.41807117]
[0.4017298]
[0.3864137]
[0.37001127]
[0.34199635]
[0.31002911]
X: (100, 4) [5.1 3.5 1.4 0.2 4.9]
y: (100, 1) [1. 1. 1. 1. 1.]
Wij: (3, 4) [-0.40589027 -0.93351618  1.46398987  0.955531   -0.42741509]
Wjk: (1, 3) [-1.94310531 -0.20146013  1.06507634]
zj: (3, 100) [-3.08233592 -2.53708385 -2.78768227 -2.36293447 -3.13477616]
aj: (3, 100) [0.04384179 0.07329901 0.05799344 0.08604315 0.04169535]
zk: (1, 100) [0.73873853 0.65328486 0.68745739 0.6032887  0.7405451 ]
dk: (1, 100) [-0.070724   -0.07704548 -0.07449594 -0.08081853 -0.07059242]
dj: (3, 100) [0.00571251 0.01008387 0.00784165 0.01224602 0.0054349 ]
Out[110]:
(array([[-0.40589027, -0.93351618,  1.46398987,  0.955531  ],
        [-0.42741509, -0.0910046 ,  0.53596936,  0.47762913],
        [ 0.28175007,  0.19779925, -0.48905633,  0.26153132]]),
 array([[-1.94310531, -0.20146013,  1.06507634]]))

The Keras Version

The above is a very nice programming exercise; however, in practical applications we will want to use a package like Keras for any neural architecture that is a little more complex, for several reasons:

  • We can easily add various types of layers with various activation functions
  • Keras computes the gradients automatically
  • If we have a supported and sufficiently powerful graphics card it can be used for computation

The last point is of course of particular interest for deep learning applications, which tend to be very costly in terms of computing power. How much faster can we get? To give a ballpark figure: for a problem with many thousands of observations, a mid-range GPU easily results in a speedup of about 5x out of the box.

For the code below you may have to install the package keras first, using pip3 or conda in the usual manner.
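
For example, from a notebook cell (drop the leading exclamation mark in a terminal); which command applies depends on your Python setup:

In [ ]:
# one-time installation (only needed if the import below fails);
# use conda install keras instead if you manage packages with Anaconda
!pip3 install keras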

The model is sequential in the sense that we are stacking layers that each have one input and one output. Within this structure we can still use recurrent components such as LSTM layers.

The basic layer type is Dense, which means that each unit is connected to every unit in the next layer (fully connected).

As this is the first layer after the input, it also needs the input_dim parameter; the units parameter is the number of units in the hidden layer.

We have a choice of activation functions; among the more common ones are the following (a short illustration comes after this list):

  • sigmoid
  • tanh
  • relu (Rectified Linear Unit)
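
The activation is selected per layer simply by passing its name; the following lines are only an illustration of the alternatives and are not added to the model built below.

In [ ]:
from keras.layers import Dense

# purely illustrative layer objects with different activations
hidden_sigmoid = Dense(units=16, activation='sigmoid')
hidden_tanh    = Dense(units=16, activation='tanh')
hidden_relu    = Dense(units=16, activation='relu')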

The next layer is our output layer. It has to conform to the number of output values, in this case a single value.

The compile() function creates the model with the computation graph in the format required by the backend. There is a configuration file in $HOME/.keras (keras.json) where you can set the backend (a quick check follows this list); the following backends should be supported:

  • theano
  • tensorflow (may require additional installation)
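
A quick way to check which backend is actually in use (a minimal sketch, not in the original notebook, assuming the default configuration path):

In [ ]:
import json, os
from keras import backend as K

print(K.backend())                             # name of the active backend

# the same setting as stored in the configuration file, if it exists
cfg = os.path.expanduser('~/.keras/keras.json')
if os.path.exists(cfg):
    with open(cfg) as fh:
        print(json.load(fh).get('backend'))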

The accuracy metric in this case means that the network output is turned into a class label and compared with the target: with the single sigmoid output below this amounts to thresholding at 0.5, while with several output units the unit with the highest value indicates the class. We will look at the raw outputs right after training.

The batch size is the size of the minibatch; usually something like 64 or 32, but for this very small problem a smaller value works better.

In [111]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=16, input_dim=X.shape[1], activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X, y,  epochs=10, batch_size=5)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_102 (Dense)            (None, 16)                80        
_________________________________________________________________
dense_103 (Dense)            (None, 1)                 17        
=================================================================
Total params: 97
Trainable params: 97
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
100/100 [==============================] - 0s 360us/step - loss: 0.2581 - acc: 0.5000
Epoch 2/10
100/100 [==============================] - 0s 203us/step - loss: 0.2420 - acc: 0.5000
Epoch 3/10
100/100 [==============================] - 0s 259us/step - loss: 0.2284 - acc: 0.6900
Epoch 4/10
100/100 [==============================] - 0s 177us/step - loss: 0.2157 - acc: 0.9900
Epoch 5/10
100/100 [==============================] - 0s 161us/step - loss: 0.2041 - acc: 1.0000
Epoch 6/10
100/100 [==============================] - 0s 192us/step - loss: 0.1928 - acc: 1.0000
Epoch 7/10
100/100 [==============================] - 0s 156us/step - loss: 0.1811 - acc: 1.0000
Epoch 8/10
100/100 [==============================] - 0s 186us/step - loss: 0.1697 - acc: 1.0000
Epoch 9/10
100/100 [==============================] - 0s 149us/step - loss: 0.1569 - acc: 1.0000
Epoch 10/10
100/100 [==============================] - 0s 161us/step - loss: 0.1461 - acc: 1.0000
Out[111]:
<keras.callbacks.History at 0x7f8fc4dc88d0>
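
To see what the accuracy is computed from, we can inspect the raw network outputs; a small sketch using the model trained above (not part of the original run):

In [ ]:
# raw sigmoid outputs for the first and last three observations ...
probs = model.predict(X[[0, 1, 2, -3, -2, -1]])
print(probs.ravel())
# ... and the corresponding class labels after thresholding at 0.5
print((probs.ravel() > 0.5).astype(int))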

The beauty of this approach is that now we can easily add more layers to our net.

The only thing we have to specify is the number of units in the new layer; the input and output shapes follow automatically.

We also use the validation_split parameter to get a more realistic idea of the performance.

In [112]:
model = Sequential()
model.add(Dense(units=16, input_dim=X.shape[1], activation="sigmoid"))
model.add(Dense(8))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X, y,  epochs=10, batch_size=5, validation_split=0.1)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_104 (Dense)            (None, 16)                80        
_________________________________________________________________
dense_105 (Dense)            (None, 8)                 136       
_________________________________________________________________
dense_106 (Dense)            (None, 1)                 9         
=================================================================
Total params: 225
Trainable params: 225
Non-trainable params: 0
_________________________________________________________________
None
Train on 90 samples, validate on 10 samples
Epoch 1/10
90/90 [==============================] - 0s 172us/step - loss: 0.2585 - acc: 0.5556 - val_loss: 0.3993 - val_acc: 0.0000e+00
Epoch 2/10
90/90 [==============================] - 0s 236us/step - loss: 0.2407 - acc: 0.5556 - val_loss: 0.2911 - val_acc: 0.0000e+00
Epoch 3/10
90/90 [==============================] - 0s 169us/step - loss: 0.2291 - acc: 0.5667 - val_loss: 0.2656 - val_acc: 0.0000e+00
Epoch 4/10
90/90 [==============================] - 0s 174us/step - loss: 0.2207 - acc: 0.7889 - val_loss: 0.2492 - val_acc: 0.7000
Epoch 5/10
90/90 [==============================] - 0s 402us/step - loss: 0.2105 - acc: 0.8333 - val_loss: 0.2324 - val_acc: 0.8000
Epoch 6/10
90/90 [==============================] - 0s 223us/step - loss: 0.2000 - acc: 0.8111 - val_loss: 0.2520 - val_acc: 0.7000
Epoch 7/10
90/90 [==============================] - 0s 145us/step - loss: 0.1857 - acc: 0.9778 - val_loss: 0.2074 - val_acc: 1.0000
Epoch 8/10
90/90 [==============================] - 0s 197us/step - loss: 0.1711 - acc: 1.0000 - val_loss: 0.2072 - val_acc: 1.0000
Epoch 9/10
90/90 [==============================] - 0s 149us/step - loss: 0.1553 - acc: 1.0000 - val_loss: 0.1821 - val_acc: 1.0000
Epoch 10/10
90/90 [==============================] - 0s 207us/step - loss: 0.1390 - acc: 1.0000 - val_loss: 0.1577 - val_acc: 1.0000
Out[112]:
<keras.callbacks.History at 0x7f8fc4243e10>

EXERCISES:

  • find more datasets and apply the basic Keras code above
  • change parameters, add more layers, and observe the results
  • if you have a graphics card: get the code to run on the GPU

The last point requires not only a compatible card but also a fair amount of installation and configuration; expect to invest a few hours before everything works.

However, once everything is set up, your Keras code will run on the GPU without any changes.
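
One way to check whether your installation actually sees the GPU is sketched below; it assumes the TensorFlow backend (for Theano the check is different).

In [ ]:
# list the devices TensorFlow can use; a working setup shows a device of type 'GPU'
from tensorflow.python.client import device_lib
print([d.name for d in device_lib.list_local_devices()])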
