Neural Nets with Keras

The Keras toolkit provides a number of commonly used neural net modules for Python. A framework like Keras facilitates development; in conjunction with the default TensorFlow backend it also makes it easy to switch from CPU to GPU computing without any changes to the Python code.

The following example illustrates the use of Keras with the Iris dataset, which contains 150 observations on 5 variables. This dataset is small and well suited for demonstrating the main concepts of neural net machine learning; note, however, that neural nets are typically trained on much larger datasets.

Prerequisites

To perform your own computations you can copy the code into a script file such as ml.py and run it on the command line with

python3 ml.py

You may need to install the packages keras and tensorflow:

pip3 install keras --user

pip3 install tensorflow --user

The example assumes that the file iris.data is present in the current directory. This file is available from the UCI Machine Learning Repository at the URL https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Preparing the Data

The NumPy utility genfromtxt() is used to read the file.

Column 4 (with index origin 0) contains the species name, which is converted to a number by the user-defined function conv. Unfortunately, in Python 3 the NumPy utility genfromtxt() does not return strings for text fields but byte sequences; therefore, the comparison needs to specify byte sequences, indicated by the b in front of the string literal.

The print statements check whether the import was successful: there should be 150 rows and 5 columns, and the last column should only contain the numbers 0.0, 1.0, and 2.0. The first and last few rows are also printed.

In [41]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# map the species name (a byte string under Python 3) to a numeric label
def conv(s):
    if s == b'Iris-setosa': return 0.0
    if s == b'Iris-versicolor': return 1.0
    if s == b'Iris-virginica': return 2.0
    return -1

# read the CSV file, converting the species column with conv
data = np.genfromtxt('iris.data', delimiter=',', converters={4:conv})
print(data.shape)
print(set(data[:,4]))
print(data[:3,:])
print(data[-3:,:])
(150, 5)
{0.0, 1.0, 2.0}
[[5.1 3.5 1.4 0.2 0. ]
 [4.9 3.  1.4 0.2 0. ]
 [4.7 3.2 1.3 0.2 0. ]]
[[6.5 3.  5.2 2.  2. ]
 [6.2 3.4 5.4 2.3 2. ]
 [5.9 3.  5.1 1.8 2. ]]

Two-Class Problem

To simplify the situation we take only the first 100 rows of the Iris data, which contain the species setosa and versicolor. These two classes are linearly separable.

The input matrix X to the neural net consists of numeric values for sepal and petal length and width. The output vector y is the corresponding species coded as 0 and 1.

In [42]:
d = np.random.permutation(data[:100])   # shuffle the first 100 rows
X = d[:,:4]                             # features: sepal/petal length and width
y = d[:,4]                              # labels: 0 = setosa, 1 = versicolor
print(X[:3,:])
print(y)
[[4.8 3.  1.4 0.1]
 [5.8 2.7 4.1 1. ]
 [5.5 2.3 4.  1.3]]
[0. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1.
 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1.
 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1.
 0. 1. 0. 0.]

The neural net model consists of a densely connected hidden layer with a small number of units using the activation function tanh, followed by the densely connected output layer using a function with output between 0 and 1, such as the sigmoid function.

Keras provides several optimizers; Adam is used here, as it is generally a good choice: it adjusts the learning rate during training. The loss function is the mean squared error, and the metric is accuracy, i.e. how often the correct class is predicted.

A small batch of observations is used in every training step; after each step the weights are updated. When all data has been processed, the next epoch begins. With 100 observations and a batch size of 5, each epoch therefore comprises 20 weight updates.

The weights are initialised randomly, and the data is shuffled in each epoch. In most cases convergence is quick; however, in some cases the algorithm gets stuck in a local minimum of the cost function, with the accuracy remaining at 0.5.

In [43]:
model = Sequential()
model.add(Dense(8, input_dim=4, activation='tanh'))   # hidden layer: 8 units
model.add(Dense(1, activation='sigmoid'))             # output layer: one value in [0, 1]
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=5)
Epoch 1/10
100/100 [==============================] - 1s 7ms/step - loss: 0.2308 - acc: 0.5000
Epoch 2/10
100/100 [==============================] - 0s 709us/step - loss: 0.1970 - acc: 0.5300
Epoch 3/10
100/100 [==============================] - 0s 708us/step - loss: 0.1671 - acc: 0.9200
Epoch 4/10
100/100 [==============================] - 0s 591us/step - loss: 0.1460 - acc: 1.0000
Epoch 5/10
100/100 [==============================] - 0s 627us/step - loss: 0.1290 - acc: 1.0000
Epoch 6/10
100/100 [==============================] - 0s 669us/step - loss: 0.1151 - acc: 1.0000
Epoch 7/10
100/100 [==============================] - 0s 594us/step - loss: 0.1033 - acc: 1.0000
Epoch 8/10
100/100 [==============================] - 0s 556us/step - loss: 0.0937 - acc: 1.0000
Epoch 9/10
100/100 [==============================] - 0s 675us/step - loss: 0.0851 - acc: 1.0000
Epoch 10/10
100/100 [==============================] - 0s 646us/step - loss: 0.0780 - acc: 1.0000
Out[43]:
<keras.callbacks.History at 0x7ff8373fa908>
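
If training does get stuck at an accuracy of 0.5, restarting with freshly initialised weights is usually sufficient. A minimal sketch: the helper build_model() and the 0.9 accuracy threshold are illustrative assumptions, not part of the original example; in the Keras version used here the accuracy is stored under the history key 'acc', as in the log above.

def build_model():
    # a fresh model receives new random initial weights on every call
    m = Sequential()
    m.add(Dense(8, input_dim=4, activation='tanh'))
    m.add(Dense(1, activation='sigmoid'))
    m.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
    return m

model = build_model()
hist = model.fit(X, y, epochs=10, batch_size=5)
while hist.history['acc'][-1] < 0.9:    # stuck in a local minimum?
    model = build_model()               # re-initialise and try again
    hist = model.fit(X, y, epochs=10, batch_size=5)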

Multi-Class Problem

Since the full Iris dataset contains three species, the neural net needs to be modified to deal with all three classes.

The output layer of the net uses the softmax function, which produces vectors of non-negative values summing to one. The target classes are encoded in the same format: 1 0 0 for setosa, 0 1 0 for versicolor, and 0 0 1 for virginica.

The training vector y needs to be cast into this one-hot format, which is done with the to_categorical() utility from keras.utils.

In [44]:
from keras.utils import np_utils

d = np.random.permutation(data)       # shuffle all 150 rows
X = d[:,:4]
y = np_utils.to_categorical(d[:,4])   # one-hot encode the class labels
print(X[:3,:])
print(y[:5])
[[5.  3.5 1.6 0.6]
 [5.  3.  1.6 0.2]
 [7.3 2.9 6.3 1.8]]
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]

The net needs only small changes: the number of output units and the type of output activation function.

The problem is now a little more complex; an accuracy of about 0.95 is feasible.

In [45]:
model = Sequential()
model.add(Dense(12, input_dim=4, activation='tanh'))   # hidden layer: 12 units
model.add(Dense(3, activation='softmax'))              # output layer: one unit per class
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=5)
Epoch 1/10
150/150 [==============================] - 1s 5ms/step - loss: 0.2549 - acc: 0.3400
Epoch 2/10
150/150 [==============================] - 0s 593us/step - loss: 0.2194 - acc: 0.6667
Epoch 3/10
150/150 [==============================] - 0s 564us/step - loss: 0.1938 - acc: 0.6667
Epoch 4/10
150/150 [==============================] - 0s 560us/step - loss: 0.1613 - acc: 0.6667
Epoch 5/10
150/150 [==============================] - 0s 602us/step - loss: 0.1273 - acc: 0.6800
Epoch 6/10
150/150 [==============================] - 0s 572us/step - loss: 0.1110 - acc: 0.8200
Epoch 7/10
150/150 [==============================] - 0s 605us/step - loss: 0.1039 - acc: 0.8200
Epoch 8/10
150/150 [==============================] - 0s 616us/step - loss: 0.0984 - acc: 0.8400
Epoch 9/10
150/150 [==============================] - 0s 651us/step - loss: 0.0928 - acc: 0.8733
Epoch 10/10
150/150 [==============================] - 0s 579us/step - loss: 0.0884 - acc: 0.8867
Out[45]:
<keras.callbacks.History at 0x7ff836de2780>
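
To turn the softmax output into a predicted species, the index of the largest value in each probability vector can be taken; a minimal sketch using the model trained above:

probs = model.predict(X[:3])       # one probability vector per observation
print(probs)                       # each row sums to one
print(np.argmax(probs, axis=1))    # index of the largest value = predicted class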

Overfitting is a problem for this type of machine learning: with a sufficient number of hidden units the net can approximate any function to any desired degree. The performance on the training data is therefore not the most meaningful information.

For a proper evaluation we need to separate the data into training and validation sets, since we are interested in the performance on unseen data.

We use an 80/20 split: the last 30 of the 150 shuffled observations are held out of the training set and used for evaluation.

In [46]:
d = np.random.permutation(data)
X = d[:,:4]
y = np_utils.to_categorical(d[:,4])
Xt = X[:-30,:]   # training data: first 120 rows
yt = y[:-30,:]
Xv = X[-30:,:]   # validation data: last 30 rows
yv = y[-30:,:]
print(yv)
[[0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
In [47]:
model.fit(Xt, yt, epochs=10, batch_size=5)   # continue training, now only on the training rows
model.evaluate(Xv, yv)                       # loss and accuracy on the 30 held-out rows
Epoch 1/10
120/120 [==============================] - 0s 599us/step - loss: 0.0842 - acc: 0.9333
Epoch 2/10
120/120 [==============================] - 0s 543us/step - loss: 0.0800 - acc: 0.9333
Epoch 3/10
120/120 [==============================] - 0s 551us/step - loss: 0.0783 - acc: 0.9250
Epoch 4/10
120/120 [==============================] - 0s 593us/step - loss: 0.0752 - acc: 0.9417
Epoch 5/10
120/120 [==============================] - 0s 545us/step - loss: 0.0736 - acc: 0.9333
Epoch 6/10
120/120 [==============================] - 0s 575us/step - loss: 0.0709 - acc: 0.9333
Epoch 7/10
120/120 [==============================] - 0s 624us/step - loss: 0.0693 - acc: 0.9333
Epoch 8/10
120/120 [==============================] - 0s 568us/step - loss: 0.0669 - acc: 0.9583
Epoch 9/10
120/120 [==============================] - 0s 618us/step - loss: 0.0655 - acc: 0.9500
Epoch 10/10
120/120 [==============================] - 0s 578us/step - loss: 0.0630 - acc: 0.9583
30/30 [==============================] - 0s 9ms/step
Out[47]:
[0.07555551081895828, 0.8999999761581421]

The result is typical for this type of machine learning: the performance on the training set is very good (about 0.96), but noticeably worse on the held-out set (0.90).
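
The gap between the two can also be monitored during training by passing the held-out data to fit(). A minimal sketch; a fresh model is compiled first so that the evaluation is not influenced by the earlier training runs:

model = Sequential()
model.add(Dense(12, input_dim=4, activation='tanh'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
# loss and accuracy on (Xv, yv) are reported after every epoch
model.fit(Xt, yt, epochs=10, batch_size=5, validation_data=(Xv, yv))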

Careful tuning of the parameters can achieve a good balance; however, this introduces another problem: when we tune in order to achieve better performance on the test data, we are again performing an optimization on that data. It is therefore common practice to separate the data into three sets (a split is sketched after the list):

  • training data: fit the model
  • validation data: evaluate performance and update parameters
  • test data: for final evaluation; no more parameter updates
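
A minimal sketch of such a three-way split for the 150 shuffled Iris rows; the 90/30/30 sizes are an illustrative choice:

d = np.random.permutation(data)
X = d[:,:4]
y = np_utils.to_categorical(d[:,4])
Xt, yt = X[:90], y[:90]            # training data: fit the model
Xv, yv = X[90:120], y[90:120]      # validation data: evaluate and tune
Xte, yte = X[120:], y[120:]        # test data: final evaluation only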

This approach only works if the evaluation on the test data is in fact final. There is a certain temptation to disregard this procedure. Competitions are one way to avoid the problem: participants submit the outputs of their approach for the test inputs, and the correct outputs for the test data are only published after the competition has finished.
