Multi-Layer and Recurrent Neural Nets

The Keras package makes it very easy to change the network architecture, introduce more than one hidden layer, and switch between activation functions.

The TensorFlow backend automatically computes the gradients for the learning algorithm. Without this feature it would be very tedious or practically infeasible to re-derive and code the gradients by hand for every change in the architecture.

Architectures with more than one hidden layer are commonly referred to as Deep Learning.

Multi-Layer Feed-Forward: Pima Indians

The Pima Indians dataset has been used extensively in machine learning. It contains 768 observations, each consisting of 8 diagnostic values and a boolean variable indicating whether diabetes was diagnosed within 5 years after the examination.

We will employ a deep architecture on this dataset. Since only 268 of the 768 observations are positive diabetes cases, we use class weights to compensate for the imbalance between the classes.
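The class weights are computed with sklearn's 'balanced' heuristic, which assigns each class a weight inversely proportional to its frequency. A minimal sketch of what this amounts to for our data, assuming the documented formula n_samples / (n_classes * n_c):

# 'balanced' class weights for 768 observations with 268 positive cases
n_samples, n_pos = 768, 268
n_neg = n_samples - n_pos                # 500 negative cases
w_neg = n_samples / (2 * n_neg)          # ~0.77: down-weight the majority class
w_pos = n_samples / (2 * n_pos)          # ~1.43: up-weight the minority class
print({0: w_neg, 1: w_pos})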

In practical applications we often encounter data with missing values. The numpy function genfromtxt() can still read such data, inserting nan ('not a number') for the missing entries, but these values often cause problems later on. We therefore use the numpy function isnan() to find observations with missing values and remove them from the data.

The numerical values in this dataset are of very different magnitudes: some are in the hundreds while others are small fractions. To facilitate the gradient descent learning we scale all columns of X by dividing them by their standard deviations.

In [6]:
import numpy as np
import io
from collections import Counter
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle, class_weight
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Dropout

data = np.genfromtxt('pima_indians_diabetes.txt', delimiter=',')
# remove lines with missing values, if any
data = data[~np.isnan(data).any(axis=1)]
print('observations:', data.shape[0])
# last col is value to predict
print('positive cases:', sum(data[:,-1]==1.0))
data = np.random.permutation(data)
X, y = data[:,:-1], data[:,-1]
# scale: divide by std dev
X = X / np.std(X, axis=0)
cw = class_weight.compute_class_weight('balanced', np.unique(y), y)
cw = dict(enumerate(cw))  # Keras expects a dict mapping class index to weight

model = Sequential()
model.add(Dense(100, input_dim=X.shape[1], activation='tanh'))
model.add(Dropout(0.3))
model.add(Dense(50, activation='tanh'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, class_weight=cw)
observations: 768
positive cases: 268
Train on 614 samples, validate on 154 samples
Epoch 1/10
614/614 [==============================] - 1s 936us/step - loss: 0.2479 - acc: 0.6205 - val_loss: 0.2197 - val_acc: 0.6494
Epoch 2/10
614/614 [==============================] - 0s 187us/step - loss: 0.2220 - acc: 0.6564 - val_loss: 0.2183 - val_acc: 0.6558
Epoch 3/10
614/614 [==============================] - 0s 151us/step - loss: 0.2175 - acc: 0.6710 - val_loss: 0.2184 - val_acc: 0.6558
Epoch 4/10
614/614 [==============================] - 0s 158us/step - loss: 0.2195 - acc: 0.6596 - val_loss: 0.2174 - val_acc: 0.6753
Epoch 5/10
614/614 [==============================] - 0s 161us/step - loss: 0.2081 - acc: 0.6873 - val_loss: 0.2196 - val_acc: 0.6623
Epoch 6/10
614/614 [==============================] - 0s 175us/step - loss: 0.1966 - acc: 0.6954 - val_loss: 0.2182 - val_acc: 0.6818
Epoch 7/10
614/614 [==============================] - 0s 156us/step - loss: 0.2014 - acc: 0.6938 - val_loss: 0.2125 - val_acc: 0.6753
Epoch 8/10
614/614 [==============================] - 0s 183us/step - loss: 0.2018 - acc: 0.6922 - val_loss: 0.2100 - val_acc: 0.6948
Epoch 9/10
614/614 [==============================] - 0s 190us/step - loss: 0.2016 - acc: 0.6922 - val_loss: 0.2080 - val_acc: 0.6818
Epoch 10/10
614/614 [==============================] - 0s 220us/step - loss: 0.1860 - acc: 0.7313 - val_loss: 0.2080 - val_acc: 0.6753
Out[6]:

Recurrent Neural Nets

Many machine learning applications work on sequences of data, such as natural language texts of variable length.

Converting this type of data into some fixed-length format for processing with feed-forward networks is possible, but it is difficult to preserve the information contained in the order of the input.

Recurrent architectures naturally deal with sequences of variable length.

Basic Recurrent Net

A basic recurrent net maintains a memory state $h_t$ which is updated in each input step $t$. This type of net is suitable for processing short sequences, such as numerically encoded sentences.

The hidden state $h$ depends on the previous hidden state and the current input:

$ h_t = \sigma~ (W x_t + U h_{t-1})$

The output is computed from the hidden state:

$o_t = \sigma~ (V h_t) $
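A minimal numpy sketch of these two update rules, with hypothetical dimensions and random (untrained) weights; in an actual recurrent net the matrices W, U, V are learned:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
W, U, V = rng.randn(16, 8), rng.randn(16, 16), rng.randn(1, 16)   # hypothetical sizes

h = np.zeros(16)                  # initial hidden state
xs = rng.randn(5, 8)              # a sequence of 5 input vectors
for x in xs:
    h = sigmoid(W @ x + U @ h)    # h_t = sigma(W x_t + U h_{t-1})
    o = sigmoid(V @ h)            # o_t = sigma(V h_t)
print(o)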

The problem with this approach is that when the gap between the relevant piece of input and the output becomes too large, the net 'forgets' and cannot make the proper association.

LSTM

The problem of long-term dependencies is tackled by the LSTM (Long Short Term Memory) architecture [HS97].

Instead of updating only the hidden state in each time step, the LSTM introduces an additional cell state $C_t$ which is managed in a more sophisticated manner:

  • a forget gate decides on what to throw away,
  • an input gate decides on what to update, and
  • an output gate decides on what to output.

From the current input and the previous hidden state the values for the gates and the new candidate values for the cell state $C$ are computed:

$f_t = \sigma ~ (W_f \cdot [h_{t-1}, x_t] + b_f)$

$i_t = \sigma ~ (W_i \cdot [h_{t-1}, x_t] + b_i)$

$o_t = \sigma ~ (W_o \cdot [h_{t-1}, x_t] + b_o)$

$\tilde{C}_t = \tanh (W_C \cdot [h_{t-1}, x_t] + b_C)$

The new cell state $C_t$ is computed by 'forgetting' part of the previous state $C_{t-1}$ and (based on the current input) adding part of the candidate values $\tilde{C}_t$:

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$

The new hidden state is computed from the new cell state and the output gate:

$h_t = o_t * \tanh (C_t)$
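Again a minimal numpy sketch of a single LSTM time step following the equations above, with hypothetical dimensions; in Keras the weight matrices and biases are learned:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, Wf, bf, Wi, bi, Wo, bo, Wc, bc):
    z = np.concatenate([h_prev, x])      # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)             # forget gate
    i = sigmoid(Wi @ z + bi)             # input gate
    o = sigmoid(Wo @ z + bo)             # output gate
    C_tilde = np.tanh(Wc @ z + bc)       # candidate cell state
    C = f * C_prev + i * C_tilde         # new cell state
    h = o * np.tanh(C)                   # new hidden state
    return h, C

# hypothetical sizes: 8-dimensional input, 16-dimensional hidden/cell state
rng = np.random.RandomState(0)
Wf, Wi, Wo, Wc = [rng.randn(16, 24) for _ in range(4)]   # 24 = 16 + 8
bf = bi = bo = bc = np.zeros(16)
h, C = lstm_step(rng.randn(8), np.zeros(16), np.zeros(16),
                 Wf, bf, Wi, bi, Wo, bo, Wc, bc)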

Keras Implementation

The details of training a recurrent neural net are involved, but fortunately the Keras package takes care of that and allows us to concentrate on the data preparation and parameter tuning.

As always we start with a number of imports to make use of Keras and sklearn code.

We are using a rather small dataset here to allow for fast download; it still contains about 10000 sentences of variable length, in two sets of equal size labeled positive and negative. The task is to automatically predict the correct label (sentiment analysis).

The code below assumes that the two files rt-polarity.pos and rt-polarity.neg are present in the current directory.

We read the files line by line into a nested list of individual words and a vector of associated sentiment labels.

We also update the word count; this will allow us to identify the most common words.

In [7]:
# http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz
sents = []
labs = []
c = Counter()
for suffix in ['pos', 'neg']:
  for line in io.open('rt-polarity.' + suffix, 'r', encoding='utf-8', errors='ignore'):
    words = line.strip().split() 
    sents += [ words ]
    labs += [ int(suffix == 'pos') ]
    c.update(words)

Next we shuffle the nested list and the corresponding labels.

The parameter topwords is the size of the vocabulary; words not present in this vocabulary will all be mapped to a single 'unknown' index.

The words are then replaced by their index in the vocabulary, shifted by one so that index 0 remains reserved for unknown words. We arrive at a nested list of indices.

Note how the first sentence is encoded, and compare with the most common words in the vocabulary.

In [8]:
sents, labs = shuffle(np.array(sents), np.array(labs))
topwords = 10000
wlst = [ w for w, n in c.most_common(topwords) ]
vocd = { wlst[i]: i for i in range(len(wlst)) }
print('vocabulary:', wlst[:30], '...')
print('sentences', len(sents))
X = []
for s in sents:
  X += [ [ vocd[w]+1 if w in vocd else 0 for w in s ] ]

X = np.array(X)
y = np.array(labs)
print('input shape:', X.shape)
print('first sentence:', sents[0], 'label:', y[0])
print('encoding:', X[0])
vocabulary: ['.', 'the', ',', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'it', 'as', 'but', 'with', 'film', 'this', 'for', 'its', 'an', 'movie', "it's", 'be', 'on', 'you', 'not', 'by', 'about', 'more', 'one', 'like'] ...
sentences 10662
input shape: (10662,)
first sentence: ["it's", 'this', 'memory-as-identity', 'obviation', 'that', 'gives', 'secret', 'life', 'its', 'intermittent', 'unease', ',', 'reaffirming', 'that', 'long-held', 'illusions', 'are', 'indeed', 'reality', ',', 'and', 'that', 'erasing', 'them', 'recasts', 'the', 'self', '.'] label: 1
encoding: [21, 16, 0, 0, 10, 251, 2115, 93, 18, 4489, 6331, 3, 8033, 10, 0, 0, 32, 882, 490, 3, 5, 10, 0, 110, 0, 2, 6628, 1]

We pad the sequences so they are all of the same length, which is required by the Keras package. Padding is done by adding zeros at the start of each sequence (longer sequences are truncated). Zero stands for 'unknown', which is also the index used for words not in the vocabulary.

This padding with zeros is the reason we used index origin 1 in the previous step. In this case it would not make much of a difference anyway, since the first entry in the vocabulary is the dot, which does not convey much information.

A sample sentence is printed so we can check the encoding.

In [9]:
maxlen=30
X = pad_sequences(X, maxlen=maxlen)
print(X[0])
[   0    0   21   16    0    0   10  251 2115   93   18 4489 6331    3
 8033   10    0    0   32  882  490    3    5   10    0  110    0    2
 6628    1]

We are now ready to build our model. The Keras package provides a convenient embedding layer that translates each word index into a vector of floating point numbers. This encoding is learned along with the other network parameters and saves us the trouble of coming up with our own word feature encoding.
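Conceptually the embedding layer is just a trainable lookup table; a minimal sketch with hypothetical values (the sizes match the topwords+1 vocabulary and 50 dimensions used below):

import numpy as np

vocab_size, dim = 10001, 50                    # topwords + 1, embedding dimension
W = np.random.randn(vocab_size, dim) * 0.05    # the trainable embedding matrix
sentence = np.array([21, 16, 0, 0, 10])        # a sequence of word indices
vectors = W[sentence]                          # shape (5, 50): one vector per word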

As usual with this type of machine learning approach, the accuracy on the training set is higher than on the validation set.

In [10]:
def nn():
  model = Sequential()
  model.add(Embedding(topwords+1, 50, input_length=maxlen))
  model.add(LSTM(100, dropout=0.2))
  model.add(Dense(1, activation='sigmoid'))
  model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  model.fit(X, y, validation_split=0.2, epochs=10, batch_size=64)

nn()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 30, 50)            500050    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               60400     
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 101       
=================================================================
Total params: 560,551
Trainable params: 560,551
Non-trainable params: 0
_________________________________________________________________
None
Train on 8529 samples, validate on 2133 samples
Epoch 1/10
8529/8529 [==============================] - 18s 2ms/step - loss: 0.2218 - acc: 0.6275 - val_loss: 0.1828 - val_acc: 0.7243
Epoch 2/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1248 - acc: 0.8298 - val_loss: 0.1707 - val_acc: 0.7647
Epoch 3/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0738 - acc: 0.9017 - val_loss: 0.1765 - val_acc: 0.7595
Epoch 4/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0458 - acc: 0.9423 - val_loss: 0.1832 - val_acc: 0.7511
Epoch 5/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0311 - acc: 0.9634 - val_loss: 0.1965 - val_acc: 0.7482
Epoch 6/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0246 - acc: 0.9698 - val_loss: 0.2048 - val_acc: 0.7454
Epoch 7/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0211 - acc: 0.9760 - val_loss: 0.2093 - val_acc: 0.7515
Epoch 8/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0177 - acc: 0.9795 - val_loss: 0.2153 - val_acc: 0.7511
Epoch 9/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0156 - acc: 0.9829 - val_loss: 0.2225 - val_acc: 0.7403
Epoch 10/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.0131 - acc: 0.9850 - val_loss: 0.2317 - val_acc: 0.7370

GPU Support

At the time of writing (May 2019) TensorFlow only supports NVIDIA graphics cards with CUDA compute capability 3.5 or higher. You also need a card with reasonable performance to see any speedup at all; cheap entry-level models are the GTX 1050 Ti and the RTX 2060.

Installing all the required drivers and CUDA software can be a tedious task, but you only need to do it once (until you buy a new card or significantly change your system).

If a GPU device is shown when you execute the code below, then the computations can run faster by a factor ranging from 2x to 10x or even 20x, depending on your graphics card and the task. The speedup will only show for demanding applications; e.g. in this example you may need to increase the number of units in the LSTM to 200 or 300. Since there is considerable overhead in using the GPU, it is faster to compute less demanding tasks on the CPU.

In [11]:
import tensorflow as tf
print('GPU Device:', tf.test.gpu_device_name())
GPU Device: 

If no GPU device is available then the Keras code will run fine on the CPU. It should use all available CPU cores if the linear algebra libraries are installed and configured; otherwise, only one core will be used, and the performance will suffer significantly.
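The number of CPU threads used by TensorFlow can also be set explicitly; a minimal sketch, assuming the TensorFlow 1.x API used elsewhere in these notes (the thread counts are arbitrary examples):

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(intra_op_parallelism_threads=4,
                        inter_op_parallelism_threads=2)
K.set_session(tf.Session(config=config))   # make Keras use this session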

The code below shows how to measure the speedup when comparing CPU and GPU computing.

In [ ]:
import tensorflow as tf
from time import time

with tf.device('/cpu:0'):
  t = time()
  nn()
  tcpu = time()-t
  print('Time CPU:', tcpu)
with tf.device('/gpu:0'):
  t = time()
  nn()
  tgpu = time()-t
  print('Time GPU:', tgpu)

print('Speedup:', tcpu/tgpu)

Word Embeddings

In this method of encoding, each word is associated with a floating point vector of fixed dimension; usually dimensions from 25 to 500 are used. These vectors have been computed from very large text corpora such that words with similar meanings are assigned similar vectors.

The original Glove downloads are very large; an abbreviated version is provided at the address below. It only contains the most common 10k words and their embeddings in 50 dimensions.

http://balrog.wu.ac.at/~mitloehn/glove.10k.txt

The code below reads the encodings into a numpy array and checks some similarities. The length of the difference vector is computed with np.linalg.norm().

In [12]:
glove = np.genfromtxt('glove.10k.txt', dtype=str)
print(glove)
vocab = glove.shape[0]
idx = { glove[i,0] : i for i in range(vocab) } 
E = glove[:,1:].astype(float)

dog, cat, house = idx['dog'], idx['cat'], idx['house']
d1 = E[dog] - E[house]
d2 = E[dog] - E[cat]
for x in (E[dog], E[cat], E[house], d1, d2):
  print(np.linalg.norm(x))
[['the' '0.418' '0.24968' ... '-0.18411' '-0.11514' '-0.78581']
 [',' '0.013441' '0.23682' ... '-0.56657' '0.044691' '0.30392']
 ['.' '0.15164' '0.30177' ... '-0.35652' '0.016413' '0.10216']
 ...
 ['taxpayer' '0.43275' '-0.58476' ... '0.85172' '0.56007' '0.77719']
 ['resistant' '0.40705' '-1.1284' ... '-0.58008' '0.043716' '0.17185']
 ['quinn' '-0.5964' '-0.039918' ... '-0.067509' '0.20066' '0.85808']]
4.858045696798486
4.407863183896478
5.160355464594687
5.512427126451155
1.884603106672673

Some surprising operations with word embeddings are possible, such as the difference of vectors representing relationships:

E[France] - E[Paris] is similar to E[Italy] - E[Rome]

The relationship 'capital' has been captured. Note that this relies on co-occurrence of words in very large corpora, e.g. the Glove embeddings we use here are based on text corpora of 6 billion tokens.

In [13]:
france, paris, italy, rome = idx['france'], idx['paris'], idx['italy'], idx['rome']
cap1 = E[france] - E[paris]
cap2 = E[italy] - E[rome]
for x in (E[france], E[paris], E[italy], E[rome], cap1, cap2, cap1 - cap2):
  print(np.linalg.norm(x))
5.968037898663447
5.5478707823002695
5.679180696598178
4.899884224248329
3.640435481687349
3.804359806868363
3.004158088013115

Word embeddings can also be used to embed whole sentences, by simply averaging the vectors of all the words in the sentence. The code below defines a function that encodes a list of words into a single embedding vector. We then check whether similarity of meaning can still be observed; as we can see, it is less convincing with whole sentences.

In [14]:
def embed(lst):
  e = np.array([ E[idx[w]] for w in lst if w in idx ])
  if len(e) == 0: return E[idx['.']]
  else: return np.sum(e, axis=0) / len(e)

e1 = embed('the cat enjoys relaxing'.split())
e2 = embed('the dog likes to sleep'.split())
e3 = embed('the house is on fire'.split())

for x in (e1, e2, e3, e1-e2, e1-e3, e2-e3):
  print(np.linalg.norm(x))
3.307478679608635
3.6553124948189284
4.388313591786297
1.7353424753641697
2.594773420787756
2.3012279583162725

We can now use the pre-trained word embeddings for the sentiment task on the movie reviews by supplying the Keras embedding layer with the Glove weights.

In [15]:
print(len(sents), len(labs))
print(sents[0], labs[0])

indices = [ [ idx[w]+1 if w in idx else 0 for w in s ] for s in sents ]
maxlen = 30
W = np.append(np.zeros((1,50)), E, axis=0)
X = pad_sequences(np.array(indices), maxlen)
y = labs

model = Sequential()
model.add(Embedding(len(W), 50, input_length=maxlen, weights=[W], trainable=True))
model.add(LSTM(100, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.2)
10662 10662
["it's", 'this', 'memory-as-identity', 'obviation', 'that', 'gives', 'secret', 'life', 'its', 'intermittent', 'unease', ',', 'reaffirming', 'that', 'long-held', 'illusions', 'are', 'indeed', 'reality', ',', 'and', 'that', 'erasing', 'them', 'recasts', 'the', 'self', '.'] 1
Train on 8529 samples, validate on 2133 samples
Epoch 1/10
8529/8529 [==============================] - 18s 2ms/step - loss: 0.2393 - acc: 0.5810 - val_loss: 0.2132 - val_acc: 0.6592
Epoch 2/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.2142 - acc: 0.6602 - val_loss: 0.2002 - val_acc: 0.6882
Epoch 3/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1909 - acc: 0.7082 - val_loss: 0.1888 - val_acc: 0.7056
Epoch 4/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1745 - acc: 0.7335 - val_loss: 0.1906 - val_acc: 0.7154
Epoch 5/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1593 - acc: 0.7621 - val_loss: 0.1826 - val_acc: 0.7239
Epoch 6/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1461 - acc: 0.7857 - val_loss: 0.1836 - val_acc: 0.7286
Epoch 7/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1397 - acc: 0.8015 - val_loss: 0.1803 - val_acc: 0.7300
Epoch 8/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1276 - acc: 0.8207 - val_loss: 0.1845 - val_acc: 0.7300
Epoch 9/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1194 - acc: 0.8354 - val_loss: 0.1837 - val_acc: 0.7295
Epoch 10/10
8529/8529 [==============================] - 17s 2ms/step - loss: 0.1115 - acc: 0.8458 - val_loss: 0.1851 - val_acc: 0.7206
Out[15]:

It turns out that in this case using the pre-trained word embeddings did not result in a performance improvement. However, the overfitting on the training set is not as pronounced in this version as with the default randomly initialised Keras embeddings, which move more quickly to task-specific values.
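Since trainable=True was passed above, the Glove vectors are fine-tuned during training. They can also be kept fixed by freezing the embedding layer, which reduces the number of trainable parameters; a minimal variant of the model definition:

model = Sequential()
model.add(Embedding(len(W), 50, input_length=maxlen, weights=[W],
                    trainable=False))   # keep the pre-trained Glove vectors fixed
model.add(LSTM(100, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])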

BERT

Using pre-trained word embeddings is similar to using a pre-trained part of a neural net and applying it to a different problem. This idea is taken further with the latest advances in machine learning, exemplified by BERT, the Bidirectional Encoder Representations from Transformers [BERT]. Essentially BERT is a component trained as a language model, i.e. to predict words in sentences.

Training a neural architecture like BERT on a sufficiently large corpus is computationally very expensive and only feasible on very high performance hardware. However, pre-trained versions of BERT can be downloaded and used as ready-made components in other tasks; these then only require fine-tuning, which is feasible on more readily available hardware.
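To illustrate what such fine-tuning looks like in practice, here is a minimal sketch for the sentiment task above, assuming the Hugging Face transformers package and the 'bert-base-uncased' model (neither is used elsewhere in these notes; the sents and labs lists are those built earlier):

import numpy as np
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

texts = [' '.join(s) for s in sents]   # re-join the tokenised sentences
enc = tokenizer(texts, padding=True, truncation=True, max_length=30, return_tensors='tf')

model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(dict(enc), np.array(labs), epochs=2, batch_size=32, validation_split=0.2)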

[HS97] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory", Neural Computation 9(8): 1735-1780 (1997).

[BERT] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805 (2018).
