Recurrent Neural Nets

As an example of the many advanced neural net architectures we will now look at the recurrent network. In this approach the current input signal has a sequence or time index, e.g. the current word in a text such as a short product review whose sentiment is to be determined.

The basic recurrent net uses a single hidden state $h$ as a sort of memory for what it has already seen in the previous inputs; this hidden state and the current input $x_t$ together determine the next hidden state $h_t$ and the current output $o_t$:

$$ \begin{align} h_t & = f(U h_{t-1} + V x_t ) \\ o_t & = g(W h_t) \end{align} $$

Activation functions $f$ and $g$ can be any combination of sigmoid, tanh, or other non-linear functions; $g$ may also be omitted. Weight matrices $U, V, W$ are the parameters to be trained, e.g. with gradient descent.
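As a minimal sketch of a single recurrent step (the sizes and the tanh/sigmoid choice here are just assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# assumed sizes for illustration: hidden state of length 4, input of length 3
U = np.random.randn(4, 4)   # hidden-to-hidden weights
V = np.random.randn(4, 3)   # input-to-hidden weights
W = np.random.randn(1, 4)   # hidden-to-output weights

h_prev = np.zeros(4)         # previous hidden state
x_t = np.random.randn(3)     # current input

h_t = np.tanh(U @ h_prev + V @ x_t)   # f = tanh
o_t = sigmoid(W @ h_t)                # g = sigmoid
print(h_t, o_t)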

Bias

The output of a single layer is often written as

$$ o = f(w \cdot x + b) $$

where the bias $b$ is another parameter to be trained, e.g. with gradient descent. Instead of this additional complication we can omit the separate parameter and add 1 as another element to the input vector $x$; the parameter $b$ then becomes just another element in the weight vector $w$. In many applications we can even omit this additional constant in the input without any noticeable effect on the network performance.
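The equivalence is easy to check numerically; a small sketch with arbitrary numbers:

import numpy as np

w = np.array([0.5, -1.2, 0.3])
b = 0.7
x = np.array([1.0, 2.0, 3.0])

# explicit bias
z1 = np.dot(w, x) + b
# bias folded into the weights: extra weight b on a constant input of 1
w_ext = np.append(w, b)
x_ext = np.append(x, 1.0)
z2 = np.dot(w_ext, x_ext)

print(z1, z2)   # identical values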

Unrolling

This architecture raises the question of how we can find derivatives and train the weights when $o_t$ depends not only on the current input but also on all previous inputs $x_t, x_{t-1}, x_{t-2}, \ldots$

The concept of unrolling provides the answer, at least conceptually:

In the application phase the inputs and hidden states change inside the RNN while the weights stay the same. This provides the clue for how to train the network: unroll to a specified number of levels, compute the derivatives and train e.g. with gradient descent. The actual implementation may be different, but the end result is the same: we can train recurrent neural networks in a similar fashion to feedforward networks.
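Conceptually, unrolling just means applying the same weights at every time step; a small sketch of the forward pass over a short sequence (the sizes are arbitrary):

import numpy as np

T, n_in, n_hid = 5, 3, 4                 # sequence length and assumed sizes
U = np.random.randn(n_hid, n_hid)
V = np.random.randn(n_hid, n_in)
W = np.random.randn(1, n_hid)
xs = np.random.randn(T, n_in)            # one input vector per time step

h = np.zeros(n_hid)
for t in range(T):                       # the same U, V, W are reused at every step
    h = np.tanh(U @ h + V @ xs[t])
    o = 1 / (1 + np.exp(-(W @ h)))
print(o)                                 # output after the last time step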

Deep Learning

With the advent of powerful and affordable graphics cards that can act as general-purpose GPUs even on entry-level desktop computers, recent years have seen a renewed interest in connectionist approaches to machine learning; the term deep learning is commonly used when more than two layers are involved.

However, there is more to the deep learning movement that revived the somewhat dormant field of connectionist learning, which had already started in the 1980s but lost some of its momentum due to

  • the limited computing power of the time
  • the lack of large scale datasets for training

Both are now abundant, at least compared to the situation 30 years ago. In addition, there have also been some decisive advances in the theory of connectionist machine learning, among them

  • stochastic gradient descent
  • new architectures, such as LSTM
  • dropout

It was found that in order to train a neural net it is not necessary to go through all observations before another weight update; even a small sample usually provides enough data to estimate the gradient. This stochastic gradient descent results in vastly improved speed on large datasets.
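As an illustration, a mini-batch update for a toy linear model in plain numpy (the model, batch size, and learning rate are just placeholders):

import numpy as np

# toy data: y = 3*x + noise
X = np.random.randn(10000, 1)
y = 3 * X[:, 0] + 0.1 * np.random.randn(10000)

w, lr, batch = 0.0, 0.1, 32
for step in range(200):
    idx = np.random.choice(len(X), batch)                  # small random sample
    grad = np.mean((w * X[idx, 0] - y[idx]) * X[idx, 0])   # gradient on the sample only
    w -= lr * grad                                         # update after each mini-batch
print(w)   # close to 3 without ever using the full dataset for a single update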

Training recurrent architectures used to be extremely costly in terms of computing power, and the basic recurrent neural net suffers from major drawbacks, most of all 'forgetting' earlier input tokens that are decisive for the final output. Newer architectures such as the LSTM improve on the basic model and perform much better even on long input sequences, as they 'learn to forget' the unimportant input tokens.

The problem of generalization of learned weights to unseen data has been tackled with the idea of dropout, i.e. setting a certain fraction of the neuron activations to zero during training, thereby forcing the net to encode information in a more 'general' fashion that applies more easily to unseen data, again resulting in significant improvements.
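The basic mechanism is easy to sketch; here activations are masked at random during training (the rate of 0.5 is an arbitrary choice, and the rescaling keeps the expected activation unchanged):

import numpy as np

a = np.random.rand(8)                 # some layer activations
rate = 0.5
mask = (np.random.rand(8) > rate)     # keep each unit with probability 1 - rate
a_dropped = a * mask / (1 - rate)     # rescale so the expected activation stays the same
print(a, a_dropped)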

All these advances together reignited interest in the connectionist approach in the first decade of the new millennium, with sometimes spectacular results, especially in image processing.

With more complex network architectures both the derivatives and the implementation become a little too involved for us to study and code in detail. Here we will be content to use a package that does all the footwork for us: in this case, Keras.

Keras

There are many options for implementing neural nets, and many packages to help us with this task. Keras is a popular choice. Like all frameworks, it has both advantages and drawbacks:

  • Keras allows us to specify the architecture, and the derivatives are computed automatically
  • it supports various backends, i.e. software that performs the computation, such as
    • Theano
    • tensorflow
  • Keras provides a number of building blocks for neural nets, including recurrent architectures
  • we have to stay within the confines of the types of architectures supported
  • or we have to implement our own additions, which can be very tricky

If we want to study new types of architectures another approach is more appropriate, such as using Theano or tensorflow directly. In our case here this will not be necessary.

Sentiment Detection

A very common task in machine learning is the automatic detection of sentiment in reasonably short texts, such as product reviews or social media postings. This task lends itself well to recurrent nets since the number of input words is variable; the net has to learn which words are relevant to the sentiment, and which words to ignore. Recurrent nets are surprisingly good at this type of task.

Our dataset can be found on the UCI Machine Learning repository:

https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

Download the ZIP file, extract the file imdb_labelled.txt, and upload it to the notebook server, or put it into your current directory.

Now we are ready to import our data:

In [161]:
import numpy as np

# each line contains a sentence followed by its label (0 or 1)
data = [ line.split() for line in open('imdb_labelled.txt').readlines() ]
# strip trailing '.' and ',' from each word; the last token of each line is the label
sents = [ [ w.strip('.,') for w in line[:-1] ] for line in data ]
y = [ int(line[-1]) for line in data ]
print(sents[:10])
print('mean sentence length:', np.mean([len(s) for s in sents]))
print(y[:10])
print(len(y), sum(y))
[['A', 'very', 'very', 'very', 'slow-moving', 'aimless', 'movie', 'about', 'a', 'distressed', 'drifting', 'young', 'man'], ['Not', 'sure', 'who', 'was', 'more', 'lost', '-', 'the', 'flat', 'characters', 'or', 'the', 'audience', 'nearly', 'half', 'of', 'whom', 'walked', 'out'], ['Attempting', 'artiness', 'with', 'black', '&', 'white', 'and', 'clever', 'camera', 'angles', 'the', 'movie', 'disappointed', '-', 'became', 'even', 'more', 'ridiculous', '-', 'as', 'the', 'acting', 'was', 'poor', 'and', 'the', 'plot', 'and', 'lines', 'almost', 'non-existent'], ['Very', 'little', 'music', 'or', 'anything', 'to', 'speak', 'of'], ['The', 'best', 'scene', 'in', 'the', 'movie', 'was', 'when', 'Gerardo', 'is', 'trying', 'to', 'find', 'a', 'song', 'that', 'keeps', 'running', 'through', 'his', 'head'], ['The', 'rest', 'of', 'the', 'movie', 'lacks', 'art', 'charm', 'meaning', 'If', "it's", 'about', 'emptiness', 'it', 'works', 'I', 'guess', 'because', "it's", 'empty'], ['Wasted', 'two', 'hours'], ['Saw', 'the', 'movie', 'today', 'and', 'thought', 'it', 'was', 'a', 'good', 'effort', 'good', 'messages', 'for', 'kids'], ['A', 'bit', 'predictable'], ['Loved', 'the', 'casting', 'of', 'Jimmy', 'Buffet', 'as', 'the', 'science', 'teacher']]
mean sentence length: 14.355
[0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
1000 500

We can see that

  • the sentences have been split into words
  • there are 1000 observations
  • half of them are positive

As a first approach we encode each word with its position in the list of known words:

In [162]:
from collections import defaultdict

cnt = defaultdict(int)

for s in sents:
    for w in s:
        cnt[w] += 1
        
# keep only words that occur more than 5 times
voc = [ w for w in cnt if cnt[w] > 5 ]
print('voc len:', len(voc))
# encode each word by its vocabulary index; words not in the vocabulary map to 0
X = [ [ voc.index(w) if w in voc else 0 for w in sent ] for sent in sents ]
print(X[0])
voc len: 334
[0, 1, 1, 1, 0, 0, 2, 3, 4, 0, 0, 0, 5]

Note that the three words 'very' in the first sentence are mapped to the same number.

For the RNN implementation in Keras we need to pad the sequences:

In [163]:
from keras.preprocessing.sequence import pad_sequences

maxlen=20
X = np.asarray(pad_sequences(X, maxlen=maxlen))
print(X[0])
print(X.shape)
[0 0 0 0 0 0 0 0 1 1 1 0 0 2 3 4 0 0 0 5]
(1000, 20)

Let us try and train a simple RNN using Keras.

  • the model is sequential in the sense that there is a stack of layers where each layer has one input and one output vector
  • the SimpleRNN building block allows us to define the length of the hidden state
  • the result from the RNN unit goes into a fully connected final layer called Dense
  • the compile function actually creates executable code for the particular CPU, resulting in improved performance over a pure-Python implementation
  • we can also specify various optimizers and metrics
  • the SimpleRNN expects 3-dimensional input, which is the only reason for the reshape()
  • Keras also takes care of the training/testing split, here called training/validation

The last point reflects the fact that we will probably fiddle quite a bit with the parameters before we are satisfied with the results; that in itself constitutes another training phase. To properly report results we should apply the trained and validated net to a separate test set, and then change the parameters and options no more.
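We do not set up a separate test set in this short example, but a minimal sketch could look like this, assuming scikit-learn is available (the 20% split is an arbitrary choice):

from sklearn.model_selection import train_test_split

# hold out 20% of the data as a final test set, untouched during model development
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ... fit and tune on X_dev / y_dev (with validation_split) ...
# only at the very end:
# model.evaluate(X_test, y_test)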

In [164]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

model = Sequential()
model.add(SimpleRNN(units=64, input_shape=(1,maxlen), activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Xr = np.reshape(X, (X.shape[0], 1, X.shape[1]))
model.fit(Xr, y, validation_split=0.2, epochs=10, batch_size=32)
Model: "sequential_49"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
simple_rnn_34 (SimpleRNN)    (None, 64)                5440      
_________________________________________________________________
dense_44 (Dense)             (None, 1)                 65        
=================================================================
Total params: 5,505
Trainable params: 5,505
Non-trainable params: 0
_________________________________________________________________
None
Train on 800 samples, validate on 200 samples
Epoch 1/10
800/800 [==============================] - 1s 2ms/step - loss: 0.3606 - accuracy: 0.5188 - val_loss: 0.3589 - val_accuracy: 0.4350
Epoch 2/10
800/800 [==============================] - 1s 742us/step - loss: 0.2805 - accuracy: 0.5225 - val_loss: 0.2798 - val_accuracy: 0.4400
Epoch 3/10
800/800 [==============================] - 1s 880us/step - loss: 0.2514 - accuracy: 0.5487 - val_loss: 0.2704 - val_accuracy: 0.5000
Epoch 4/10
800/800 [==============================] - 1s 858us/step - loss: 0.2417 - accuracy: 0.5863 - val_loss: 0.2699 - val_accuracy: 0.5150
Epoch 5/10
800/800 [==============================] - 1s 758us/step - loss: 0.2340 - accuracy: 0.6012 - val_loss: 0.2697 - val_accuracy: 0.4950
Epoch 6/10
800/800 [==============================] - 1s 744us/step - loss: 0.2286 - accuracy: 0.6375 - val_loss: 0.2703 - val_accuracy: 0.4850
Epoch 7/10
800/800 [==============================] - 1s 630us/step - loss: 0.2243 - accuracy: 0.6513 - val_loss: 0.2711 - val_accuracy: 0.4900
Epoch 8/10
800/800 [==============================] - 1s 865us/step - loss: 0.2210 - accuracy: 0.6762 - val_loss: 0.2697 - val_accuracy: 0.4950
Epoch 9/10
800/800 [==============================] - 1s 749us/step - loss: 0.2181 - accuracy: 0.6712 - val_loss: 0.2710 - val_accuracy: 0.4850
Epoch 10/10
800/800 [==============================] - 1s 628us/step - loss: 0.2147 - accuracy: 0.6900 - val_loss: 0.2690 - val_accuracy: 0.4700
Out[164]:
<keras.callbacks.callbacks.History at 0x7f32451fc490>

The validation accuracy is the important part of these reports.

As expected, this net is not learning much. However, Keras provides a number of interesting building blocks that allow us to improve on our performance easily.

Embeddings

Words carry meaning depending on their context; with proper learning procedures this meaning in context can be used to encode each word in a numeric vector that has some useful properties, such as

  • words with similar meanings map to similar numeric vectors
  • the encoding of a sentence can be derived by simply adding the word encodings
  • some relationships are also encoded up to a certain degree, such as capital-of:
    • embedding['France'] - embedding['Paris'] is similar to
    • embedding['Italy'] - embedding['Rome']

Word embeddings need to be trained on a very large text corpus in order for these properties to materialize. Fortunately, a number of pre-trained embeddings are available for download from various sources, such as

  • word2vec
  • GloVe
  • fastText
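For example, pretrained vectors can be loaded with a library such as gensim; a sketch, where the file name is just a placeholder for whichever pretrained model has been downloaded:

from gensim.models import KeyedVectors

# hypothetical path to a downloaded pretrained model, e.g. the GoogleNews word2vec vectors
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Paris - France + Italy should rank 'Rome' highly, illustrating the capital-of relationship
print(kv.most_similar(positive=['Paris', 'Italy'], negative=['France'], topn=3))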

While this is useful in many applications, there is another approach: learning the embedding layer for a specific application as part of the general training process.

Learning an Embedding Layer

Here the embedding vectors are just another set of parameters that are subject to the training phase, exactly like the weights between the layers. Keras provides a building block for this type of embedding:

In [165]:
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(len(voc)+1, 50, input_length=maxlen))
model.add(SimpleRNN(units=64, input_shape=(1,maxlen), activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32)
Model: "sequential_50"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_16 (Embedding)     (None, 20, 50)            16750     
_________________________________________________________________
simple_rnn_35 (SimpleRNN)    (None, 64)                7360      
_________________________________________________________________
dense_45 (Dense)             (None, 1)                 65        
=================================================================
Total params: 24,175
Trainable params: 24,175
Non-trainable params: 0
_________________________________________________________________
None
/home/student/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Train on 800 samples, validate on 200 samples
Epoch 1/10
800/800 [==============================] - 4s 5ms/step - loss: 0.2672 - accuracy: 0.5125 - val_loss: 0.2441 - val_accuracy: 0.5700
Epoch 2/10
800/800 [==============================] - 4s 4ms/step - loss: 0.2515 - accuracy: 0.4863 - val_loss: 0.2533 - val_accuracy: 0.4300
Epoch 3/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2468 - accuracy: 0.5512 - val_loss: 0.2471 - val_accuracy: 0.5750
Epoch 4/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2432 - accuracy: 0.5625 - val_loss: 0.2460 - val_accuracy: 0.5000
Epoch 5/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2385 - accuracy: 0.6125 - val_loss: 0.2533 - val_accuracy: 0.4900
Epoch 6/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2276 - accuracy: 0.6275 - val_loss: 0.2360 - val_accuracy: 0.5850
Epoch 7/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2190 - accuracy: 0.6425 - val_loss: 0.2363 - val_accuracy: 0.5450
Epoch 8/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2106 - accuracy: 0.6637 - val_loss: 0.2424 - val_accuracy: 0.5100
Epoch 9/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2060 - accuracy: 0.6637 - val_loss: 0.2357 - val_accuracy: 0.5850
Epoch 10/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2000 - accuracy: 0.6700 - val_loss: 0.2334 - val_accuracy: 0.6200
Out[165]:
<keras.callbacks.callbacks.History at 0x7f324515f290>

With the learned embeddings even the SimpleRNN performs somewhat better.

LSTM

The long short-term memory architecture has brought some impressive successes to the deep learning approach. Built on the basic RNN, this version of a recurrent network adds a cell state and several gates:

  • a forget gate decides on what to throw away,
  • an input gate decides on what to update, and
  • an output gate decides on what to output

From the current input and the previous hidden state the values for the gates and the new candidate values for the cell state $C$ are computed:

$$ \begin{align} f_t & = \sigma (W_f \cdot [h_{t-1}, x_t] + b_f) \\ i_t & = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i) \\ o_t & = \sigma (W_o \cdot [h_{t-1}, x_t] + b_o) \\ \tilde{C}_t & = \tanh (W_C \cdot [h_{t-1}, x_t] + b_C) \end{align} $$

The new cell state $C_t$ is computed by 'forgetting' part of the previous state $C_{t-1}$ and (based on the current input) adding part of the candidate values $\tilde{C}_t$:

$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$

The new hidden state is based on the cell state and the output gate:

$$ h_t = o_t * \tanh (C_t) $$
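The equations above translate almost directly into code; a minimal sketch of a single LSTM step in plain numpy (the sizes are arbitrary and the weights are random, purely for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_in, n_hid = 3, 4                       # assumed sizes
Wf, Wi, Wo, Wc = (np.random.randn(n_hid, n_hid + n_in) for _ in range(4))
bf = bi = bo = bc = np.zeros(n_hid)

h_prev, C_prev = np.zeros(n_hid), np.zeros(n_hid)
x_t = np.random.randn(n_in)
hx = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]

f_t = sigmoid(Wf @ hx + bf)              # forget gate
i_t = sigmoid(Wi @ hx + bi)              # input gate
o_t = sigmoid(Wo @ hx + bo)              # output gate
C_tilde = np.tanh(Wc @ hx + bc)          # candidate cell state
C_t = f_t * C_prev + i_t * C_tilde       # new cell state
h_t = o_t * np.tanh(C_t)                 # new hidden state
print(h_t)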

We should expect improvements in our sentiment detection for the LSTM:

In [167]:
from keras.layers import Embedding, LSTM

model = Sequential()
model.add(Embedding(len(voc)+1, 50, input_length=maxlen))
model.add(LSTM(64, dropout=0.2, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=64)
Model: "sequential_52"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_18 (Embedding)     (None, 20, 50)            16750     
_________________________________________________________________
lstm_14 (LSTM)               (None, 64)                29440     
_________________________________________________________________
dense_46 (Dense)             (None, 1)                 65        
=================================================================
Total params: 46,255
Trainable params: 46,255
Non-trainable params: 0
_________________________________________________________________
None
/home/student/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Train on 800 samples, validate on 200 samples
Epoch 1/10
800/800 [==============================] - 5s 6ms/step - loss: 0.2504 - accuracy: 0.4963 - val_loss: 0.2519 - val_accuracy: 0.4300
Epoch 2/10
800/800 [==============================] - 4s 5ms/step - loss: 0.2494 - accuracy: 0.5175 - val_loss: 0.2536 - val_accuracy: 0.4300
Epoch 3/10
800/800 [==============================] - 4s 4ms/step - loss: 0.2503 - accuracy: 0.4787 - val_loss: 0.2503 - val_accuracy: 0.4550
Epoch 4/10
800/800 [==============================] - 4s 5ms/step - loss: 0.2477 - accuracy: 0.5188 - val_loss: 0.2574 - val_accuracy: 0.4300
Epoch 5/10
800/800 [==============================] - 4s 5ms/step - loss: 0.2464 - accuracy: 0.5763 - val_loss: 0.2481 - val_accuracy: 0.5100
Epoch 6/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2419 - accuracy: 0.6463 - val_loss: 0.2467 - val_accuracy: 0.5200
Epoch 7/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2376 - accuracy: 0.6263 - val_loss: 0.2430 - val_accuracy: 0.5650
Epoch 8/10
800/800 [==============================] - 3s 4ms/step - loss: 0.2308 - accuracy: 0.6575 - val_loss: 0.2363 - val_accuracy: 0.6150
Epoch 9/10
800/800 [==============================] - 4s 5ms/step - loss: 0.2233 - accuracy: 0.7050 - val_loss: 0.2378 - val_accuracy: 0.5650
Epoch 10/10
800/800 [==============================] - 4s 5ms/step - loss: 0.2143 - accuracy: 0.6725 - val_loss: 0.2263 - val_accuracy: 0.6500
Out[167]:
<keras.callbacks.callbacks.History at 0x7f3243bb6d50>

And indeed the LSTM performs somewhat better than the SimpleRNN.

EXERCISES:

  • Find more datasets for sentiment detection, or any other classification problem
  • Apply the code to the new dataset
  • Modify the code a little, e.g.
    • Add more layers
    • Change the parameters
  • Observe and document the results