As an example of one of the many advanced neural net architectures we will now look at the recurrent network. In this approach the current input signal has a sequence or time index, e.g. the current word in a text such as a short product review whose sentiment is to be determined.
The basic recurrent net uses a single hidden state $h$ as a sort of memory for what it has already seen in the previous inputs; this hidden state together with the current input $x_t$ determines the next hidden state and the current output $o_t$:
$$ \begin{align} h_t & = f(U h_{t-1} + V x_t ) \\ o_t & = g(W h_t) \end{align} $$

The activation functions $f$ and $g$ can be any combination of sigmoid, tanh, or other non-linear functions; $g$ may also be omitted. The weight matrices $U, V, W$ are the parameters to be trained, e.g. with gradient descent.
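As a concrete illustration, here is a minimal numpy sketch of a single recurrent step; the dimensions and the choice of tanh and sigmoid are assumptions made for the example, not part of the formulas above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# assumed sizes for the sketch: 3-dimensional input, 4-dimensional hidden state
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 4))   # hidden-to-hidden weights
V = rng.normal(size=(4, 3))   # input-to-hidden weights
W = rng.normal(size=(1, 4))   # hidden-to-output weights

def rnn_step(h_prev, x_t):
    h_t = np.tanh(U @ h_prev + V @ x_t)   # f = tanh
    o_t = sigmoid(W @ h_t)                # g = sigmoid
    return h_t, o_t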
The output of a single layer is often written as
$$ o = f(w \cdot x + b) $$

where the bias $b$ is another parameter to be trained, e.g. with gradient descent. Instead of carrying this additional parameter around we can omit it and add a constant 1 as another element of the input vector $x$; the bias $b$ then becomes just another element of the weight vector $w$. In many applications we can even omit the additional constant in the input without any noticeable effect on the network performance.
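A quick numpy check of the bias trick (the numbers are arbitrary):

import numpy as np
w = np.array([0.5, -1.2]); b = 0.3
x = np.array([2.0, 1.0])
w_aug = np.append(w, b)           # bias folded into the weight vector
x_aug = np.append(x, 1.0)         # constant 1 appended to the input
print(w @ x + b, w_aug @ x_aug)   # both print the same value: 0.1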
This architecture raises the question of how we can find derivatives and train the weights when $o_t$ depends not only on the current input but also on all previous inputs $x_t, x_{t-1}, x_{t-2}, \ldots$
The concept of unrolling provides the answer, at least conceptually:
In the application phase the inputs and hidden states change inside the RNN while the weights stay the same. This provides the clue for how to train the network: unroll it to a specified number of time steps, compute the derivatives, and train e.g. with gradient descent. The actual implementation may differ, but the end result is the same: we can train recurrent neural networks in much the same fashion as feedforward networks.
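A minimal sketch of the unrolling idea, continuing the rnn_step example from above: the same weights $U, V, W$ are applied at every time step, so the unrolled network is simply a deep feedforward network with shared weights.

# process a sequence of T inputs with the same weights at every step
T = 5
xs = [ rng.normal(size=3) for _ in range(T) ]   # an assumed toy input sequence
h = np.zeros(4)                                 # initial hidden state
for x_t in xs:
    h, o = rnn_step(h, x_t)   # identical U, V, W reused at each step
print(o)                      # output after the final input token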
With the advent of powerful and affordable graphics cards that can act as general-purpose GPUs even on entry-level desktop computers, recent years have seen a renewed interest in connectionist approaches to machine learning; the term deep learning is commonly used when more than two layers are involved.
However, there is more to the deep learning movement that revived the somewhat dormant field of connectionist learning, a field that had already started in the 1980s but lost some of its momentum due to a lack of both training data and computing power.
Both are now abundant, at least compared to the situation 30 years ago. In addition, there have also been some decisive advances in the theory of connectionist machine learning, among them the following:
It was found that in order to train a neural net it is not necessary to go through all observations before another weight update; even a small sample usually provides enough data for the gradient. This stochastic gradient descent results in vastly improved speed on large datasets.
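A minimal sketch of the idea for a simple least-squares model; the data, learning rate, and batch size are arbitrary choices for the example:

import numpy as np
rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 3))              # toy inputs
y = X @ np.array([1.0, -2.0, 0.5])           # toy targets
w = np.zeros(3)
for step in range(1000):
    idx = rng.integers(0, len(X), size=32)   # small random batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient on the batch only
    w -= 0.01 * grad                         # one weight update per batch
print(w)   # approaches the true weights without a full pass per update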
Training recurrent architectures used to be extremely costly in terms of computing power, and the basic recurrent neural net suffers from major drawbacks, most of all 'forgetting' earlier input tokens that are decisive for the final output. Newer architectures such as the LSTM improve on the basic model and result in much better performance even on long input sequences, as they 'learn to forget' the unimportant input tokens.
The problem of generalization of learned weights to unseen data has been tackled with the idea of dropout, i.e. setting a certain fraction of the neuron activations to zero during training, thereby forcing the net to encode information in a more 'general' fashion that applies more easily to unseen data, again resulting in significant improvements.
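The mechanism itself is simple; here is a minimal numpy sketch of a dropout mask applied during training (the rate of 0.2 is an arbitrary choice):

import numpy as np
rng = np.random.default_rng(2)
rate = 0.2
a = rng.normal(size=8)               # some layer activations
mask = rng.random(8) >= rate         # drop each unit with probability 0.2
a_train = a * mask / (1.0 - rate)    # rescale so expected values match
print(a_train)                       # at prediction time, a is used unchanged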
All these advances together reignited interest in the connectionist approach in the first decade of the new millennium, with sometimes spectacular results, especially in image processing.
With more complex network architectures both the derivatives and the implementation become a little too involved for us to study and code in detail. Here we will be content to use a package that does all the footwork for us: in this case, Keras.
There are many options for implementing neural nets, and many packages to help us with this task. Keras is a popular choice. Like all frameworks, it has both advantages and drawbacks.
If we want to study new types of architectures, another approach is more appropriate, such as using Theano or TensorFlow directly. In our case this will not be necessary.
A very common task in machine learning is the automatic detection of sentiment in reasonably short texts, such as product reviews or social media postings. This task lends itself well to recurrent nets since the number of input words is variable; the net has to learn which words are relevant to the sentiment, and which words to ignore. Recurrent nets are surprisingly good at this type of task.
Our dataset can be found on the UCI Machine Learning repository:
https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
Download the ZIP file, extract the file imdb_labelled.txt, and upload it to the notebook server, or put it into your current directory.
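If you prefer to fetch the file programmatically, something along the following lines should work; the exact archive URL and the path inside the ZIP are assumptions based on the repository layout, so verify them on the dataset page if the download fails:

import urllib.request, zipfile
# assumed URL; check the dataset page linked above if this fails
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/00331/'
       'sentiment%20labelled%20sentences.zip')
urllib.request.urlretrieve(url, 'sentiment.zip')
with zipfile.ZipFile('sentiment.zip') as zf:
    # assumed path inside the archive
    with open('imdb_labelled.txt', 'wb') as out:
        out.write(zf.read('sentiment labelled sentences/imdb_labelled.txt'))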
Now we are ready to import our data:
import numpy as np
# each line holds a sentence followed by its label 0 or 1;
# split() separates the tokens, with the label as the last one
data = [ line.split() for line in open('imdb_labelled.txt').readlines() ]
# strip '.' and ',' from the ends of each word, drop the label token
sents = [ [ w.strip('.,') for w in line[:-1] ] for line in data ]
y = np.asarray([ int(line[-1]) for line in data ])   # the sentiment labels
print(sents[:10])
print('mean sentence length:', np.mean([len(s) for s in sents]))
print(y[:10])
print(len(y), sum(y))
We can see that the two classes are balanced: of the 1000 sentences, exactly half are labelled positive. The mean sentence length will guide our choice of the padding length maxlen below.
As a first approach we encode each word with its position in the list of known words:
from collections import defaultdict
cnt = defaultdict(int)
for s in sents:
for w in s:
cnt[w] += 1
# keep only words that occur more than five times
voc = [ w for w in cnt if cnt[w] > 5 ]
print('voc len:', len(voc))
# map known words to index+1; 0 is reserved for unknown words (and padding)
X = [ [ voc.index(w)+1 if w in voc else 0 for w in sent ] for sent in sents ]
print(X[0])
Note that the three occurrences of the word 'very' in the first sentence are mapped to the same number, and that words not in the vocabulary are mapped to 0.
For the RNN implementation in Keras we need to pad the sequences:
from keras.preprocessing.sequence import pad_sequences
maxlen = 20   # pad/truncate every sequence to 20 tokens; the pad value is 0
X = np.asarray(pad_sequences(X, maxlen=maxlen))
print(X[0])
print(X.shape)
Let us try and train a simple RNN using Keras.
Since we will probably fiddle quite a bit with the parameters before we are satisfied with the results, that tuning is in itself another training phase; to properly report results we should apply the trained and validated net to a separate test set, and then change the parameters and options no more.
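One way to set this up, sketched here with scikit-learn's train_test_split (holding out 20% as a final test set is an arbitrary choice):

from sklearn.model_selection import train_test_split
# hold back a final test set that is never touched during tuning
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

For brevity, the experiments below keep using the full dataset with a validation split; for a final report the held-out X_test, y_test would be used exactly once.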
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense
model = Sequential()
# naive first attempt: the whole index vector is fed as ONE time step of 20 features
model.add(SimpleRNN(units=64, input_shape=(1,maxlen), activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
# reshape to (samples, time steps, features) = (n, 1, maxlen) to match the input shape
Xr = np.reshape(X, (X.shape[0], 1, X.shape[1]))
model.fit(Xr, y, validation_split=0.2, epochs=10, batch_size=32)
The validation accuracy is the important number in these reports.
As expected, this net is not learning much. However, Keras provides a number of interesting building blocks that allow us to improve performance easily.
Words carry meaning depending on their context; with proper learning procedures this meaning in context can be used to encode each word in a numeric vector that has some useful properties: words with similar meanings end up with similar vectors, and some semantic relations correspond to vector arithmetic, as in the famous example 'king' - 'man' + 'woman' ≈ 'queen'.
Word embeddings need to be trained on a very large text corpus in order for these properties to materialize. Fortunately, a number of trained embeddings are available for download from various sources, such as the word2vec vectors trained on Google News and the GloVe vectors from the Stanford NLP group.
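As an aside, loading pretrained vectors is straightforward with the gensim package; the model name below is one of the options offered by the gensim downloader (this sketch assumes gensim is installed and downloads the data on first use):

import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-50')   # 50-dimensional GloVe vectors
print(wv.most_similar('good', topn=3))    # words with the most similar vectors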
While this is useful in many applications, there is another approach: learning the embedding layer for a specific application as part of the general training process.
Here the embedding vectors are just another set of parameters that are subject to the training phase, exactly like the weights between the layers. Keras provides a building block for this type of embedding:
from keras.layers import Embedding
model = Sequential()
# one trainable 50-dimensional vector per index; len(voc)+1 covers index 0
model.add(Embedding(len(voc)+1, 50, input_length=maxlen))
# the Embedding layer already defines the input shape for the RNN
model.add(SimpleRNN(units=64, activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32)
With the learned embeddings even the SimpleRNN performs somewhat better.
The long short-term memory (LSTM) architecture has brought some impressive successes to the deep learning approach. Built on the basic RNN, this version of a recurrent network adds a cell state and several gates:
From the current input and the previous hidden state the values for the gates and the new candidate values for the cell state $C$ are computed:
$$ \begin{align} f_t & = \sigma ~ (W_f \cdot [h_{t-1}, x_t] + b_f) \\ i_t & = \sigma ~ (W_i \cdot [h_{t-1}, x_t] + b_i) \\ o_t & = \sigma ~ (W_o \cdot [h_{t-1}, x_t] + b_o) \\ \tilde{C}_t & = \tanh (W_C \cdot [h_{t-1}, x_t] + b_C) \end{align} $$

The new cell state $C_t$ is computed by 'forgetting' part of the previous state $C_{t-1}$ and (based on the current input) adding part of the candidate values $\tilde{C}_t$:
$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$

The new hidden state is based on the cell state, scaled by the output gate $o_t$:
$$ h_t = o_t * \tanh (C_t) $$
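These equations translate directly into code; here is a minimal numpy sketch of a single LSTM step, with element-wise products written as *, exactly as above (the weight matrices, biases, and sizes are assumed to be supplied by the caller):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_C, b_C):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # forget part of old state, add new
    h_t = o_t * np.tanh(C_t)              # new hidden state
    return h_t, C_t

We should expect improvements in our sentiment detection from the LSTM: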
from keras.layers import Embedding, LSTM
model = Sequential()
model.add(Embedding(len(voc)+1, 50, input_length=maxlen))
# dropout=0.2 applies the dropout idea from above to the layer inputs
model.add(LSTM(64, dropout=0.2, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=64)
And indeed the LSTM performs somewhat better than the SimpleRNN.
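To round things off, a minimal sketch of applying the trained model to a new sentence (the example sentence is made up; it must be encoded with the same vocabulary mapping and padding as the training data):

new_sent = "a wonderful movie with great acting".split()
x_new = [ voc.index(w)+1 if w in voc else 0 for w in new_sent ]
x_new = pad_sequences([x_new], maxlen=maxlen)
print(model.predict(x_new))   # values close to 1 suggest positive sentiment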
EXERCISES: