A Tiny Large Language Model¶

We train a very basic neural network to predict the next word in The Tale of Peter Rabbit.

https://gutenberg.org/ebooks/14838

In [381]:
text = open('pg14838.txt').read()
text = text[ text.find('Once upon a time') : text.find('THE END') ]
text = text.replace('[Illustration]', '')
print(text[:434])
Once upon a time there were four little Rabbits, and their names
were--

          Flopsy,
       Mopsy,
   Cotton-tail,
and Peter.

They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.

'Now my dears,' said old Mrs. Rabbit one morning, 'you may go into
the fields or down the lane, but don't go into Mr. McGregor's garden:
your Father had an accident there; he was put in a pie by Mrs.
McGregor.'

Tokenisation¶

We keep things very simple and let the regex \w+ define words as runs of word characters. We convert everything to lower case to reduce the vocabulary size.

In [382]:
import re

toks = re.findall(r'\w+', text.lower())
print(len(toks))
print(toks[:28])
975
['once', 'upon', 'a', 'time', 'there', 'were', 'four', 'little', 'rabbits', 'and', 'their', 'names', 'were', 'flopsy', 'mopsy', 'cotton', 'tail', 'and', 'peter', 'they', 'lived', 'with', 'their', 'mother', 'in', 'a', 'sand', 'bank']

The number of unique words is high compared to the total number of tokens, which is to be expected for a short story.

In [383]:
voc = list(set(toks))
print(len(voc))
print(voc[:10])
384
['some', 'hide', 'away', 'frame', 'quite', 'a', 'they', 'plants', 'idea', 'still']
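
To put this into a single number: the type-token ratio, i.e. unique words divided by total tokens, comes out at roughly 0.4 here (a quick extra check, not one of the numbered cells above):

In [ ]:
# type-token ratio: unique words / total tokens, roughly 0.4 for this text
print(len(voc) / len(toks))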

The sklearn implementation of the MLP (multi-layer perceptron, i.e. an artificial neural network) expects integer class labels as training outputs. We use each word's index in the vocabulary.

In [384]:
y = [ voc.index(t) for t in toks ]

print(toks[:9])
print(y[:9])
['once', 'upon', 'a', 'time', 'there', 'were', 'four', 'little', 'rabbits']
[28, 234, 5, 278, 206, 265, 260, 243, 228]

Word Encodings¶

We use one-hot vectors to encode words: 1 at the position of the current word in the vocabulary, 0 otherwise.

In [385]:
import numpy as np

def txtenc(lst):
    # one-hot encode each word in lst and concatenate into a single flat vector
    enc = [ [ int(x == w) for w in voc ] for x in lst ]
    return np.array([x for xs in enc for x in xs])

Using a very small vocabulary we can see how this works:

In [386]:
smallvoc = ['fields', 'lane', 'into', 'down', 'the']
sents = [ ['into', 'the', 'fields'], ['down', 'the', 'lane'] ]
[ [ [ int(x == w) for w in smallvoc ] for x in s ] for s in sents ]
Out[386]:
[[[0, 0, 1, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0]],
 [[0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0]]]
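
For the network we do not keep this nested structure: txtenc concatenates the one-hot vectors of a window into one long flat vector. With the small vocabulary above, a three-word window becomes a vector of length 3 * 5 = 15 (a minimal illustration, not one of the numbered cells):

In [ ]:
import numpy as np

# flatten the nested one-hot lists into one long vector, as txtenc does
enc = [ [ int(x == w) for w in smallvoc ] for x in sents[0] ]
flat = np.array([x for xs in enc for x in xs])
print(flat)        # [0 0 1 0 0 0 0 0 0 1 1 0 0 0 0]
print(flat.shape)  # (15,)

The same flattening explains the 5 * 384 = 1920 input columns we will see in the training data below.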

Training Data Format¶

Now we are ready to prepare our training data:

  • the encoding for a window of words
  • the next word, as index in the vocabulary

The window size is the number of preceding words. For us this will also be the size of the prompt.

In [387]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')

X = []
y = []
w = 5

for i in range(len(toks)-w-1):
    X += [ txtenc(toks[i:i+w]) ]
    y += [ voc.index(toks[i+w]) ]
X = np.asarray(X)
y = np.asarray(y)
print(X.shape)
print(y.shape)
(969, 1920)
(969,)
In [388]:
print(toks[:5])
print([ voc[y[i]] for i in range(4) ])
['once', 'upon', 'a', 'time', 'there']
['were', 'four', 'little', 'rabbits']

Artificial Neural Nets¶

Here is a very basic net with just one layer of weights. Depending on the input, the output is either greater than or less than some threshold, such as zero.

In [389]:
inp = [ [ 1, 0, 1 ], [ 0, 1, 1 ] ]
wei = [ 0.2, 0.5, -0.4 ]

print(np.dot(inp, wei))

print([ int(np.dot(i, wei) > 0) for i in inp ])
[-0.2  0.1]
[0, 1]

By choosing the proper weights, we can automatically sort observations into class 0 or 1, as in the example above.

We can also understand this as predicting one of two possible next words, depending on the previous words.
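
For instance, with the small vocabulary from above we can pick weights by hand that separate the contexts 'into the' and 'down the' and thereby choose between the continuations 'fields' and 'lane'. This is a minimal hand-crafted sketch (ctxs, wei, and nxt are made up for this illustration; nothing here is learned):

In [ ]:
import numpy as np

# encode the two contexts with the small vocabulary, as concatenated one-hot vectors
ctxs = [ ['into', 'the'], ['down', 'the'] ]
enc = [ [ int(x == w) for x in c for w in smallvoc ] for c in ctxs ]

wei = [ 0, 0, -1, 1, 0,   0, 0, 0, 0, 0 ]   # weights picked by eye: -1 on 'into', +1 on 'down'
nxt = [ 'fields', 'lane' ]                  # class 0 -> 'fields', class 1 -> 'lane'

for c, e in zip(ctxs, enc):
    print(' '.join(c), '-->', nxt[ int(np.dot(e, wei) > 0) ])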

Training¶

The package sklearn provides some convenient functions, such as the train/test split: with shuffling disabled, the first 80% of the windows are used for training, so we will only ask questions about that part of the story.

In [390]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print('training size:', X_train.shape[0], 
      'testing size:', X_test.shape[0]) 
training size: 775 testing size: 194

We choose a very basic structure for our network, just one hidden layer with a modest number of units.

In [391]:
import time

h = 25
t0 = time.time()
model = MLPClassifier(random_state=13, max_iter=200,
                    hidden_layer_sizes=(h,)).fit(X_train, y_train)
t1 = time.time()
print('score train:', model.score(X_train, y_train))
print('score test: ', model.score(X_test, y_test))
print('training time:', t1-t0, 'seconds')
score train: 1.0
score test:  0.07216494845360824
training time: 38.905585289001465 seconds

On the training data the network can predict the next word perfectly.

The test score already reveals extremely poor performance on unseen data.
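
This score is just classification accuracy, i.e. the fraction of test windows whose next word is predicted exactly; the accuracy_score imported above reproduces it from the raw predictions (a quick cross-check, not one of the numbered cells):

In [ ]:
# fraction of test windows whose next word is predicted exactly;
# this reproduces model.score(X_test, y_test)
print(accuracy_score(y_test, model.predict(X_test)))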

Trainable parameters:

  • In a fully connected network, each unit in a layer is connected to each unit in the next layer;

  • with one hidden layer of size h, input size i, and output size o,

  • we have (i * h + h) + (h * o + o) trainable parameters (weights and bias values).

In [392]:
i = X.shape[1]
o = len(set(y_train))
print('i:', i)
print('h:', h)
print('o:', o)

print('params:', sum(c.size for c in model.coefs_) + sum(b.size for b in model.intercepts_))
print('.. first layer: i * h + h = ', i * h + h)
print('.. second layer: h * o + o = ', h * o + o  )
print('.. sum:', i * h + h + h * o + o)
i: 1920
h: 25
o: 330
params: 56605
.. first layer: i * h + h =  48025
.. second layer: h * o + o =  8580
.. sum: 56605

Prompting¶

In the training part of the text the next word prediction works perfectly, as expected:

In [393]:
def prompt(txt):
    # encode the prompt, reshape into a single sample, and predict the next word
    enc = txtenc(txt.split())
    inp = enc.reshape(1, enc.shape[0])
    j = model.predict(inp)
    print(txt, '-->', voc[int(j[0])])

prompt('once upon a time there')
prompt('you may go into the')
prompt('he was put in a')
once upon a time there --> were
you may go into the --> fields
he was put in a --> pie

Outside the training data the performance is bad:

In [394]:
prompt('and they lived underneath the')
prompt('so they lived underneath the')
and they lived underneath the --> gate
so they lived underneath the --> tool

Nevertheless, depending on the prompt we can still get correct answers:

In [395]:
prompt('the rabbits lived with their')
prompt('the dears lived with their')
the rabbits lived with their --> mother
the dears lived with their --> mother

As was to be expected with such a basic implementation, free-form questions are hopeless:

In [396]:
prompt('where do the rabbits live')
prompt('how many rabbits are there')
where do the rabbits live --> peter
how many rabbits are there --> was