# The Transformer

This approach to connectionist sequence processing is quite new and not yet
clearly defined, but several concepts emerge as essential characteristics:

- feed-forward
- self-attention
- positional encoding

The feed-forward architecture processes all positions of a sequence in parallel
rather than one after another. This results in large speed-ups compared to recurrent
designs and therefore the ability to tackle much larger problems with given
constraints on computing resources.

Recurrent architectures deal with position naturally by processing elements sequentially,
and the more advanced approaches like LSTM learn how to recognize important
elements in the input even if they occur early in the sequence.
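
To make the point about sequential processing concrete, here is a minimal sketch of a plain tanh recurrence in NumPy. The weight matrices `W_x` and `W_h`, the sizes, and the random inputs are illustrative assumptions, not part of any trained model. Each hidden state depends on the previous one, so the loop cannot be run in parallel across positions, and reversing the input changes the result.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 4, 3, 5

# hypothetical random weights; a real network would learn these
W_x = rng.normal(size=(d_in, d_hidden))
W_h = rng.normal(size=(d_hidden, d_hidden))

def rnn_final_state(xs):
    """Run a plain tanh recurrence over the sequence and return the last hidden state."""
    h = np.zeros(d_hidden)
    for x in xs:  # strictly one element after the other
        h = np.tanh(x @ W_x + h @ W_h)
    return h

xs = rng.normal(size=(seq_len, d_in))
h_forward = rnn_final_state(xs)
h_reversed = rnn_final_state(xs[::-1])

# an order-free summary such as the sum is the same for both orderings ...
print(np.allclose(xs.sum(axis=0), xs[::-1].sum(axis=0)))
# ... whereas the recurrent state depends on the order
print(np.allclose(h_forward, h_reversed))
```

A purely feed-forward model that only sums or averages its inputs would behave like the first check, which is why transformers need an explicit position signal.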

Transformers combine positional encoding to retain sequence order
with attention to recognize important elements.
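
The text does not say which positional encoding is meant. As one possibility, the following sketch implements the sinusoidal scheme from the original Transformer paper (Vaswani et al. 2017). The function name `sinusoidal_positions` and the sizes are assumptions chosen for illustration, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]     # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=50)
print(pe.shape)
print(pe[0, :4])
```

Each row is a distinct pattern for one position. In the original scheme this matrix is added element-wise to a matrix of word embeddings of the same shape before the first attention layer.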

A sequence in this context is usually understood as a sequence of words; however, the 
approach is applicable to other ordered elements such as pixels in an image.

Some successful applications of transformer architectures are:

- BERT (Bidirectional Encoder Representations from Transformers)

  - general-purpose NLP tool: text summarization, question answering, classification etc.
  
  - pre-trained transformer model for fine-tuning on specific NLP tasks

  - trained by Google on a massive text corpus of about 3.3 billion words; the
    downloadable BERT-Large model has 340M parameters
  
  - some trained models can be downloaded at https://github.com/google-research/bert
  
  - Devlin et al. 2019 https://arxiv.org/abs/1810.04805
    
- GPT-3 (Generative Pre-Trained Transformer)

  - can generate astonishingly realistic text
  
  - 175 billion parameters, 499 billion tokens

  - originally developed by OpenAI, now exclusively licensed to Microsoft
  
  - Brown et al. 2020 https://arxiv.org/abs/2005.14165
  
- Meena
  
  - multi-turn open-domain chatbot by Google
  
  - trained on huge corpus of social media conversations for low perplexity 
    (reasonable continuation): 2.6 billion parameters, 40 billion words (341 GB corpus)

  - Adiwardana et al. 2020 https://arxiv.org/pdf/2001.09977
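
Perplexity, mentioned for Meena above, is the exponential of the average negative log-probability that a model assigns to the tokens that actually occur. Lower values mean the model found the text less surprising. The sketch below uses made-up probabilities, not figures from any of the models listed; the function name `perplexity` is likewise just an illustrative choice.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability of the observed tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))

# hypothetical probabilities a model assigned to each token that actually came next
confident = [0.5, 0.5, 0.5, 0.5]
unsure = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident))  # close to 2
print(perplexity(unsure))     # close to 10
```

A model that always spreads its belief evenly over k candidates has perplexity k, so the value can be read as an effective number of choices per token.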
  
And of course ChatGPT, and many others. Generative AI is moving fast, and all the big
tech companies are trying to get a piece of the action.

The approaches are complex and require substantial computing resources even to run the
trained models. However, there is a somewhat older model available for download that
still illustrates the idea quite well.

## Trained GPT Model Download and Use

GPT-2 is available for use in Python applications;
the following example uses the `transformers` library from Hugging Face.

    pip install transformers

Note that the GPT-2 model is a big download (~800MB).

In [1]:
from transformers import pipeline

# this may take a while..
gpt2_generator = pipeline('text-generation', model='gpt2')

Now we can generate text:

In [2]:
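def pred(start):
    # sample one continuation of the prompt, choosing among the 50 most likely
    # tokens at each step; max_length counts the prompt tokens as well
    txt = gpt2_generator(start, do_sample=True, top_k=50, temperature=0.6,
                         max_length=128, num_return_sequences=1)
    for x in txt:
        print(x["generated_text"])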
    
pred("Feed-forward neural networks are simple and powerful. They can")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Feed-forward neural networks are simple and powerful. They can be built on top of existing neural networks and can be used to create new neural networks. They can be used to train a network on a machine learning model, and they can be used to create new neural networks.

The most recent version of this paper (2010) used the SVM-based neural network model to build a neural network, which is a version of a neural network that has been built on top of a neural network. However, there are some problems with the SVM-based neural network model.

The first problem is that the model can be


The result is somewhat disappointing; let's try again:

In [3]:
pred("The first Bond movie Dr. No was filmed in")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The first Bond movie Dr. No was filmed in the 1970s. It was a movie about a young woman who has never been able to get her hair cut, and who is never going to have the chance to be a doctor.

Advertisement

There's an interesting story about the story of a doctor who has been shot in the shoulder, and is now working at a hospital. The doctor is a young woman who has been in a bad relationship with her husband. He is a man who is the father of his child, and has been in a bad relationship with his wife for years. He is a man who is not


We cannot expect any miracles here: the text is fluent but not factually reliable
(Dr. No was filmed in 1962, not in the 1970s). The model was trained on a huge text
corpus and can make suitable guesses in many cases. Its power lies in providing a
pre-trained module that can be used in other applications.
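
The `top_k` and `temperature` arguments passed to the generator in `pred` control how each next token is drawn. The following is a conceptual sketch of top-k sampling with temperature in NumPy; it is not the code the `transformers` library runs, and the function name `sample_top_k` and the toy logits are assumptions made for illustration.

```python
import numpy as np

def sample_top_k(logits, top_k=50, temperature=0.6, rng=None):
    """Sample one token index from the top_k highest logits after temperature scaling."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    k = min(top_k, logits.size)
    top_idx = np.argsort(logits)[-k:]        # indices of the k largest logits
    scaled = logits[top_idx] / temperature   # temperature < 1 sharpens the distribution
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))

# hypothetical scores for a 6-token vocabulary
logits = [2.0, 1.0, 0.5, 0.1, -1.0, -3.0]
rng = np.random.default_rng(0)
samples = [sample_top_k(logits, top_k=3, temperature=0.6, rng=rng) for _ in range(10)]
print(samples)
```

With `top_k=3` only the three highest-scoring tokens can ever be chosen. Lowering the temperature pushes the choice toward the single most likely token, while raising it makes the output more varied.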

## Attention



In [8]:
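import gzip
import numpy as np
np.set_printoptions(suppress=True, threshold=20, precision=3)

# load the GloVe word vectors: each line holds a word followed by its 50 components
glove = {}
fn = 'glove.40k.6B.50d.txt.gz'
with gzip.open(fn) as f:
    for line in f:
        lst = line.decode('utf8').split()
        glove[lst[0]] = np.asarray(lst[1:], dtype='float32')
print(len(glove), glove['italy'].shape, glove['italy'][:3])

# embed a sample sentence: one 50-dimensional row per word
embs = np.array([glove[w] for w in 'we cannot expect any miracles here'.split()])
print(embs)

# preview of the kind of random integer matrix used as weights in the next cell
np.random.randint(3, size=(3, 3))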

40000 (50,) [ 1.77  -0.778 -0.953]
[[ 0.574 -0.327  0.071 ...  0.488 -0.184  0.699]
 [ 0.598 -0.435  0.502 ...  0.332 -0.027  0.069]
 [ 0.319 -0.212  0.607 ...  1.145 -0.246  0.957]
 [ 0.513  0.09   0.024 ...  1.158  0.298  0.075]
 [ 0.975  0.532 -0.817 ...  0.209 -0.448 -0.121]
 [ 0.141  0.682 -0.504 ...  0.111  0.11  -0.271]]


array([[2, 1, 2],
       [1, 1, 1],
       [0, 0, 0]])

In [11]:
from scipy.special import softmax

d = embs.shape[1]
np.random.seed(42)

# weights (random integers as stand-ins for learned parameters)
W_Q = np.random.randint(3, size=(d, d))
W_K = np.random.randint(3, size=(d, d))
W_V = np.random.randint(3, size=(d, d))

# queries, keys and values
Q = embs @ W_Q
K = embs @ W_K
V = embs @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)

[[ 0.377  9.675 -1.402 ...  6.455 -0.64   1.905]
 [ 0.377  9.675 -1.402 ...  6.455 -0.64   1.905]
 [ 0.355  9.67  -1.42  ...  6.436 -0.668  1.897]
 [ 0.377  9.675 -1.402 ...  6.455 -0.64   1.905]
 [-6.305  5.354 -3.247 ...  2.549 -6.573 -3.765]
 [ 0.334  9.664 -1.436 ...  6.418 -0.695  1.889]]
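
The cell above follows the scaled dot-product attention of the original Transformer paper. As a compact summary, with $X$ standing for the embedding matrix `embs` and $d_k$ for the key dimension `K.shape[1]` (the symbol names are a notational assumption, not taken from the code):

```latex
\begin{aligned}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
```

The softmax is taken row by row, so each row of weights sums to 1 and each output row is a weighted average of the value vectors. The near-identical rows in the output above are consistent with a softmax that puts almost all of its weight on a single key, which can happen when untrained random weights produce large scores.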


In [16]:
# !pip install transformers==4.22.2
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

inp = ['Does money buy happiness?', 'Who is James Bond?',  'You are no fun.']
for i in range(len(inp)):
    # encode the new user input, append the eos_token and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(inp[i] + tokenizer.eos_token, 
                                          return_tensors='pt')

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, 
                               new_user_input_ids], 
                              dim=-1) if i > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, 
                                      pad_token_id=tokenizer.eos_token_id)

    # pretty-print the last output tokens from the bot
    print("DialoGPT: {}".format(
        tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], 
                                                 skip_special_tokens=True)))

DialoGPT: Money buys happiness.
DialoGPT: James Bond is a man.
DialoGPT: I'm no fun.
