How GenAI works & hot news on Gemma

This week was a very hot one: Google published Gemma, a small-size LLM, and Nvidia had a rush on the stock market, with a valuation of around 2,000 BILLION dollars. Last but not least, a service for a more advanced model, which can also run on-prem, was published.

So we felt compelled to prepare this small “interlude” article to keep you informed, following the blogosphere while reducing the hype.

In this article we will describe, at a high level, how GenAI works, thanks to this very deep article: I suggest you read it if you want more details.

A Generative Pre-trained Transformer (GPT) is a very complex system, in which only the last step involves a deep neural network.

The basic idea is that the GPT will be able to predict the MOST likely next term following a prompt. You can then feed the whole input+output back in and get a new output, and so on. This system has an “attention” window, which determines how much input will be taken into consideration to produce the output; for GPT-2 it is 1024 tokens.
It means only the last 1024 tokens will be taken into consideration, and this is the major limit of these models: increasing the attention window increases their ability to answer and follow our “instructions”, but also increases the model size!

The steps are quite complex, so let's try to sketch them a bit.

  1. Tokenizer
  2. Embeddings – The magic art to map narrow values on wider vectorial space
  3. Attention calculation
  4. Neural network feedforward

Let's see them one by one.
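The four steps above can be sketched as one loop. This is pseudocode, not a real implementation: every function name here is a placeholder for the machinery described in the following sections.

```python
# High-level sketch of a GPT-style generation loop (placeholder functions,
# not real GPT-2 APIs): each line matches a step in the list above.
def gpt_sketch(prompt_text, n_new_tokens):
    tokens = tokenize(prompt_text)         # 1. tokenizer -> code points
    for _ in range(n_new_tokens):
        x = embed(tokens[-1024:])          # 2. embeddings (1024-token window)
        x = attention(x)                   # 3. attention between tokens
        logits = feed_forward(x)           # 4. neural network feedforward
        tokens.append(most_likely(logits)) # predict the next token, then repeat
    return detokenize(tokens)
```

Note how the input is truncated to the last 1024 tokens: that is the attention-window limit discussed above.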


Tokenizer

The tokenizer is a way to “map” the input data into “code points”. GPT-2 uses Byte Pair Encoding (BPE) to do it, mapping into a vocabulary of 50,257 code points. So we now have our nice vector of code points.
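To give a feel for BPE, here is a toy illustration of the core mechanism: repeatedly merge the most frequent adjacent pair of symbols. GPT-2's real tokenizer learns its merges over bytes from a huge corpus; this sketch just runs a few merges over characters.

```python
# Toy Byte Pair Encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent pair of symbols. GPT-2 learns ~50k merges over bytes; here we
# run three merge rounds over characters just to show the mechanism.
from collections import Counter

def most_frequent_pair(tokens):
    # Count all adjacent pairs and return the most common one
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
for _ in range(3):  # three merge rounds: l+o -> lo, lo+w -> low, ...
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a few rounds the frequent substring “low” becomes a single token, which is exactly how BPE ends up assigning whole common words their own code points.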


Embeddings

As said, this step maps the vector of code points into a wider space: GPT-2 uses a 768-dimensional space.

Two matrices are used: one does the token embedding, mapping each token into the 768-dimensional space (WTE), and the other takes care of the position (WPE).

WTE is a 50257×768 matrix.
WPE is a 1024×768 matrix, which, as you can see, accounts for the attention window of length 1024.

These values have been “learned” during GPT-2 training and try to capture the relation between words and their positions.

We end up summing these two embeddings:

# token + positional embeddings
x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]
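That line can be run as-is with numpy. Here is a self-contained version with tiny made-up dimensions instead of GPT-2's real 50257×768 and 1024×768 matrices, and random weights instead of learned ones:

```python
import numpy as np

# Toy dimensions instead of GPT-2's real ones (vocab 50257, n_embd 768, window 1024)
n_vocab, n_ctx, n_embd = 10, 8, 4
rng = np.random.default_rng(0)
wte = rng.standard_normal((n_vocab, n_embd))  # token embedding matrix (WTE)
wpe = rng.standard_normal((n_ctx, n_embd))    # positional embedding matrix (WPE)

inputs = [3, 1, 4, 1]  # a sequence of token ids from the tokenizer

# token + positional embeddings: [n_seq] -> [n_seq, n_embd]
x = wte[inputs] + wpe[range(len(inputs))]
print(x.shape)  # (4, 4)
```

Note that token id 1 appears twice in the input but gets two different vectors, because the positional embedding differs: that is the whole point of WPE.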


Attention

The attention phase is the most complex one. The idea is to express a way for words to “influence” one another:

To enable this transfer of meaning from one token to another, we need to allow the vectors of all the tokens to influence each other.

Cited from “Happy New Year: GPT in 500 lines of SQL” at EXPLAIN EXTENDED.

We can model it as a function Attention(Q, K, V) based on 12 sets (the attention heads) of three matrices called Q (query), K (key), and V (value). The idea is to have something which is easy to differentiate.

Deep Neural Network

The final step is the neural network, which predicts the next token based on the previous pre-processing steps.
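As an illustration of that final step, here is a toy two-layer feedforward block followed by a projection to vocabulary logits. Dimensions and weights are made up; GPT-2's real block expands 768 → 3072 → 768 with a GELU activation, and reuses the WTE matrix to produce logits over the vocabulary:

```python
import numpy as np

def gelu(z):
    # GPT-2's activation function (tanh approximation of GELU)
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def feed_forward(x, w1, w2):
    # Position-wise two-layer network: expand, nonlinearity, project back
    return gelu(x @ w1) @ w2

rng = np.random.default_rng(0)
n_seq, n_embd, n_vocab = 4, 8, 10               # toy sizes, not GPT-2's real ones
x = rng.standard_normal((n_seq, n_embd))        # output of the attention phase
w1 = rng.standard_normal((n_embd, 4 * n_embd))  # expand (GPT-2: 768 -> 3072)
w2 = rng.standard_normal((4 * n_embd, n_embd))  # project back (3072 -> 768)
wte = rng.standard_normal((n_vocab, n_embd))    # reused token embedding matrix

h = feed_forward(x, w1, w2)
logits = h @ wte.T                    # one score per vocabulary token
next_token = int(np.argmax(logits[-1]))  # the MOST likely next token id
print(h.shape, logits.shape)
```

The argmax on the last row is exactly the “predict the most likely next term” idea from the beginning of the article; real generation then appends that token and runs the whole pipeline again.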
