Build a Large Language Model (From Scratch)

Notes on Sebastian Raschka's book.

Chapter 2: Working with text data

Neural networks work with numeric data rather than raw text.
This means LLMs need a component that converts text into numeric representations, such as embeddings.
Because LLMs are next-word predictors (more precisely, next-token predictors), training and testing data must be formatted as (input-tokens, next-token) pairs derived from sample text.
There are a few main steps involved in converting text into LLM-ready numeric data:
- Tokenizing text
  - This is similar to tokenization in compilers and interpreters.
  - Text can be tokenized at different granularities: by word, by character, or by subword units that sit between the two.
  - These tokens are then assigned numeric IDs based on a vocabulary of possible tokens.
  - Modern LLMs commonly use subword tokenization schemes such as BPE (byte-pair encoding) or close variants of it. The tiktoken library, for example, is an implementation of BPE.
  - There are several design choices here, including how to encode punctuation and whitespace, and how to represent the start and end of a piece of text.
- Preparing training/testing pairs
  - A sliding window approach is typically used. During training, the target is usually not a single next token but a vector of tokens with the same length as the input, shifted one position to the right. For example, with a context size of 4, the input [t1, t2, t3, t4] is paired with the target [t2, t3, t4, t5].
  - Intuitively, a target vector gives the model a training signal at every position in the input sequence, not just the last one, so it learns next-token prediction at all positions in parallel.
  - The input-tokens vector has the same size across all generated training/testing pairs. This sequence length is the context size. In practice, the model architecture defines a maximum context length, and training samples are usually chosen to fit within that limit.
- Generating token embeddings
  - The final step is to generate an embedding for every token, i.e. a numeric vector that encodes the token's "meaning".
  - These embeddings are initialized from random weights and are then adjusted during training.
  - In addition to token embeddings, we also need a way to encode a token's position in the sequence. Otherwise, the same token would always have the same embedding regardless of where it appears.
    - There are two main ways to do this:
      - Absolute positional embeddings: add a positional embedding to each token embedding (e.g. position x in a sequence gets a corresponding trained positional embedding E_x).
      - Relative positional embeddings: encode the relative distance between tokens. In this approach, absolute position matters less than how far apart two tokens are. This is part of the self-attention mechanism rather than the embeddings themselves.
  - These token embeddings form the model's initial input representations, before the transformer layers process them.
    - At the embedding stage, a token's learned embedding is tied to the token itself. If absolute positional embeddings are added, the combined input representation depends on both the token and its position. With relative positional approaches, positional information is typically introduced later through the attention mechanism.
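
The steps above can be sketched end to end on a toy example. This is a minimal illustration, not the book's implementation: it assumes a simple word-level regex tokenizer instead of BPE, uses NumPy arrays as stand-ins for trainable embedding layers, and the names `sliding_window_pairs`, `token_embedding`, and `position_embedding` are hypothetical.

```python
import re
import numpy as np

text = "the dog sits on the couch. the dog sleeps."

# Word-level tokenization: split into words and punctuation marks.
# (Assumption for illustration; real LLMs typically use subword schemes such as BPE.)
tokens = re.findall(r"\w+|[^\w\s]", text)

# Vocabulary: map each distinct token to an integer ID.
vocab = {}
for tok in sorted(set(tokens)):
    vocab[tok] = len(vocab)
token_ids = [vocab[tok] for tok in tokens]


def sliding_window_pairs(ids, context_size, stride=1):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(ids) - context_size, stride):
        inputs = ids[start : start + context_size]
        targets = ids[start + 1 : start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


context_size = 4
pairs = sliding_window_pairs(token_ids, context_size)

# Embedding tables start out random; training would adjust them.
rng = np.random.default_rng(0)
d_emb = 8
token_embedding = rng.normal(size=(len(vocab), d_emb))
position_embedding = rng.normal(size=(context_size, d_emb))

inputs, targets = pairs[0]
# Absolute positional embeddings: add the row for position p to the token at position p.
x = token_embedding[inputs] + position_embedding[np.arange(context_size)]
```

Here `x` has one row per input token and `d_emb` columns, which is the shape the attention layers in the next chapter operate on.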

Chapter 3: Coding attention mechanisms

Attention mechanisms are used to represent the surrounding context of a token.
- For example, consider two sentences that both contain the word "dog": "The dog sits on the couch" and "While walking home, I was bitten by a dog." Even though the same word appears in both, the surrounding context is different. Self-attention helps the model account for that contextual information.
The goal of self-attention is to generate a context vector for every input embedding: a vector that represents the meaning of that embedding in the context of the full sequence.
- At a high level, computing the context vector for an input embedding within a sequence involves the following steps:
  - Compute attention scores
    - For every pair of input embeddings (i, j), compute a score that represents how much attention to pay to input embedding j when processing input embedding i.
    - This produces a matrix of size (d_seq, d_seq), where d_seq is the sequence length (the number of tokens in the input).
  - Normalize attention scores
    - Adjust the attention scores so that every row of the attention matrix sums to 1.
    - This normalization is usually done with the softmax function.
  - Compute context vectors
    - For each input embedding i, use the normalized attention scores (the attention weights) over all input embeddings j to compute a weighted sum of their value vectors; in simple self-attention, the value vectors are the input embeddings themselves. Conceptually, this looks like sum(attention_weights(i, j) * value_j), where the sum is taken over all positions j.
    - In simple self-attention, this can be written as a matrix multiplication such as attention_weights * input_embeddings. If the attention matrix has size (d_seq, d_seq) and the value/input matrix has size (d_seq, d_emb), the resulting context matrix has size (d_seq, d_emb).
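
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration with random stand-in embeddings; the `softmax` helper and the variable names are assumptions made for this example.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
d_seq, d_emb = 5, 4
x = rng.normal(size=(d_seq, d_emb))  # stand-in input embeddings

# Step 1: attention scores, one dot product per (i, j) pair -> (d_seq, d_seq)
attention_scores = x @ x.T
# Step 2: normalize so that every row sums to 1
attention_weights = softmax(attention_scores)
# Step 3: context vectors as weighted sums of the inputs -> (d_seq, d_emb)
context_vectors = attention_weights @ x
```
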
Modern LLM implementations use trainable weights for the self-attention mechanism.
- This involves three weight matrices that are adjusted during training: W_query, W_key, and W_value.
- Generally, these matrices have a size of (d_emb, d_emb), but they can also map into a different size such as (d_emb, d_ctx).
- The input embeddings are projected using the query, key, and value weight matrices: Q = X * W_query, K = X * W_key, and V = X * W_value.
- Attention scores are computed using the projected query and key embeddings. Conceptually, this can be written as Q * transpose(K), which corresponds to taking the dot product between the query projection of each input embedding and the key projection of every embedding in the sequence, including itself.
- The attention scores are then normalized using the softmax function to obtain the attention weights.
  - Before normalization, the attention scores are divided by the square root of the key dimension d_k. This keeps the dot products from growing too large, which would push softmax into regions with very small gradients and hurt training stability. The attention weights are therefore computed as softmax(attention_scores / sqrt(d_k)).
- Finally, the context vectors are computed using the value projection: attention_weights * V.
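
A minimal sketch of self-attention with trainable projections, following the formulas above. The weight matrices are random here because no training takes place; the function name `self_attention` and the chosen sizes are assumptions for illustration.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(x, w_query, w_key, w_value):
    """Scaled dot-product self-attention for one sequence x of shape (d_seq, d_emb)."""
    queries = x @ w_query  # Q = X * W_query
    keys = x @ w_key       # K = X * W_key
    values = x @ w_value   # V = X * W_value
    d_k = keys.shape[-1]
    attention_scores = queries @ keys.T
    attention_weights = softmax(attention_scores / np.sqrt(d_k))
    return attention_weights @ values, attention_weights


rng = np.random.default_rng(0)
d_seq, d_emb, d_ctx = 5, 6, 4
x = rng.normal(size=(d_seq, d_emb))
w_query = rng.normal(size=(d_emb, d_ctx))
w_key = rng.normal(size=(d_emb, d_ctx))
w_value = rng.normal(size=(d_emb, d_ctx))

context_vectors, attention_weights = self_attention(x, w_query, w_key, w_value)
```

With these sizes, `context_vectors` has shape (d_seq, d_ctx), showing that the projections may map into a dimension different from d_emb.
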
Since LLMs are next-token predictors, they should only have access to past tokens and not future tokens.
- As such, the self-attention logic is modified so that the model only considers past tokens when computing context vectors. This is called causal attention.
- Conceptually, causal attention means paying zero attention to future tokens during training.
- This means the attention score matrix is masked above the main diagonal. In practice, these entries are usually set to minus infinity before applying softmax: because e^-inf = 0, the masked positions receive a weight of exactly zero and each row still sums to 1.
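
The masking step can be sketched as follows. This is an illustrative NumPy version; `causal_softmax` is a hypothetical helper name, and the scores are random stand-ins.

```python
import numpy as np


def causal_softmax(scores):
    """Mask entries above the main diagonal with -inf, then apply row-wise softmax."""
    d_seq = scores.shape[-1]
    # True where column j > row i, i.e. where token j lies in the future of token i.
    future = np.triu(np.ones((d_seq, d_seq), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row max is finite.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) evaluates to 0.0
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
d_seq = 4
attention_scores = rng.normal(size=(d_seq, d_seq))
attention_weights = causal_softmax(attention_scores)
```

The first row of `attention_weights` puts all of its weight on the first token, since that token has no past positions to attend to other than itself.
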
Another technique used in self-attention mechanisms is dropout, which helps avoid overfitting.
- In this context, dropout means randomly zeroing out some attention weights during training; it is turned off during inference.
- This discourages the model from relying too heavily on any specific set of attention weights or hidden units, spreading information more evenly across the network.
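
A minimal sketch of dropout applied to attention weights. It assumes the common "inverted dropout" convention, in which the surviving entries are rescaled by 1 / (1 - drop_rate) so the expected value stays the same; the helper name and the uniform example weights are assumptions for illustration.

```python
import numpy as np


def dropout(weights, drop_rate, rng):
    """Zero each entry with probability drop_rate and rescale the survivors."""
    keep_mask = rng.random(weights.shape) >= drop_rate
    return weights * keep_mask / (1.0 - drop_rate)


rng = np.random.default_rng(0)
attention_weights = np.full((4, 4), 0.25)  # uniform stand-in attention weights
dropped = dropout(attention_weights, drop_rate=0.5, rng=rng)
```

With a drop rate of 0.5, every entry of `dropped` is either 0 or twice its original value.
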
One last improvement is multi-head attention.
- Rather than using a single self-attention pass, multi-head attention runs several attention heads in parallel, each with its own W_query, W_key, and W_value matrices.
- This results in multiple context vectors per token, which are then combined (typically by concatenation).
- These context vectors can encode different types of patterns or relationships. For example, in a sentence such as "the brown dog is sleeping," one attention head might pay more attention to the verb "sleeping," while another head might pay more attention to the adjective "brown."
- If n_heads is the number of parallel self-attention heads, each head usually produces a context vector of length d_ctx / n_heads so that, after combining the outputs from all heads, the final context vector has length d_ctx.
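
A sketch of multi-head attention that concatenates the outputs of independent causal heads. It is an illustration under stated assumptions: the helper names are hypothetical, the weights are random rather than trained, and the output projection that real implementations commonly apply after concatenation is omitted.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def causal_attention_head(x, w_query, w_key, w_value):
    """One head: scaled dot-product attention with future positions masked out."""
    queries, keys, values = x @ w_query, x @ w_key, x @ w_value
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    weights = softmax(np.where(future, -np.inf, scores))
    return weights @ values  # (d_seq, d_head)


def init_heads(d_emb, d_ctx, n_heads, rng):
    """Hypothetical helper: one random (W_query, W_key, W_value) triple per head."""
    d_head = d_ctx // n_heads
    heads = []
    for _ in range(n_heads):
        heads.append(tuple(rng.normal(size=(d_emb, d_head)) for _ in range(3)))
    return heads


def multi_head_attention(x, heads):
    """Run every head on the same input and concatenate along the feature axis."""
    outputs = [causal_attention_head(x, *weights) for weights in heads]
    return np.concatenate(outputs, axis=-1)  # (d_seq, n_heads * d_head)


rng = np.random.default_rng(0)
d_seq, d_emb, d_ctx, n_heads = 5, 8, 8, 2
x = rng.normal(size=(d_seq, d_emb))
heads = init_heads(d_emb, d_ctx, n_heads, rng)
context = multi_head_attention(x, heads)
```

Each head produces vectors of length d_ctx / n_heads, so the concatenated result has length d_ctx per token.
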
The explanations above describe a single attention layer, but these layers can be stacked on top of one another. Chapter 1 mentioned that GPT-3 had 96 transformer layers. Different layers may learn to attend to different patterns or structures in the input sequences, and more abstract relationships may only emerge in later layers of the stack.
- This is one reason d_ctx is usually set to match d_emb: the resulting context vectors from one layer are treated as inputs to the next layer.

Chapter 4: Implementing a GPT model from scratch to generate text

At its core, a GPT model consists of the following components, in this order:
- token embedding layer, which maps token IDs to learned embedding vectors, plus positional embeddings if the model uses absolute positional embeddings
  - dropout is applied to the resulting embedding activations (i.e. values are randomly zeroed out during training)
- multiple transformer blocks
- normalization layer
  - normalization layers are also used internally within the transformer blocks
- linear layer, which projects from embedding space back into vocabulary space
The normalization layer improves training stability: keeping activations in a consistent range helps the network adjust its weights efficiently as it learns the patterns in the data.
- Layer normalization normalizes each token's hidden vector independently across its features, so the values for that token have a mean of 0 and a variance of 1.
- In practice, layer normalization includes two trainable parameters, scale and shift, and the resulting normalization formula is: scale * (x - mean(x)) / sqrt(var(x) + eps) + shift, where eps is a small constant (e.g. 0.00001) that prevents division by zero.
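
The formula above translates directly into NumPy. This is a minimal sketch; it assumes the biased variance (dividing by the number of features), and `scale` and `shift` are set to their usual initial values of ones and zeros rather than trained.

```python
import numpy as np


def layer_norm(x, scale, shift, eps=1e-5):
    """Normalize each row (token) of x across its features, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # biased variance: divides by n
    return scale * (x - mean) / np.sqrt(var + eps) + shift


rng = np.random.default_rng(0)
d_seq, d_emb = 3, 6
x = rng.normal(loc=2.0, scale=3.0, size=(d_seq, d_emb))
normalized = layer_norm(x, scale=np.ones(d_emb), shift=np.zeros(d_emb))
```

With scale set to ones and shift to zeros, every row of `normalized` has a mean of approximately 0 and a variance of approximately 1.
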
A transformer block also includes a feed-forward network.
- This consists of two linear layers combined with an activation function, such as GELU.
- The first linear layer expands the feature dimension, the activation function is applied, and the second linear layer contracts the feature dimension again. This means the embeddings matrix goes from (d_seq, d_emb) to (d_seq, k * d_emb) and then back to (d_seq, d_emb), where k is the expansion factor (4 in GPT-2).
- Expanding the feature dimension allows the model to explore a richer representation space.
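
A sketch of the feed-forward network with an expand-then-contract shape. It uses the tanh approximation of GELU and an expansion factor of 4; the random weights and the function names are assumptions for illustration.

```python
import numpy as np


def gelu(x):
    """GELU activation, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def feed_forward(x, w1, b1, w2, b2):
    """Expand the feature dimension, apply GELU, then contract back to d_emb."""
    hidden = gelu(x @ w1 + b1)  # (d_seq, expansion * d_emb)
    return hidden @ w2 + b2     # (d_seq, d_emb)


rng = np.random.default_rng(0)
d_seq, d_emb, expansion = 5, 8, 4
x = rng.normal(size=(d_seq, d_emb))
w1 = rng.normal(scale=0.02, size=(d_emb, expansion * d_emb))
b1 = np.zeros(expansion * d_emb)
w2 = rng.normal(scale=0.02, size=(expansion * d_emb, d_emb))
b2 = np.zeros(d_emb)

out = feed_forward(x, w1, b1, w2, b2)
```

The same weights are applied to every row, so each token is transformed independently of the others.
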
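To help with the vanishing gradient problem, the transformer block uses a technique called shortcut connections (also known as skip or residual connections).
- This involves adding a layer's input back to its output, which gives gradients a direct path around the layer during backpropagation.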
With these concepts in place, a transformer block consists of the following components, in order:
- normalization layer
- causal multi-head attention, which produces one context vector for every input token
- dropout
- shortcut connection, which adds the input to the attention sublayer back to its dropout output
- another normalization layer
- feed-forward network
- dropout
- another shortcut connection, which adds the input to the feed-forward sublayer back to its dropout output
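
The component list above can be assembled into a compact forward-pass sketch. This is an illustration under several assumptions, not the book's code: layer normalization omits the trainable scale and shift, weights are random, and names such as `init_params` and `run_stack` are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Trainable scale and shift omitted for brevity (same as scale=1, shift=0).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)


def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def dropout(x, rate, rng):
    if rate == 0.0:
        return x
    return x * (rng.random(x.shape) >= rate) / (1.0 - rate)


def causal_multi_head_attention(x, p):
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    outputs = []
    for w_q, w_k, w_v in p["heads"]:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = np.where(future, -np.inf, q @ k.T / np.sqrt(k.shape[-1]))
        outputs.append(softmax(scores) @ v)
    return np.concatenate(outputs, axis=-1)


def feed_forward(x, p):
    return gelu(x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]


def transformer_block(x, p, drop_rate, rng):
    shortcut = x
    x = layer_norm(x)                      # normalization layer
    x = causal_multi_head_attention(x, p)  # one context vector per token
    x = dropout(x, drop_rate, rng)
    x = x + shortcut                       # first shortcut connection
    shortcut = x
    x = layer_norm(x)                      # another normalization layer
    x = feed_forward(x, p)
    x = dropout(x, drop_rate, rng)
    return x + shortcut                    # second shortcut connection


def init_params(d_emb, n_heads, rng, expansion=4):
    d_head = d_emb // n_heads
    heads = []
    for _ in range(n_heads):
        heads.append(tuple(rng.normal(scale=0.02, size=(d_emb, d_head)) for _ in range(3)))
    return {
        "heads": heads,
        "w1": rng.normal(scale=0.02, size=(d_emb, expansion * d_emb)),
        "b1": np.zeros(expansion * d_emb),
        "w2": rng.normal(scale=0.02, size=(expansion * d_emb, d_emb)),
        "b2": np.zeros(d_emb),
    }


def run_stack(x, blocks, drop_rate, rng):
    for params in blocks:
        x = transformer_block(x, params, drop_rate, rng)
    return x


rng = np.random.default_rng(0)
d_seq, d_emb = 6, 8
x = rng.normal(size=(d_seq, d_emb))
blocks = [init_params(d_emb, n_heads=2, rng=rng) for _ in range(3)]
out = run_stack(x, blocks, drop_rate=0.0, rng=rng)
```

Because every block maps (d_seq, d_emb) to (d_seq, d_emb), the three blocks chain without any reshaping.
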
As mentioned previously, transformer blocks preserve the dimensions of the data they're working with. This makes it easy to link multiple such blocks together, which is what GPT models do.
The last step is to map the model outputs back into tokens. The final linear layer projects from embedding space into vocabulary space, producing logits. Each row contains logits for the next token at that position; applying softmax converts those logits into a probability distribution over the vocabulary.
- In practice, various sampling techniques are used so that the highest-probability token is not always selected; this introduces variability and diversity into the model's output.
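
A minimal sketch of the output step: project into vocabulary space, apply softmax, and pick the next token. Temperature scaling is shown as one example of a sampling technique; the function `next_token`, the random stand-in activations, and the sizes are assumptions for illustration.

```python
import numpy as np


def softmax(logits):
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def next_token(logits, rng=None, temperature=1.0):
    """Choose a token ID from the logits at the last position.

    With rng=None the highest-probability token is returned (greedy decoding);
    otherwise a token is sampled from the temperature-scaled distribution.
    """
    last = logits[-1]
    if rng is None:
        return int(np.argmax(last))
    probs = softmax(last / temperature)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
d_seq, d_emb, vocab_size = 4, 8, 10
hidden = rng.normal(size=(d_seq, d_emb))      # stand-in for the final normalized activations
w_out = rng.normal(size=(d_emb, vocab_size))  # final linear layer (bias omitted)
logits = hidden @ w_out                       # (d_seq, vocab_size)
probs = softmax(logits)                       # one distribution per position

greedy_id = next_token(logits)
sampled_id = next_token(logits, rng=rng, temperature=0.8)
```

Lower temperatures concentrate probability on the most likely tokens, while higher temperatures flatten the distribution and increase diversity.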