How LLMs Work

Note: This article focuses on inference: once a model has already been trained, how it processes text and generates the next token. Training is a different phase, where the model gradually learns its parameters from massive amounts of text. We will cover that in a separate article.

Large Language Models (LLMs) are machine learning models designed to understand and generate human language.

They are complex, but at a high level the overall process is surprisingly easy to follow.

Tokenization

The first step is tokenization: converting the text into a sequence of tokens.

A token is a unit of text, often a whole word, sometimes a subword. For example, "hello" can be one token, "world" can be one token, but "unbelievable" could be split into three tokens: "un", "believ", and "able".

Given the sentence "Gally is a smart dog", the tokens might look like: "Gally", "is", "a", "smart", "dog".

Step tokenization visualization

Once tokenized, the text is now a sequence of numbers: [4821, 318, 257, 4950, 3290].
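To make this concrete, here is a toy tokenizer sketch: a greedy longest-match over a small, made-up vocabulary. Real tokenizers (BPE, SentencePiece) are trained on data and work differently in detail, and the IDs below are illustrative, not any real model's IDs.

```python
# Toy vocabulary: maps text pieces to illustrative IDs (not real model IDs).
TOY_VOCAB = {"Gally": 4821, "is": 318, "a": 257, "smart": 4950, "dog": 3290,
             "un": 403, "believ": 6667, "able": 540}

def tokenize(text: str) -> list[int]:
    ids = []
    for word in text.split():
        # Whole words first, then greedy longest-prefix subword matching.
        if word in TOY_VOCAB:
            ids.append(TOY_VOCAB[word])
            continue
        while word:
            for end in range(len(word), 0, -1):
                piece = word[:end]
                if piece in TOY_VOCAB:
                    ids.append(TOY_VOCAB[piece])
                    word = word[end:]
                    break
            else:
                raise ValueError(f"cannot tokenize: {word!r}")
    return ids

print(tokenize("Gally is a smart dog"))  # [4821, 318, 257, 4950, 3290]
print(tokenize("unbelievable"))          # [403, 6667, 540]
```

Note how "unbelievable" falls apart into subwords the vocabulary does know, which is exactly how real tokenizers handle rare or novel words.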

Embedding

The next step is embedding. Tokens are just IDs, arbitrary numbers that carry no meaning on their own. They are simply pointers that let the model look up the corresponding entry in its vocabulary.

To make them useful, each token is mapped to a dense vector: a list of hundreds or thousands of floating-point numbers. You can think of it as a coordinate in a high-dimensional space, where position encodes meaning. Each token ID is looked up in the embedding matrix, which returns its corresponding vector.

Step embedding visualization

Words with similar meanings tend to have vectors that are close to each other in the embedding space.

Embedding space visualization

The model also adds positional information to each token embedding, so it knows whether a token appears at the beginning, middle, or end of the sequence. Without this information, "dog bites man" and "man bites dog" would look identical to the model.
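The lookup-then-add-positions step can be sketched in a few lines of NumPy. The dimensions and weights here are tiny and random, purely for illustration (real models use vocabularies of tens of thousands of tokens and vectors of thousands of dimensions, e.g. 4,096 for Llama 3.1 8B), and the position table is a simplified stand-in for the various schemes real models use (sinusoidal encodings, RoPE, learned tables).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 10, 4, 5

# One row per token in the vocabulary; each row is that token's vector.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([7, 3, 1, 8, 5])  # stand-ins for the real IDs

# Embedding is literally a row lookup: one vector per token ID.
x = embedding_matrix[token_ids]        # shape (5, 4)

# Add positional information so "dog bites man" != "man bites dog".
position_table = rng.normal(size=(seq_len, d_model))
x = x + position_table[np.arange(seq_len)]

print(x.shape)  # (5, 4): one vector per token, positions baked in
```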

Attention

Attention solves a problem that embedding alone cannot: a word by itself is not enough.

"Smart" in "smart dog" does not mean the same thing as "smart" in "smart phone".

In the sentence "Gally is a smart dog", "smart" is directly related to "dog", but also related to "Gally".

Three matrices (W_Q, W_K, W_V) project each token's vector into a query, a key, and a value. Queries and keys are compared to decide how relevant each token is to each other token; the values carry the information that gets mixed in, weighted by that relevance. In effect, each token "asks questions" of the others and retrieves the answers.

Step attention visualization

After the attention step, the vector for "smart" is different. It is now enriched with information about "dog" (heavily weighted) and "Gally" (moderately weighted).
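Here is a minimal single-head self-attention sketch on random numbers. It omits things real models have (multiple heads, a causal mask, layer normalization), but the query/key/value mechanics are the same.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8

x = rng.normal(size=(seq_len, d_model))  # token vectors after embedding

W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V

# Each token's query is compared with every token's key...
scores = Q @ K.T / np.sqrt(d_model)      # shape (5, 5)

# ...softmax turns scores into attention weights that sum to 1 per row...
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# ...and each token's output is a weighted mix of all the value vectors.
out = weights @ V                        # shape (5, 8)

print(weights.sum(axis=-1))  # each row sums to 1
```

Row 4 of `weights` (the row for "smart" in our example) is exactly where "dog" and "Gally" would receive high values in a trained model.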

Feed-Forward

After attention, each vector goes through a pair of large matrices called the feed-forward network.

Let’s take Llama 3.1 8B as an example.

The first matrix expands the vector to 14,336 dimensions, and the second compresses it back down to 4,096. This expansion-and-compression process is where much of the model's memorized associations and factual knowledge is encoded: that "dog" is a mammal, that it is domestic, that "Gally" is a first name.

Step feed-forward visualization

These two matrices alone account for roughly 117 million parameters per layer. And it is precisely here that most of what we want to compress lives.
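The parameter arithmetic checks out, and the expand-compress pattern itself fits in a few lines. The sketch below uses scaled-down toy dimensions for the actual computation (the real matrices would occupy hundreds of megabytes), and it follows the article's simplified two-matrix view; the released Llama models actually use a gated variant with a third matrix, but the shapes are the same.

```python
import numpy as np

# Llama 3.1 8B feed-forward dimensions: 4,096 -> 14,336 -> 4,096.
d_model, d_ff = 4096, 14336
params = d_model * d_ff + d_ff * d_model
print(f"{params:,}")  # 117,440,512 -> roughly 117 million per layer

# Same shape pattern, scaled down so it runs instantly.
rng = np.random.default_rng(0)
dm, dff = 8, 28
W_up = rng.normal(size=(dm, dff))
W_down = rng.normal(size=(dff, dm))

x = rng.normal(size=(dm,))
h = np.maximum(x @ W_up, 0.0)  # expand + nonlinearity (ReLU here; Llama uses SiLU)
y = h @ W_down                 # compress back down to the model dimension
print(y.shape)  # (8,)
```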

Layers

This block, attention followed by feed-forward, repeats 32 times in a row. With each layer, the vectors become more refined.

Think of it like a photograph developing in a darkroom: the first seconds reveal only rough shapes and contrasts. Details sharpen with each passing moment. By the end, the full image is there.

Layers visualization

Layer after layer, the model keeps transforming the same sequence until each token representation carries enough context to support the next prediction. Across the whole network, this adds up to billions of learned numbers.
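Structurally, the whole stack is just the same block applied over and over. The sketch below uses toy dimensions, random weights, and omits layer normalization; the point is the shape of the loop, not the numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 5, 8, 16, 32

def layer(x, W_Q, W_K, W_V, W_up, W_down):
    # Attention: mix information between tokens.
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    s = Q @ K.T / np.sqrt(d_model)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    x = x + w @ V  # residual connection
    # Feed-forward: transform each token vector independently.
    x = x + np.maximum(x @ W_up, 0.0) @ W_down
    return x

x = rng.normal(size=(seq_len, d_model))
for _ in range(n_layers):
    # Each layer has its own weights; here they are random and rescaled.
    shapes = [(d_model, d_model)] * 3 + [(d_model, d_ff), (d_ff, d_model)]
    x = layer(x, *[rng.normal(size=s) * 0.1 for s in shapes])

print(x.shape)  # the sequence keeps its shape; only the contents are refined
```

Note that the sequence never changes shape: every layer reads and writes the same (tokens × dimensions) array, refining its contents each time.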

Prediction

At the output of the last layer, one final matrix projects the last token's vector onto the full vocabulary. The result is a probability distribution over every possible token, from tens of thousands up to about 128,000 for Llama 3.1. The model does not "choose" a word in a human sense: it computes how plausible each possible next token is, then a decoding strategy selects one candidate.

Prediction visualization

Suppose "and" is selected. It is added to the sequence, which becomes "Gally is a smart dog and", and the cycle starts again from the beginning. A model generates text this way, one token at a time, consulting its billions of parameters at every step.
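The final projection and decoding step can be sketched as follows. The six-word vocabulary, the `unembed` matrix, and the input vector are all made up, and greedy decoding (pick the most probable token) is just one of several decoding strategies (sampling, top-k, and so on).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 6
VOCAB = ["Gally", "is", "a", "smart", "dog", "and"]

# Final projection matrix: one column of scores per vocabulary token.
unembed = rng.normal(size=(d_model, vocab_size))

def next_token(last_vector):
    logits = last_vector @ unembed       # one raw score per token
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax -> probability distribution
    return int(np.argmax(probs)), probs  # greedy decoding: take the max

last_vector = rng.normal(size=(d_model,))  # the last token's final vector
token_id, probs = next_token(last_vector)
print(VOCAB[token_id], float(probs[token_id]))
```

The selected token would then be appended to the input sequence and the entire forward pass, all 32 layers, runs again from the top.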

It is this repeated cost, layer by layer and token by token, that makes model size so critical in practice.

What is easy to forget is just how literal and mechanical this process really is: for every single token, the model runs through all 32 layers, multiplies huge matrices together, and produces the next token. Then it does it again. And again. There is no intuition there, no shortcut, just an enormous amount of arithmetic repeated for every generated token.

Once you see that, the next question becomes obvious: how can we store those billions of numbers more efficiently without breaking the model?

That is exactly what quantization tries to solve, and we’ll look at that in a later article.