How a Language Model Picks the Next Word

AI, LLM, explainer

Everyone talks about large language models. Very few people can say what one actually does. The honest answer is almost disappointingly simple. An LLM has one job: look at the text so far and guess the next word. Then it adds that word on and guesses again. Whole essays get written one guess at a time.

The interesting part is what happens in between reading your text and making the guess. Five steps, no maths needed. I’ll walk through each one with an everyday analogy.

1. Tokenise: breaking text into pieces

Before the model can do anything, it breaks your text into pieces. It reads a sentence the way a musician reads sheet music, one mark at a time rather than all at once. Each piece is a chunk of text, and the model only knows a fixed set of them. That set is its whole vocabulary, and it can never step outside it. Common words are usually one chunk. Rarer ones get split into smaller fragments it has seen before, so “cat” stays whole while “unhappiness” comes apart into something like “un” and “happiness”.

These pieces are called tokens. From here on the model never sees letters or words again. Only tokens.
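
For readers who like to see it in code, here is a minimal Python sketch of the splitting. The vocabulary is hand-picked for illustration, and the "take the longest piece you recognise" rule is an assumption made to keep things short. Real tokenisers usually learn their pieces from data, with methods such as byte-pair encoding.

```python
# Hypothetical vocabulary: a real one holds tens of thousands of pieces.
VOCAB = {"cat", "sat", "the", "un", "happiness", "happy", "ness"}

def tokenise(word, vocab=VOCAB):
    """Split one word into the longest known pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining slice first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenise("cat"))          # ['cat']
print(tokenise("unhappiness"))  # ['un', 'happiness']
```

A real model would then swap each piece for its ID number in the vocabulary. That part is left out here.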

2. Embed: turning pieces into positions

Next, each chunk becomes a point in space. Spotify does something similar when it quietly arranges songs so the similar ones sit near each other. Words get arranged the same way. Ones used alike cluster together. Ones with nothing in common drift apart. “King” and “queen” end up as neighbours; “king” and “celery” sit nowhere near each other.

That spacing can be regular enough to do arithmetic with. In the older word-embedding models that first popularised the idea (word2vec is the best known), you could start at “king”, subtract “man”, add “woman”, and end up near “queen”. The list of coordinates for each token is called its embedding. The giant table of all of them is the embedding matrix.

One catch. This is only a starting position. A token gets the same embedding whatever sentence it turns up in. Working out what it means in this particular sentence is the next step’s job.
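
Here is a toy version of that table in Python. The coordinates are made up by hand so the arithmetic works out. Real embeddings are learned during training and have hundreds or thousands of numbers per token, not three.

```python
import math

# Hand-made 3-number embeddings, purely illustrative.
EMBEDDINGS = {
    "king":   (0.9,  0.3, 0.0),
    "queen":  (0.9, -0.3, 0.0),
    "man":    (0.1,  0.3, 0.0),
    "woman":  (0.1, -0.3, 0.0),
    "celery": (0.0,  0.0, 1.0),
}

def distance(a, b, table=EMBEDDINGS):
    """Straight-line distance between two tokens' positions."""
    return math.dist(table[a], table[b])

def analogy(a, b, c, table=EMBEDDINGS):
    """Compute a - b + c, coordinate by coordinate."""
    return tuple(x - y + z for x, y, z in zip(table[a], table[b], table[c]))

def nearest(point, table=EMBEDDINGS):
    """Return the token whose position is closest to the given point."""
    return min(table, key=lambda token: math.dist(table[token], point))

# "king" sits closer to "queen" than to "celery".
print(distance("king", "queen") < distance("king", "celery"))  # True

# king - man + woman lands nearest to queen.
print(nearest(analogy("king", "man", "woman")))  # queen
```

Notice that the lookup takes no account of the sentence. The same token always returns the same coordinates.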

3. Attention: meaning from context

A word doesn’t carry one fixed meaning. It settles into one based on the words around it. Picture a juror hearing a single piece of evidence: their read on it shifts as the rest of the testimony comes in. The word “bank” works the same way. On its own it’s a coin toss. Drop it into “I sat on the river bank” and the surrounding words pin down the water meaning, while the money one quietly falls away.

This step is called attention. It is what makes modern models as good as they are. Every token gets to look at every earlier token and adjust itself to fit.
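
A stripped-down sketch of the idea, in plain Python. It makes two big assumptions: the two-number vectors are invented (first number for "water", second for "money"), and the tokens are compared directly. A real model first passes each token through learned query, key and value transformations, and does this many times over in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(vectors):
    """Each vector becomes a weighted blend of itself and the vectors before it."""
    dim = len(vectors[0])
    updated = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # this token and every earlier one
        # How well does this token match each visible token?
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # Blend the visible vectors using those weights.
        blended = [
            sum(w * v[d] for w, v in zip(weights, visible))
            for d in range(dim)
        ]
        updated.append(blended)
    return updated

river = [1.0, 0.0]  # all water, no money
bank = [0.5, 0.5]   # undecided: the coin toss
updated = causal_attention([river, bank])
print(updated[1])  # [0.75, 0.25]
```

After looking back at “river”, the vector for “bank” leans towards water. “River” itself stays put, because it has nothing earlier to look at.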

4. Logits: scoring every possible word

Now the model scores every word that could come next (strictly, every token in its vocabulary). A search engine does the same thing out of sight, rating every page before it hands you a ranked list. Each candidate word gets a number, and a higher number means a better fit. The numbers are raw and a bit messy: one word scores 4.2, the next lands at minus 0.5. Don’t read too much into the values themselves. Only the gaps between them carry meaning.

These raw scores have a name too. Logits.
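
The claim that only the gaps matter is easy to check. The sketch below borrows the scores-to-percentages conversion from the next step. The candidate words and the 2.9 are invented for illustration. Only the 4.2 and the minus 0.5 echo the example above.

```python
import math

# Made-up scores for what might follow "The cat sat on the".
logits = {"mat": 4.2, "sofa": 2.9, "moon": -0.5}

def to_shares(scores):
    """Convert raw scores into shares that sum to 1 (the softmax of step 5)."""
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Rank the candidates, best fit first.
ranked = sorted(logits, key=logits.get, reverse=True)
print(ranked)  # ['mat', 'sofa', 'moon']

# Add 100 to every score: the values change, the gaps do not.
shifted = {word: s + 100.0 for word, s in logits.items()}
before = to_shares(logits)
after = to_shares(shifted)
same = all(math.isclose(before[w], after[w], rel_tol=1e-9) for w in logits)
print(same)  # True
```

Every score went up by 100 and the shares did not move, which is why a single logit read on its own tells you very little.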

5. Softmax: turning scores into percentages

Finally, those messy scores turn into clean percentages that add up to 100%. Think of an election. The raw scores are how loudly each candidate’s supporters are cheering; this step turns that cheering into actual vote shares. There’s a twist. The loudest voice gets rewarded extra, so a small lead in cheering swells into a big lead in votes.

One dial controls how dramatic that gets. It’s called temperature. Turn it down and the favourite nearly always wins. Turn it up and the underdogs start getting a real look-in. That one dial is why the same model can sound buttoned-up and predictable one moment, loose and inventive the next.
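
In code, the dial is a single division. This is a minimal sketch of softmax with temperature in plain Python. The three scores are made up, and the function name and signature are just one reasonable way to write it.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into shares that sum to 1, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtracting the top score keeps exp() from overflowing
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores: favourite, runner-up, underdog
for temperature in (0.5, 1.0, 2.0):
    shares = [round(p, 2) for p in softmax(logits, temperature)]
    print(temperature, shares)
# 0.5 [0.87, 0.12, 0.02]
# 1.0 [0.67, 0.24, 0.09]
# 2.0 [0.51, 0.31, 0.19]
```

At temperature 1 the favourite leads the runner-up by one point of score and ends up with nearly three times the share. Halve the temperature and the favourite takes about 87%. Double it and the underdog climbs from 9% to 19%.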

Try it yourself

Reading about it only gets you so far. Below is the whole pipeline, working. Type a phrase, step through the five stages, and watch the last word travel from plain text to a prediction. On the final stage, drag the temperature dial and watch the percentages shift.

[Interactive demo: “How a language model picks the next word”. Type a phrase, or pick one of the examples (“The cat sat on the”, “Once upon a”, “I think therefore I”), then walk through the five stages: Tokenise, Embed, Attention, Logits, Softmax.]

Note: the next-word options here are hand-picked for a few example phrases to keep things readable. A real model weighs up every word in its vocabulary at once. The stages, and the softmax maths behind the percentages, work exactly as shown.

So what is a model, really?

Put the five steps back to back and the whole thing reads as one sentence. The model breaks your text into tokens, places each token in a space where meaning is distance, lets the tokens trade context until each one knows what it means here, scores every word that could come next, and turns those scores into percentages to pick from. Then it adds the winning word to your text and runs the whole loop again for the next one.
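
The loop itself fits in a few lines of Python. In this sketch the hard part is faked: `TOY_LOGITS` is a hypothetical hand-written table standing in for steps 1 to 4, and it only looks at the last token. A real model computes its scores from the whole text so far.

```python
import math
import random

# Hypothetical stand-in for steps 1 to 4: last token -> scores for the next one.
TOY_LOGITS = {
    "the": {"cat": 2.0, "mat": 1.5},
    "cat": {"sat": 3.0, "the": 0.1},
    "sat": {"on": 3.0, "the": 0.5},
    "on":  {"the": 3.0, "cat": 0.2},
    "mat": {"sat": 1.0, "on": 0.5},
}

def softmax(logits, temperature=1.0):
    """Step 5: turn scores into shares that sum to 1."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(tokens, temperature=1.0, rng=random):
    """Score the candidates, convert to shares, and pick one at random by share."""
    scores = TOY_LOGITS[tokens[-1]]
    candidates = list(scores)
    shares = softmax([scores[c] for c in candidates], temperature)
    return rng.choices(candidates, weights=shares, k=1)[0]

def generate(prompt_tokens, steps, temperature=1.0, rng=random):
    """Add the picked token to the text and go round again."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        tokens.append(next_token(tokens, temperature, rng))
    return tokens

# With the dial turned almost all the way down, the favourite always wins.
print(generate(["the"], 4, temperature=0.01))
# ['the', 'cat', 'sat', 'on', 'the']
```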

That’s the whole trick. Everything else, the eye-watering scale, the training, the cost, goes into making those five steps good enough that the guesses start to feel like understanding.