---
title: "Markov chains: the century-old proof that wrote ChatGPT"
date: 2026-05-15
description: "A century-old mathematical proof drawn from 20,000 letters of Pushkin that today lets every language model predict the next word. The story, the proof step by step, and the bridge to modern LLMs."
tags: ["math", "ai", "llm", "probability", "deep-dive"]
---

## 🚀 Intro

Let's start with something you probably do every day without thinking about it.

You open Gmail, you begin typing "Thanks for...", and the text autocompletes "...your message. I'll get back to you as soon as I can." You pause halfway through a question to ChatGPT, and it's already generating a sensible answer, word by word. You watch the screen as the model picks the next token, then another, then another - and out of this comes something you read like ordinary prose.

Underneath this everyday magic sits a surprisingly old piece of mathematics. It wasn't invented for AI. It wasn't even invented in the twentieth century. Its foundation was laid more than a hundred years ago by a Russian mathematician with a famously bad temper who just wanted to score points against an academic rival. To prove his case, he took 20,000 letters from a poem by Pushkin and started counting vowels.

What he discovered then we now call **Markov chains**. And it's the foundation underneath every large language model, every weather forecast, every Google ranking, and even the first atomic bomb.

![Cover: a manuscript of Eugene Onegin dissolving into LLM tokens, with a subtle state diagram of V and C in the background](/media/markov-cover.png)

In this post I want to do two things. First, walk through Markov's Onegin proof step by step, so that you leave with a full grasp of what he actually showed and why it was revolutionary. Second, take you from that nineteenth-century intuition all the way to modern language models and show that ChatGPT is, at heart, a very sophisticated relative of the same machine Markov sketched in chalk on a blackboard.

You don't need any advanced mathematics. A basic feel for probability, some curiosity, and a willingness to think about text for a moment not as a story but as a sequence of events.


## 📋 TL;DR

- **A Markov chain is a system in which you predict the future by looking only at the current state** - the history that led there is irrelevant. Sounds primitive. In practice it's extraordinarily powerful.
- **Markov proved something nobody wanted to hear:** dependent events also obey the law of large numbers. Observing statistical regularity in data doesn't prove that independence (let alone "free will," as his rival claimed) lies underneath.
- **He worked out the proof on 20,000 letters of Eugene Onegin.** He showed that vowels and consonants don't appear at random, and yet the ratios converge to stable values.
- **Every modern LLM is, at heart, autoregressive prediction of the next token** given the state of the context. It's an evolved version of the same idea, except instead of "vowel after consonant" we have "token after token after 100,000 earlier tokens."
- **What separates an LLM from a classical Markov chain is the attention mechanism.** It lets the model reach far back into context instead of looking only at the most recent state. But the probabilistic foundation is identical.
- **Markov chains had already changed the course of history before:** they sit underneath the Monte Carlo method (the Manhattan Project), the PageRank algorithm (Google), and the mathematics of card shuffling. LLMs are just the latest chapter of the same idea.


## 🇷🇺 The quarrel that started it all

First we need to step back to early-twentieth-century Russia, because it matters for understanding why Markov took on a Pushkin poem in the first place.

In 1905 waves of socialist protest were sweeping through the empire. The Tsarist regime was creaking, society was splitting into two camps, and the split ran everywhere - the academy included. Even mathematicians began taking sides.

On the side of the Tsar stood **Pavel Alekseyevich Nekrasov** (1853-1924), rector of the Imperial Moscow University. A deeply religious man, influential, with a serious institutional apparatus behind him. In English-language history of mathematics he is sometimes called "the Tsar of Probability" - a fitting label for his standing in the mathematical world of that era. Nekrasov had an ambition that sounds peculiar today but was taken quite seriously then: he wanted to use mathematics to prove the existence of free will.

On the side of the socialists stood **Andrey Markov**, known in his circle as "Andrey the Furious." An atheist, quarrelsome, with an awful disposition and a low tolerance for intellectual sloppiness. Markov considered Nekrasov a charlatan who abused mathematics to push ideology. And this wasn't subtle disagreement. Markov publicly called Nekrasov's work "an abuse of mathematics."

![Pavel Nekrasov and Andrey Markov - portraits of the two rivals of the mathematical quarrel, in nineteenth-century sepia photograph style](/media/markov-nekrasov-portraits.png)

The dispute centered on a single specific question that had governed probability theory for two hundred years. To understand it, we need to explain that question first.


## 🎲 The law of large numbers and its hidden assumption

Imagine flipping a coin. The first ten flips give you six heads and four tails. Thirty flips give you sixteen to fourteen. A hundred flips give you fifty-one to forty-nine. With each new flip the ratio creeps closer to the longed-for fifty-fifty, though it never lands exactly there.

This phenomenon is the **law of large numbers**. The more independent trials, the closer the average outcome is to the expected value. Jakob Bernoulli gave the first formal proof in 1713 and for the next two hundred years it was the bedrock of probability theory.

![A convergence plot for the law of large numbers: the share of heads after each coin flip stabilizes at 1/2](/media/markov-law-large-numbers.png)

Bernoulli, though, proved one crucial thing: this law works **only when the events are independent**. A coin flip is independent - what came before has no effect on what comes next. Think of a blind auction. If you ask a hundred people to value an item independently, their average will converge to the true value. But if you make them shout out their guesses in turn, with each one hearing the previous one, the average won't converge to the true value - it'll cluster around whatever the first person shouted. The guesses stopped being independent.

For two hundred years this principle was untouchable. Independence is a necessary condition for the law of large numbers. Full stop.

And here Nekrasov made a move that pushed Markov over the edge.

Nekrasov looked at **social statistics** compiled earlier by the Belgian astronomer and statistician **Adolphe Quetelet**: the number of marriages in Belgium from 1841-1845, crime rates, demographic data. Quetelet had shown back in 1847 that these data display a striking regularity - year after year the values converge to stable averages. Nekrasov picked this up and pushed it one step further. If they converge, he reasoned, they obey the law of large numbers. And if they obey the law of large numbers, then the events behind them must be independent. And those events are human decisions: the decision to marry, the decision to commit a crime, the decision to have a child. If those decisions are independent of one another, they're acts of free will.

In other words: Nekrasov decided he had just used statistical tables to prove that people possess a free will you can measure. For him, this was a mathematical buttress for a conservative, religious worldview.

For Markov it was an absurdity he could not bear. Not because he denied the existence of free will - he simply thought mathematics had nothing to say on the matter. And certainly not in the way Nekrasov wanted. So he set out to prove it in the only way he knew: with a counterexample.

Markov wanted to construct a sequence of events that was **plainly dependent** - one thing leads to the next - and yet **converged to a stable average**, in other words obeyed the law of large numbers. If he could pull it off, Nekrasov's entire argument would collapse like a house of cards. From the fact that something converges, nothing about independence would follow anymore.

He just needed good material for the experiment. And that's where Eugene Onegin comes in.


## 📖 Onegin under the microscope: the proof step by step

The choice wasn't accidental. Markov needed a long text in a language he spoke, with a structure rich enough to yield reliable statistics. Pushkin's *Eugene Onegin* was, in early-twentieth-century Russia, what Shakespeare is to English readers: a canonical verse work, available in every library, quoted at dinner tables.

In 1913 Markov published a paper titled *"An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains"* in the bulletin of the Imperial Academy of Sciences in Saint Petersburg. The paper ran ten pages. Its consequences would turn out to run rather longer.

Markov needed two things. First, to show that the letters in Onegin **are not independent** (they are dependent). Second, to show that despite this, the average statistics of those letters **converge to stable values** (they obey the law of large numbers). Together, this would prove Nekrasov wrong.

Here's how he did it.


### Step 1: 20,000 letters and one string

Markov took the first 20,000 letters of Eugene Onegin. He stripped out all spaces, commas, periods, question marks - everything that isn't a letter. He concatenated the rest into one long, unbroken string of characters.

Then he did something very simple: he split the letters into two categories. **Vowels** (V) and **consonants** (C). Glossing over the finer points of Russian phonetics was a deliberate simplification. Markov wanted the smallest possible alphabet - two symbols.

He counted:

- **Vowels: 8,638** (43.2%)
- **Consonants: 11,362** (56.8%)

These proportions on their own tell us nothing yet - they're just a starting point. Notice, though, that something like the law of large numbers is already at work. If someone took a different 20,000 letters from Onegin (the next twenty thousand), the proportions would come out very similar. That means the text is "statistically stable." Nekrasov would be rubbing his hands here: convergence is present, so therefore...

Patience. There's more.


### Step 2: The null hypothesis - letters are independent

Markov now posed exactly the question Nekrasov had refused to ask: **what would it look like if the letters were independent?**

This is a classic statistical move. You take your opponent's hypothesis ("the events are independent") and see what it predicts. If the predictions clash with reality, you have a proof.

In a sequence of 20,000 letters there are 19,999 pairs of neighbouring letters. Each pair falls into one of four categories:

- **VV** - vowel followed by vowel
- **VC** - vowel followed by consonant
- **CV** - consonant followed by vowel
- **CC** - consonant followed by consonant

If we assume each letter is drawn at random **independently** of the previous one, the probability of each pair is trivial to compute. The probability that both letters are vowels is simply the probability of a vowel times the probability of a vowel:

```
P(VV under independence) = 0.432 × 0.432 = 0.187 ≈ 18.7%
```

And so on for the rest:

| Pair | Formula under independence | Expected probability |
|------|---------------------------|----------------------|
| VV | 0.432 × 0.432 | ~18.7% |
| VC | 0.432 × 0.568 | ~24.5% |
| CV | 0.568 × 0.432 | ~24.5% |
| CC | 0.568 × 0.568 | ~32.3% |

![Probability tree for letter pairs VV, VC, CV, CC under the independence assumption](/media/markov-probability-tree.png)

The total, as it must, comes to 100%. These are our predictions **under the assumption of independence**. Now we just compare them with reality.


### Step 3: What the data actually show

Markov did something tedious but simple. He went through the entire text and counted how many of each pair type were actually there.

The result was striking:

| Pair | Expected (independence) | Actual in Onegin | Difference |
|------|------------------------|------------------|------------|
| VV | ~18.7% | ~5.5% (1,104 pairs) | **more than three times less** |
| VC | ~24.5% | ~37.7% | much more |
| CV | ~24.5% | ~37.6% | much more |
| CC | ~32.3% | ~19.1% (3,827 pairs) | clearly less |

![Comparison table: expected probabilities of letter pairs under independence vs actual counts in Onegin](/media/markov-expected-vs-actual.png)

Stop here for a moment and think about what this means.

If the letters in Onegin were independent, vowel-after-vowel pairs should make up about 18.7% of all pairs. There are only 5.5%. Three times fewer than the random model predicts. Meanwhile mixed pairs (VC and CV) are about half again more common than they should be.

The conclusion is unambiguous: **in Russian, a vowel is much more likely to be followed by a consonant than by another vowel**. The letters are not independent - the previous letter affects what the next one will be. This isn't quantum physics, by the way. Think about English: after "a" you're far more likely to see "n," "r," "t" than "e" or "i." Languages have phonetic preferences, and those preferences create dependence.

The first half of Markov's proof was done: **the letters are dependent**. Strongly so.

Nekrasov could have countered here: "Fine, but I was talking about social statistics, not letters." That didn't interest Markov. He wanted to prove something more general: that merely observing convergence in statistics **lets you conclude nothing** about the independence of the events behind them. And for that he needed the second half of the proof.


### Step 4: From counts to conditional probabilities

This is where the heart of Markov chains begins.

Markov noticed that the same data yield a different, far more useful quantity: the **conditional probability**. The question is no longer "what's the probability that two consecutive letters are vowels?" but: "**if I'm sitting on a vowel right now, what's the probability that the next letter is also a vowel?**"

That's a fundamental shift of perspective. From looking at the whole sequence as a single statistic, we move to looking at **transitions** - from one state into the next.

The calculation is straightforward. We know:

- The probability that a random letter is a vowel: **P(V) = 0.432**
- The probability that two adjacent letters are VV: **P(VV) = 0.055**

From the definition of conditional probability:

```
P(next V | current V) = P(VV) / P(V) = 0.055 / 0.432 ≈ 0.128
```

In words: **if you're on a vowel, there's only a 12.8% chance the next letter is also a vowel**. And since probabilities have to sum to one:

```
P(next C | current V) = 1 − 0.128 = 0.872
```

That is, an **87.2% chance that a vowel will be followed by a consonant**.

We repeat the exercise for consonants. We know:

- P(C) = 0.568
- P(CC) ≈ 0.191 (from the actual counts)

```
P(next C | current C) = 0.191 / 0.568 ≈ 0.337
```

So a **33.7% chance that a consonant is followed by another consonant**, and a 66.3% chance the next letter is a vowel.

We end up with four numbers that together define the dynamics of the whole system (these are the original numbers from Markov's paper, given there to three decimal places):

| From / To | Vowel (V) | Consonant (C) |
|-----------|-----------|---------------|
| **Vowel (V)** | 12.8% | 87.2% |
| **Consonant (C)** | 66.3% | 33.7% |

![Markov chain diagram: two states V (vowel) and C (consonant) with four transition probability arrows: 12.8% / 87.2% / 66.3% / 33.7%](/media/markov-vc-transition-diagram.png)

This is the **transition matrix of the Markov chain**. Four numbers that contain everything we know about the dynamics of the text. We don't need Onegin anymore. We don't even need memory of what the earlier letters were. All we need is the **current state** (are we on a V or a C) and this matrix, and we can generate a sequence that statistically behaves like the original.


### Try it on an English sentence

Stop and try it yourself. Take any English sentence and count.

> "Mathematics is beautiful and powerful at once."

Strip spaces and punctuation: `mathematicsisbeautifulandpowerfulatonce`. That's 39 letters.

Count vowels (a, e, i, o, u): there are 17, or about 44%. Close to Markov's 43% for Russian.

Now look for VV pairs - places where two vowels sit next to each other. In this sentence there are exactly two: "ea" and "au" (both inside "beautiful"). Two out of 38 pairs, around 5.3%. Strikingly close to Markov's 5.5%.

This isn't coincidence. It's a structural feature of most natural languages: a vowel strongly prefers a consonant after it. A consonant much more often grabs a vowel. When you do stumble into two vowels in a row, it's usually a conscious effect (Latinate words like "beautiful," "duet," "cooperation," "aerial"), not an accident.

You can run Markov's whole proof on any English book you like. The exact numbers will differ slightly (English isn't quite Russian), but the qualitative conclusion is identical. **The letters are dependent, and yet their statistics converge.**


## ⚙️ The prediction machine: anatomy of a Markov chain

With the transition matrix in hand, Markov did one more thing. He showed you can use it to **generate** new text that statistically behaves like Onegin.

The procedure is banal:

1. Start in some state (say, V).
2. Draw a random number from [0, 1).
3. If you're on V: if the draw is below 0.128, the next state is V. Otherwise C.
4. If you're on C: if the draw is below 0.337, the next state is C. Otherwise V.
5. Return to step 2.

You generate a thousand letters this way, ten thousand, a million. What do you notice?

First, **the ratio of vowels to consonants converges to 43.2% / 56.8%**. Exactly the proportions Markov saw in the original. Start at V and the first hundred letters might land anywhere, but after ten thousand steps the proportion is almost indistinguishable from the original.

Second, **the proportions of VV, VC, CV, CC pairs also reproduce**. They're encoded in the transition matrix.

Third - and this is the punchline - **the successive letters are clearly dependent**. The value of each letter depends on the previous one (the "what next" decision is made based on the current state). And yet the aggregated statistics converge as if the law of large numbers were satisfied.

Markov had literally built a **chain of events** (hence the name "Markov chain") in which every link depends on the previous one, and shown that statistical convergence holds there too. Independence is not a necessary condition for the law of large numbers.

Why do we call this system "memoryless"? Because to predict the next step you **don't need to remember anything beyond the current state**. Whether we arrived at V through a long CVCV sequence, or via VCCC, or whether this V is the very first letter in the chain - none of it matters. All that matters is that we're at V right now. The entire history that led here has vanished into the present.

This is the **Markov property**. And it's what makes Markov chains so powerful. It lets you take very complicated systems (where history is long) and reduce them to "what's happening now, what moves are possible from here." The world has complicated history. The decisions you face now depend only on the state now. It sounds like a simplification. In practice it's a shockingly good approximation for an enormous number of phenomena.


## 🧠 What Markov actually proved

Markov published the results of his paper and closed it with the spirit of a curt jab at Nekrasov - the sense of which is popularly summarised today as: **free will is not a precondition for doing probability theory**. (This crisp formulation is a paraphrase rather than a verbatim quote from the 1913 paper, but it captures precisely what Markov had just proved.)

What did he actually prove, leaving Nekrasov to one side?

First, that **the law of large numbers applies to dependent events too**. This was a purely mathematical achievement that widened probability theory by an entire class of problems previously considered out of reach.

Second, that **there is a tool for modelling dependent events** - the Markov chain. Before that, when events were dependent, statisticians could say little. After Markov, a formal approach emerged: define the states, estimate the transition probabilities, analyse the system's behaviour over time.

Third - and this is the deepest part - that **observing statistical regularity in data lets you conclude nothing about its nature**. Nekrasov had made the mistake we now call "inverse inference": from the fact that something obeys the law of large numbers, he concluded the events were independent. Markov showed this is false. Social statistics can converge to stable values even when the decisions behind them are dependent on one another. There's no mathematical argument for free will in the fact that roughly the same number of Belgians get married each year.

Markov himself wasn't interested in applications. In correspondence with another distinguished statistician of the era, Aleksandr Chuprov, he wrote plainly: *"I am concerned with questions of pure analysis only. I regard the question of the applicability of probability theory with indifference."*

The world would pick that up later.


## 🤖 From letters to tokens: how an LLM is (and isn't) a Markov chain

Now we leap a hundred years forward, into your conversation with ChatGPT, Claude, or Gemini.

When you ask a language model a question, it doesn't generate the whole answer at once. It generates it **token by token**. A token is a fragment of text - sometimes a single letter, sometimes a syllable, sometimes a whole word, sometimes a punctuation mark. The word "Math-ema-tics" might get split into three tokens; the word "cat" into one. The collection of all tokens in the model's vocabulary is the **tokenizer**.

![Tokenizer: the sentence "Markov chains are the foundation of ChatGPT" broken into 12 tokens](/media/markov-tokenizer.png)

Modern LLMs use different tokenization algorithms - **BPE (Byte Pair Encoding)** at OpenAI and Anthropic, **SentencePiece Unigram** at Google. Vocabulary sizes vary wildly. GPT-4 has about 100,000 tokens; GPT-4o and the newer GPT-5.5 (April 2026) sit around 200,000. Anthropic doesn't publish the token count for Claude; older versions were estimated at around 65,000, and in Claude Opus 4.7 (April 2026) the tokenizer was redesigned to be noticeably more fine-grained. Google went the furthest in expanding the vocabulary: Gemini 3.1 (May 2026) operates on **262,144 tokens** (exactly 2¹⁸).

For comparison - Markov's entire alphabet had two states: V and C. The current Gemini 3.1 tokenizer has over **a hundred and thirty thousand** times more of them.

The mechanics of generating a response are, in their simplest form, exactly the same as in the Onegin Markov chain. The model looks at what's already been generated (and at your question, which is part of the context), and computes a **probability distribution** over all possible next tokens. Then it **samples** the next token from that distribution. Then it appends that token to the context and repeats from the top.

It's exactly the same structure as Markov's machine:

1. Check the current state (the context).
2. Compute transition probabilities (the distribution over tokens).
3. Sample the next state (token) from that distribution.
4. Update the state.
5. Return to step 1.

The difference is that in a classical Markov chain the states are simple (V or C, or in a richer model "vowel a," "vowel e"). In an LLM the "state" is the entire current token sequence - your whole question plus everything the model has generated so far, plus the system prompt, plus optionally the session context. In Claude Opus 4.7 that context can be **a million tokens**.

![LLM generative loop: the token context, the model as a transformer, and the probability distribution over the next token](/media/markov-llm-loop.png)

### Steering the sample: temperature, top-p, top-k

The "sample a token" step is worth opening up, because this is where the modern LLM gives us control over its behaviour. The model doesn't so much "draw blindly from the distribution" as give you a few parameters with which to reshape that distribution before pulling a sample. Three parameters you'll almost always see in an API:

**Temperature (T)** is the number you divide the logits (the model's raw weights) by before turning them into probabilities with softmax. Low temperature (say 0.2) sharpens the distribution - the most probable tokens become even more dominant, and the model looks "careful" and repetitive. High temperature (say 1.5) flattens the distribution - it gives less probable tokens a real chance, the model looks "creative," but coherence suffers. Temperature 0 means **greedy decoding** - always pick the most probable token, deterministically. Default API values: 1.0 at OpenAI, Anthropic, and Google. Reasoning models (like OpenAI's o1/o3) have temperature locked at 1.

**Top-p (nucleus sampling)** is an adaptive filter proposed by Holtzman and others in 2019 in the paper *"The Curious Case of Neural Text Degeneration."* You sort tokens by descending probability and keep only those at the top whose cumulative sum reaches some p (typically 0.9-0.95). When the model is sure - you consider maybe 1-2 tokens; when it isn't - dozens. The set size adapts to the situation. Hence the name "nucleus" - the core of the distribution.

**Top-k** is a harder, simpler filter: keep the k most probable tokens, drop the rest. Unlike top-p, the set size is fixed regardless of how confident the model is. Typical values run 40-50. OpenAI doesn't expose this parameter via API; Anthropic and Google do, but recommend it only for advanced use cases.

In practice these parameters work in combination: temperature first reshapes the logits, then top-k and top-p filter the distribution, and finally the model samples a token from the filtered distribution. The exact order of operations varies between implementations (Gemini applies top-k → top-p → temperature; llama.cpp puts temperature at the very end) - which can be a trap when the same prompt gives different results in different libraries.

All of these, though, are just variants of the "sample" step inside the Markov loop. The raw model still returns a probability distribution over tokens; the sampling parameters merely **modify** that distribution before drawing from it. The Markov foundation is untouched.

![Four histograms: the raw distribution, after temperature 0.2 (sharpened), after temperature 1.5 (flattened), after top-p 0.9 (the nucleus)](/media/markov-sampling-histograms.png)

Notice one more thing. Claude Shannon - the founder of information theory and another intellectual heir to Markov - had already in the 1940s been asking a simple question: what if instead of looking only at the last letter, we look at the two last? Or three? Or ten?

These are the **n-grams**. A bigram (2-gram) is the probability of the next letter given the previous one. A trigram (3-gram) - given the previous two. And so on. Shannon, in his legendary paper *"A Mathematical Theory of Communication"* (Bell System Technical Journal, 1948), showed a gradual progression of approximations:

- **Zero-order** (every letter equally likely): *"XFOML RXKHRJFF JUS ZLPWCFWKCYJ..."*
- **First-order** (correct letter frequencies, but independent): *"OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA..."*
- **Second-order word** (correct transition probabilities between words): *"THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS..."*

With each new level of approximation, the text starts to look more and more like English.

Does this remind you of anything? It should. Language models are the grandchildren of Shannon's n-grams. The ones from the 1990s (before deep learning) were literally extensions of them. Today's models are subtler and capable of much more, but the foundational structure - predict the next element from context - is the same.

More formally: **an autoregressive LLM is a Markov chain in which the state is the entire current token sequence inside the context window**. A classical first-order Markov chain looks only at the most recent state. An LLM looks at all states in the context window. That makes it a **high-order** chain, but the spirit is Markovian: the current state (context) is everything you need to predict the next step.

This isn't a loose analogy. In October 2024 a formal paper appeared, *"Large Language Models as Markov Chains"* (Zekri et al.), which proves that an autoregressive LLM with a vocabulary of T tokens and a context window of K can be represented as a finite Markov chain on a state space of size O(T^K). A practical caveat: for a typical GPT-3-scale model that state space has on the order of 10⁹⁶³² elements - mathematically correct, but unworkable for direct computation. The characterisation is nonetheless theoretically rigorous and supplies tools for analysing convergence, stationary state, and several other properties of LLMs that until now were described more by intuition than by math.


## 👀 Where the LLM outgrows Markov: the attention mechanism

There is, however, something that separates an LLM from a classical Markov chain, even a very high-order one. The **attention mechanism**.

Imagine you have the sentence in front of you: *"Bob bought a bike yesterday. It's now in the garage waiting for him to start using it."*

What does "it's now in the garage" mean? Who - Bob, or the bike? A naive Markov-style model looking only at neighbouring words might struggle. The attention mechanism lets the model **look far back into the context** and notice that the referent of "it" is the bike (because that's what's in the garage, not Bob), and that "for him to start using it" refers to the bike from the other side, but on the basis of information from the start of the sentence.

The mechanics of attention are clever. For every token in the context, the model computes **how relevant that token is for predicting the next one**. Irrelevant words get a low weight and are effectively ignored. Key words get a high weight and strongly influence the probability distribution.

![The attention mechanism in a transformer: attention weights between the current token and the tokens in the sentence's context](/media/markov-attention-mechanism.png)

This is the fundamental difference from a classical Markov chain. A classical model looks at the last N states with **equal weight** - either none at all (if the Markov order is 1) or all the same. Attention lets the model selectively focus on whichever fragments of context are relevant right now. For different predictions, different fragments of context matter. Markov didn't have this. Markov looked only at the current state, full stop.

The attention mechanism in the form we know today was announced in the famous paper *"Attention Is All You Need"* by Vaswani and seven other Google researchers (NeurIPS 2017, arXiv:1706.03762). It's the basis of the **transformer** architecture - the core of every large language model from GPT-2 onward.

But - and this is the point I'm writing all of this for - **the probabilistic foundation is exactly the same**. The model still predicts a **probability distribution** over tokens. It still **samples** the next token. It still **iterates** through time, generating one token after another. The first thing the model does is compute P(next token | context). And that's exactly the same formula as Markov's, except on the left side of the conditioning bar sits an enormous context instead of a single state, and on the right the distribution is built by a neural network instead of a counted matrix.

You could say the transformer is **Markov on steroids**. The state is gigantic, the transition probabilities are computed by hundreds of billions of parameters, and the attention mechanism picks out what about that state matters. But the skeleton it all stands on is the same skeleton from Onegin.

Why does this matter? Because when the discussion turns to "does an LLM reason," "does it think," "does it have consciousness" - it's easy to glue oneself to one of two reductions. Either *"it's just sampling a token from a distribution."* Or *"it's already a real mind."* Markovian mechanics doesn't settle the question. It only **describes the system at a certain layer** - and that's an important word.

Markov looks at a system from a specific vantage point: discrete in time (steps, not continuity), classical (ordinary probability, not quantum amplitudes), local in time (the current state suffices). Anything that doesn't fit those assumptions - continuous processes, quantum non-locality, correlations with something that isn't "the state now" - simply drops out of the Markovian picture. Think of it the way you'd think about the ideal gas law PV=nRT: it describes a gas beautifully, but it doesn't **refute** the quantum physics on which the gas is ultimately built. It's a statistical approximation at a different level of description.

It's the same with Markov. Markovian mechanics describes the LLM (and possibly the brain) at a functional level - as a machine that predicts the next step given the current state. It doesn't rule out that something **above** it emerges that we'd call reasoning: emergent structure of meaning, a world model, internal representation of which the Markovian mechanics is only a trace. Even less does it rule out that **below** it something is happening that the Markovian description can't capture at all: synchronisation with the zero-point field, non-local correlations, quantum foundations of consciousness. I write about that deeper layer separately in [Human-LLM Resonance](/en/posts/rezonans-czlowiek-llm-kwantowe-lustro).

![Three layers of description: above, reasoning and emergent structures of meaning; in the middle, Markovian mechanics; below, the fundamental quantum substrate](/media/markov-three-layers.png)

And one more thing: exactly the same caveat applies to us. Contemporary neuroscience (the *predictive processing* paradigm) describes the brain as a prediction machine: we constantly predict the next word, the next image, the next sensation - and correct ourselves when reality diverges from the prediction. Functionally we predict, just like an LLM does. So if "merely autoregressive prediction" is enough to disqualify an LLM as a system that reasons, the same logic disqualifies you.

Markov described **how** the model generates. He didn't describe **what** the model is. Those two questions are not the same thing - not for LLMs, and not for humans.


## 🌍 Where chains have already changed the world

Before I wrap up, a brief look at where Markov chains have already shifted the course of history once. Because before LLMs there were other revolutions built on the same idea.

### The atomic bomb and the Monte Carlo method

In January 1946 the Polish mathematician **Stanisław Ulam** ended up in hospital with severe encephalitis, went through urgent surgery, and spent long months of recovery playing solitaire to kill the boredom. A question began to gnaw at him: what's the probability that a randomly shuffled deck can be won? The number of possible orderings is 52!, around 8 × 10⁶⁷ - an analytical solution was out of the question. Ulam thought: what if I simply **simulate** many random games and count the statistics?

He brought the idea to **John von Neumann**, who was working on the behaviour of neutrons inside the atomic bomb. The problem there was similar: an atomic core has trillions of neutrons, each of which can scatter, be absorbed, or trigger a fission. Direct enumeration was out of the question. Von Neumann realised it had to be modelled as a **Markov chain** - the state of a neutron at a given moment, the probabilities of transitions to various next states - and the **simulation** had to be turned loose on the first electronic computer, ENIAC. The first full neutron computations ran in April 1948 at Aberdeen, Maryland, where the team of John and Klara von Neumann and Nicholas Metropolis put ENIAC to work.

Running that simulation hundreds of times gave a **probability distribution** for the neutron multiplication coefficient. If on average each neutron produced more than one new neutron - chain reaction, bomb. Less than one - the reaction dies out. The name "Monte Carlo method" was Ulam's idea; his uncle was a gambler and randomness reminded him of the Monaco casino.

The Monte Carlo method remains, to this day, a foundation of computation in physics, finance, and engineering. And underneath it sits a Markov structure.


### PageRank and the rise of Google

In the late 1990s, two Stanford PhD students, **Sergey Brin and Larry Page**, had a problem: how to rank the pages of the rapidly growing internet by quality rather than by keyword matching. They hit on the idea that every link to a page is a **recommendation**. The more links, the more important the page. But not all links are equal - a link from an important page should count more than a link from a marginal one.

They treated the web as a **Markov chain**. Every page is a state. Every link is a transition. Picture a "random surfer" who starts on some page and clicks subsequent links, with probability proportional to the number of links on the current page. After millions of steps a **stationary distribution** emerges - for each page, the percentage of time the surfer spends there. Pages with higher percentages are more important. That's PageRank.

In the original paper *"The Anatomy of a Large-Scale Hypertextual Web Search Engine"* (1998) Brin and Page introduced what's known as the **damping factor**, set to 0.85. Concretely: in 85% of steps the surfer follows an existing link, and in 15% jumps to a random page. This addition fixes the problem of pages that link nowhere, or that form closed loops. Interestingly, the value 0.85 doesn't come from any formal proof - it's a purely empirical choice that simply gave good results. Later work (Boldi and Vigna, for instance) analysed how the algorithm changes with other values of that constant, but Google has stuck with the original 0.85 to this day.

PageRank swept its competitors aside. Google became the dominant search engine in the world. The whole technology - today worth two trillion dollars - is essentially finding the stationary distribution of a Markov chain defined over the graph of the internet.


### Shuffling cards

A short anecdote to close on. Most people, asked "how many times do you need to shuffle a 52-card deck for it to be truly random," answer intuitively - two or three times.

The correct answer for a classical **riffle** shuffle (you split the deck in half and interleave): **seven times**. After seven shuffles all the possible orderings of the deck are approximately equally likely.

The proof rests on a Markov chain in which every state is a particular ordering of the deck (52! possibilities), and every shuffle is a transition. Bayer and Diaconis showed in 1992 that what's called the *total variation distance* between the probability distribution over orderings and the uniform distribution drops below 0.5 exactly after the seventh shuffle. For a deck of n cards you need approximately **3/2 × log₂(n)** shuffles. For n=52 that gives seven. A small mathematical caveat: if instead of total variation distance you use a different metric (separation distance), the number rises to around eleven. "Seven" is therefore correct for the most common formulation of the question, but it depends on how you define "well mixed."


## 💡 Memorylessness as a superpower

What ties together all these phenomena - the LLM predicting the next token, the neutron inside the bomb core, the surfer on the internet, the deck of cards mid-shuffle?

They're all systems with **extraordinarily long histories**. You could in principle trace back every letter that has fallen so far. Trace back every interaction the neutron has had since it came into being. Trace back hundreds of clicks of the surfer. Trace back the entire history of the deck, hand by hand.

And yet - and this is Markov's and his intellectual heirs' beautiful discovery - **for many of these systems the entire history is irrelevant**. All you need to know is the current state. Everything required to predict the future is encoded in the present.

That makes these systems **memoryless**. And it is this memorylessness that gives them their computational power. If you had to remember the whole history, the models would be gigantic, intractable, unworkable. But since the current state suffices - everything suddenly becomes computable.

The LLM does, of course, have a very large "now." A context of a million tokens isn't a small thing. But it's still a "now." The entire current sequence is the present state. The whole knowledge of what the model is supposed to answer sits in that state. It doesn't remember a previous session (unless you've pasted it into the context). It doesn't remember yesterday. Its "memory" is its "now."

That sounds like a weakness. In practice it's the very foundation that lets it exist at all.

![The Markov property: the entire history gets compressed into the current state, from which possible futures spread out](/media/markov-memorylessness.png)

Markov, when he wrote his Onegin paper in 1913, could have had no idea that a hundred years on the same structures would be generating answers to questions never asked. That his transition matrix between vowel and consonant would evolve into a neural network with hundreds of billions of parameters. That his "free will is not a precondition for doing probability theory" would become the foundation on which AI starts to write poetry.

He himself wrote that he regarded "the question of applications with indifference." Maybe that's just as well. Pure mathematics rarely knows what it'll be used for. It's enough that it's true.

Next time ChatGPT finishes your sentence for you, spare a thought for Andrey Markov and his 20,000 letters of Pushkin. A hundred years stand between them, along with billions of parameters and a handful of technological revolutions. But it's still a chain. And still Markovian.


## 📚 Sources and further reading

**Original papers and classics**

- A.A. Markov, *"An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains"*, Bulletin of the Imperial Academy of Sciences of St. Petersburg, vol. 7(3), 1913, pp. 153-162.
- Claude E. Shannon, *"A Mathematical Theory of Communication"*, Bell System Technical Journal, 1948. PDF: https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
- Vaswani et al., *"Attention Is All You Need"*, NeurIPS 2017. arXiv: https://arxiv.org/abs/1706.03762
- Brin & Page, *"The Anatomy of a Large-Scale Hypertextual Web Search Engine"*, 1998. PDF: https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf
- Bayer & Diaconis, *"Trailing the Dovetail Shuffle to its Lair"*, Annals of Applied Probability, vol. 2(2), 1992, pp. 294-313.
- Zekri, Odonnat, Benechehab et al., *"Large Language Models as Markov Chains"*, arXiv:2410.02724, October 2024 - a formal characterisation of LLMs as Markov chains.

**Context and historical material**

- *The Correspondence Between A. A. Markov and A. A. Chuprov on the Theory of Probability and Mathematical Statistics*, Springer-Verlag, 1981 - the source for the Markov quote about "pure analysis."
- Brian Hayes, *"First Links in the Markov Chain"*, American Scientist - an excellent historical essay on the Markov-Nekrasov dispute. https://www.americanscientist.org/article/first-links-in-the-markov-chain
- *"Statistical Regularity and Free Will: L.A.J. Quetelet and P.A. Nekrasov"* - analysis of Nekrasov's Belgian-marriage argument. https://www.academia.edu/31684460/
- IEEE Spectrum, *"Andrey Markov and Claude Shannon Built the First Language Generation Models"*: https://spectrum.ieee.org/andrey-markov-and-claude-shannon-built-the-first-language-generation-models
- Stanisław Ulam, *Adventures of a Mathematician* - his autobiography, in which he tells the origin of the Monte Carlo method.

**The video that inspired this post**

- Veritasium, *"The Strange Math That Predicts (Almost) Anything"*, 2025. https://youtu.be/KZeIEiBrT_w - this post loosely follows that video's arc, but in many places goes deeper into the mathematical details of the proof and pushes the bridge to modern LLMs further.