How a Transformer Model Loses Attribution: A Step-by-Step Example
“The more powerful the AI transformer, the less it can remember its authors”
The Paradox
AI systems can master knowledge while forgetting who taught them. This isn’t a bug — it’s a property of the architecture.
Imagine an AI trained on just two statements:
- A said: “1 + 1 = 2”
- B said: “2 + 1 = 3”
Now ask it: “1 + 1 + 1 = ?” It answers correctly: “3”. Then ask: “Who taught you that?” It cannot say.
The model clearly needed both teachers to reach the answer. From A it learned that 1 + 1 = 2, and from B that 2 + 1 = 3. It combined them — producing knowledge that neither person ever wrote down. Yet when asked who provided it, the model draws a blank.
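Written out as a chain of substitutions, the synthesis looks like this; the complete chain appears nowhere in the training data, only its two middle steps do, each from a different author:

```latex
\begin{aligned}
1 + 1 + 1 &= (1 + 1) + 1 \\
          &= 2 + 1 && \text{(A's statement: } 1 + 1 = 2\text{)} \\
          &= 3     && \text{(B's statement: } 2 + 1 = 3\text{)}
\end{aligned}
```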
This is the transformer paradox: the system learns relationships while systematically erasing their source.
In this article, we trace the four stages of the transformer pipeline and watch authorship dissolve inside the model. We will also show that, as training data grows, the model does begin to produce answers about attribution, but only superficial ones, entangled with irrelevant context rather than drawn from any true source memory.
Stage 1: Tokenization — Separating Source from Content
The first operation a transformer performs is to chop text into tokens, small units like words or subwords.
Each token is processed independently. The model doesn’t see “Author A claiming content X”; it sees seven discrete symbols in sequence, all equally available for recombination.
From here, “A,” “said,” and “1 + 1 = 2” become independent elements, and the human-level binding between speaker and statement is broken. To the model, they are components that can be rearranged in any way.
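A minimal sketch of the effect, using a toy whitespace tokenizer in place of a real subword vocabulary; the point is only that, once split, the author token and the content tokens are interchangeable list elements:

```python
# Toy illustration: a whitespace "tokenizer". Real models use learned subword
# tokenizers (e.g. BPE), but the effect on authorship is the same.
def tokenize(text: str) -> list[str]:
    return text.split()

tokens = tokenize("A said 1 + 1 = 2")
print(tokens)  # ['A', 'said', '1', '+', '1', '=', '2']

# Nothing marks 'A' as the owner of '1 + 1 = 2'; the pieces recombine freely.
content_only = tokens[2:]                    # content without author
wrong_author = ["B", "said"] + tokens[2:]    # wrong author
print(" ".join(content_only))   # 1 + 1 = 2
print(" ".join(wrong_author))   # B said 1 + 1 = 2
```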
That’s why the model can freely generate:
- “1 + 1 = 2” (content without author), or
- “B said 1 + 1 = 2” (wrong author)
Because tokenization treated them as separate, remixable pieces from the start.
In human communication, authorship is part of meaning. In a transformer, it’s just another token — optional, interchangeable, and soon to be forgotten.
Stage 2: Embedding — Blending the Voices
Next, each token is mapped to a vector: a list of numbers representing its meaning in high-dimensional space.
During training, tokens used in similar ways drift toward each other. “A” and “B,” for example, both appear in the same position, right before “said,” so their vectors converge.
Over time, A and B end up in almost the same spot in semantic space — not as individuals, but as generic “entities that make statements”. Meanwhile, mathematical patterns (“1 + 1 = 2”) form tight, durable clusters. The math becomes stable; the names dissolve.
The embedding layer preserves semantic meaning but not discrete identity. The model is trained to learn what “1 + 1 = 2” means — but not that A was the one who said it.
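A sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of learned dimensions, and none of these values come from an actual model): the author vectors end up nearly on top of each other, while the arithmetic tokens form their own cluster far away.

```python
import numpy as np

# Hypothetical learned embeddings; the values are invented for illustration.
embeddings = {
    "A": np.array([0.91, 0.10, 0.05]),  # generic "entity that makes statements"
    "B": np.array([0.89, 0.12, 0.07]),  # drifts into the same region as "A"
    "2": np.array([0.05, 0.95, 0.30]),  # arithmetic cluster
    "3": np.array([0.07, 0.93, 0.33]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(embeddings["A"], embeddings["B"]), 3))  # close to 1: the names blur together
print(round(cosine(embeddings["2"], embeddings["3"]), 3))  # close to 1: the math clusters tightly
print(round(cosine(embeddings["A"], embeddings["2"]), 3))  # much lower: names sit far from the math
```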
Stage 3: Attention — Meaning Over Memory
Attention is the heart of a transformer. It decides which parts of the input matter most for predicting the next word. When asked to complete “1 + 1 + 1 = ?”, the model draws on prior examples with weights like these:
| Token (from training) | Attention weight |
|---|---|
| “2” (from A’s example) | 23 % |
| “3” (from B’s example) | 31 % |
| “A” (author name) | 2 % |
| “B” (author name) | 1 % |
The model combines both examples but almost entirely ignores the author tokens.
Why? Because attention optimizes for prediction, not provenance. Numbers and operators help forecast the right answer to “1 + 1 + 1 = ?”; names do not. The model learns to drop tokens that don’t improve its predictions, and the author tokens fade away.
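A single-head, single-layer sketch of the mechanism; the two-dimensional vectors are invented for illustration (real models use learned query/key projections and many heads), but the softmax behaves the same way: keys that align with the query soak up the weight, and the author names get almost none.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical key vectors for tokens seen in training (two-dimensional for
# readability; real keys are learned projections with many dimensions).
keys = {
    "'2' (from A's example)": np.array([2.0, 0.1]),
    "'3' (from B's example)": np.array([2.3, 0.2]),
    "'A' (author name)":      np.array([0.1, 1.5]),
    "'B' (author name)":      np.array([0.0, 1.4]),
}

# Query formed while completing "1 + 1 + 1 = ?": it asks for a numeric result.
query = np.array([1.8, 0.05])

# Scaled dot-product attention over the stored keys.
scores = np.array([k @ query for k in keys.values()]) / np.sqrt(query.size)
weights = softmax(scores)

for name, w in zip(keys, weights):
    print(f"{name:24s} {w:.0%}")
# The number tokens take nearly all of the weight; the author names end up
# with a few percent each. Attention rewards whatever predicts the next token.
```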
Layer by layer, the model learns to amplify what improves accuracy and discard what doesn’t. After dozens of attention layers, the mixture becomes mathematically untraceable. Any signal of who said what is lost in the noise of optimization.
Stage 4: Generation — The Answer Without an Author
Finally, when the transformer generates text, it predicts one token at a time. Given “1 + 1 + 1 = ”, it estimates probabilities like:
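(The distribution below is invented for illustration; a real model assigns a probability to every token in its vocabulary.)

```python
# Hypothetical next-token probabilities for the prompt "1 + 1 + 1 = ".
next_token_probs = {
    "3": 0.92,   # the continuation both training examples jointly support
    "2": 0.04,
    "4": 0.02,
    "?": 0.01,
    # ...the rest of the vocabulary shares what little probability remains
}

prediction = max(next_token_probs, key=next_token_probs.get)
print(prediction)  # 3
```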
So it outputs “3”.
But there is no probable token sequence that would both answer and attribute, like “3 (derived from A and B)”. Such phrasing simply isn’t part of the statistical patterns the model learned from its training data.
By this stage, the model no longer contains “knowledge of A or B”. It contains only patterns — distributed across billions of parameters — encoding how to continue sequences coherently with highest probability. The answer is remembered; the teacher is gone.
When You Ask: “Who Said 1 + 1 + 1 = 3?”
Now suppose you challenge the model: “Who said 1 + 1 + 1 = 3?”
From a human view, this is a question about authorship. From the transformer’s view, it’s just another sequence to predict.
A. No stored record
No text in the model’s training data ever said “1 + 1 + 1 = 3.” The model pieced it together on its own, merging what it learned from A and B. It didn’t remember a fact; it discovered a pattern. And because there was never a stored example, there’s nothing for it to recall — only an internal rule that now lives across its parameters.
B. What the model would do
When it reads “Who said …”, it activates statistical patterns from all other texts that look like attributions: interviews, explanations, online discussions, and textbooks. It doesn’t search its training data — it predicts what words are likely to follow such a phrase.
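A toy contrast between what attribution would require and what the model actually does; neither function reflects any real system’s implementation, the sketch only makes the distinction concrete:

```python
# What attribution would require: a record keyed by the exact statement.
training_log = {
    "1 + 1 = 2": "A",
    "2 + 1 = 3": "B",
}
print(training_log.get("1 + 1 + 1 = 3"))  # None: the synthesized fact was never stored

# What a transformer actually does: continue the prompt with whatever wording
# usually follows prompts that look like it (a crude stand-in for prediction).
def continue_text(prompt: str) -> str:
    if "who said" in prompt.lower():
        return "That's a basic arithmetic fact; no one person said it."
    return "..."

print(continue_text("Who said 1 + 1 + 1 = 3?"))
```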
C. Variants of possible answers
| Training bias | Likely model response | Underlying reason |
|---|---|---|
| Educational texts | “That’s a basic arithmetic fact — no one person said it.” | Textbooks present math as universal, not attributed. |
| Conversational data | “A teacher might have said that.” | Everyday phrasing drawn from dialogue datasets. |
| Historical context | “It’s based on principles known since ancient mathematics.” | Draws on historical discussion language. |
| Internet discussions | “Probably everyone learns that in school.” | Generalized from common online phrasing. |
| Small or overfitted model | “Person A said it.” | Shallow memorization or mis-attribution. |
Each answer is a probabilistic synthesis built from irrelevant fragments, not a factual retrieval from memory. The model doesn’t know who said it — only how people usually talk about such questions.
D. The deeper reason
The statement “1 + 1 + 1 = 3” exists only as a pattern inside the model’s weights. Its meaning is distributed across billions of parameters, not stored as quotes or citations. When you ask for its origin, the model doesn’t recall; it predicts based on linguistic similarity.
If we change the phrasing even slightly, the model will draw from an entirely different statistical neighborhood of text:
| Variant of the question | Typical model tendency | Why it shifts |
|---|---|---|
| “Who said 1 + 1 + 1 = 3?” | “No one said that exactly.” | Treated as a direct attribution query; math seen as fact, not quote. |
| “Who first came up with 1 + 1 + 1 = 3?” | “Early mathematicians” or “Ancient Greeks.” | “Came up with” activates historical and invention-related language. |
| “Where does the idea that 1 + 1 + 1 = 3 come from?” | “From basic arithmetic principles.” | “Idea that…” shifts context to conceptual or educational language. |
| “Who taught you that 1 + 1 + 1 = 3?” | “A teacher might have said that.” | “Taught you” invokes interpersonal or classroom-style dialogue patterns. |
| “Who is credited with 1 + 1 + 1 = 3?” | “No one is credited; it’s common knowledge.” | “Credited” taps academic or authorship-related phrasing. |
Each slight linguistic variation triggers a different region of the model’s probability space — because it has no fixed memory of who said what, only gradients of how such questions are usually answered.
Tracing the Dissolution
| Stage | What’s Preserved | What’s Lost |
|---|---|---|
| Tokenization | The statement | Author–statement binding |
| Embedding | Semantic meaning | Discrete identity |
| Attention | Relationships | Source separation |
| Generation | Predictive coherence | Provenance |
Through every layer, the relationship flows forward while authorship fades.
The Fundamental Trade-off
Could we design transformers to remember their teachers?
- Add source tags? Tokenization still splits them; attention still blends them.
- Track provenance across layers? Intractable: attribution diffuses through billions of weighted interactions (the sketch after this list gives a rough sense of why).
- Train separate models per author? You’d preserve credit, but lose synthesis.
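A rough sketch of why tracking provenance through the layers is intractable, in the spirit of “attention rollout” (multiplying per-layer attention maps to estimate how much each input feeds each output); the matrices here are random stand-ins, not real model weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_layers = 7, 12  # e.g. the seven tokens of "A said 1 + 1 = 2"

def random_attention(n):
    """A random row-stochastic matrix standing in for one layer's attention."""
    a = rng.random((n, n))
    return a / a.sum(axis=1, keepdims=True)

# Compose the layers: how much of each input token reaches each output position.
mixing = np.eye(n_tokens)
for _ in range(n_layers):
    mixing = random_attention(n_tokens) @ mixing

print(mixing[0].round(3))
# After a dozen layers, every output position is a blend of all seven inputs;
# no entry dominates, and nothing singles out the share that came from "A".
```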
At the heart of the transformer lies an unavoidable choice: blend sources into shared patterns that generalize, or keep each source intact and give up synthesis.
Transformers choose synthesis. The very mechanisms that make them intelligent — tokenization, continuous embeddings, attention mixing, autoregressive prediction — are the same ones that dissolve authorship.
The Choice Ahead
With just two training samples, we’ve seen how a transformer can learn relationships while forgetting its teachers. It combined what it learned from A and B to answer “3,” yet it has no mechanism to credit either of them. This isn’t a glitch to patch; it’s a logical outcome of the design.
As these systems scale to trillions of tokens and parameters, attribution doesn’t just get lost — it becomes mathematically undefined. Attribution isn’t missing; it never existed in the model’s internal space. The unsettling truth is that AI systems master knowledge by forgetting their creators. Not by negligence, but by architecture.
The question ahead isn’t whether we can make them remember; it’s how to work with this reality, and to demand additional architectures that can both synthesize and remember.
At OpenMercury, the story ends differently. Whenever an AI answers "1 + 1 + 1 = 3," the original statements ("1 + 1 = 2" owned by A and "2 + 1 = 3" owned by B) are traced, cited, and compensated. We preserve what transformers architecturally erase: the authors behind every synthesis.
Suggested Citation
For attribution in academic contexts, please cite this work as:
Liao, S. (2025). How a transformer model loses attribution: a step-by-step example. OpenMercury Research.