ML//model//GPT//GPT-3//in-context learning

2020-10-15

The model adapts its behavior based on what's in the context window, without any weight updates.

The broader concept behind few-shot learning: give examples in the prompt -> model generalizes from them. It also covers implicit pattern matching with zero examples.
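A minimal sketch of the difference between a few-shot and a zero-shot prompt. The `build_prompt` helper and the `Input:`/`Output:` layout are assumptions made for illustration, not a format any particular model requires.

```python
def build_prompt(examples, query, instruction=""):
    """Format (input, output) pairs followed by an unanswered query.

    Hypothetical helper: the "Input:/Output:" layout is just one common convention.
    """
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The final pair is left open so the model completes it.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Few-shot: the task is never named, only demonstrated.
few_shot = build_prompt([("cat", "gato"), ("dog", "perro")], "house")

# Zero-shot: no demonstrations, only an instruction and the query.
zero_shot = build_prompt([], "house", instruction="Translate English to Spanish.")

print(few_shot)
print("---")
print(zero_shot)
```

In both cases the weights are untouched; only the text in the context window differs.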

Mechanistic explanation: induction heads, circuits of attention heads that find an earlier occurrence of the current token and copy the token that followed it. "The cat sleeps... The cat ____" -> "sleeps".
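The copy rule can be sketched as a plain lookup over a token list. This is a toy analogy for what an induction head computes, assuming exact token matches and whitespace tokenization; real heads do this softly through attention weights, and `induction_predict` is a hypothetical name.

```python
def induction_predict(tokens):
    """Predict the next token by finding the most recent earlier occurrence
    of the last token and copying whatever followed it.

    Returns None when the last token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = "The cat sleeps . The cat".split()
print(induction_predict(context))  # sleeps
```

Nothing here is learned: the "knowledge" that "sleeps" follows "cat" comes entirely from the context, which is the point of the analogy.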

During training, the emergence of induction heads shows up as a phase transition: an abrupt drop in loss as the model suddenly learns to follow context patterns.

Still not fully understood beyond induction heads. Is it gradient descent in the forward pass? More complex pattern matching? Likely multiple mechanisms.

Extended thinking pushes this furthest: the model's own generated reasoning tokens become part of the context, and in-context learning mechanisms process them to produce better final answers.
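A rough sketch of that feedback loop, assuming a generic autoregressive interface. `generate` and `toy_model` are hypothetical stand-ins: the stub replaces a real language model only to show that each emitted token is visible to every later step.

```python
from typing import Callable, List


def generate(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    max_steps: int = 50,
    stop: str = "<end>",
) -> List[str]:
    """Autoregressive loop: every generated token is appended to the context
    before the next one is produced."""
    context = list(prompt)
    for _ in range(max_steps):
        token = next_token(context)
        if token == stop:
            break
        context.append(token)  # the model's own output becomes its input
    return context[len(prompt):]


def toy_model(context: List[str]) -> str:
    """Hypothetical stub: the output depends only on what is already in context."""
    if "step: 3*4=12" not in context:
        return "step: 3*4=12"
    if "step: 2+12=14" not in context:
        return "step: 2+12=14"
    if "answer: 14" not in context:
        return "answer: 14"
    return "<end>"


print(generate(["question: 2+3*4"], toy_model))
```

The final answer is only emitted once the intermediate steps are in the context, which mirrors how reasoning tokens condition what comes after them.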

Why RAG, few-shot prompting, chain of thought, and prompt engineering work: induction heads and related circuits read the context and propagate its patterns to new tokens.