ML//model//GPT//nanoGPT

2026-03-08

Andrej Karpathy's minimal pre-training codebase for reproducing GPT-2 from scratch. The core training code is about 600 lines: a research instrument, not a product.

nanoGPT (2023) → nanochat (2025): added fp8 mixed precision, better data loading, fused kernels.

Trains GPT-2 (d12, 124M params) on a single 8×H100 node in ~2 hours. The speed enables rapid ablation: 8–12 experiments per day.
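The 8–12 range is consistent with simple arithmetic on a single node. A quick sketch, assuming runs execute back to back and that per-run overhead (setup, evaluation, analysis) falls somewhere between zero and one hour. The overhead figure and the function name are illustrative assumptions, not something taken from the codebase.

```python
import math


def runs_per_day(run_hours: float, overhead_hours: float = 0.0) -> int:
    """Number of sequential runs that fit into 24 hours on one node."""
    return math.floor(24 / (run_hours + overhead_hours))


# ~2 h per run with no gap between runs (upper end of the range).
print(runs_per_day(2.0))

# ~2 h per run plus a hypothetical 1 h of overhead per run (lower end).
print(runs_per_day(2.0, overhead_hours=1.0))
```

With these assumed numbers the two calls give 12 and 8 runs per day.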

Dataset quality dominates: NVIDIA ClimbMix beat FineWeb-edu, DCLM, and OLMo data out of the box. That raises a Goodhart's curse concern: is ClimbMix optimized for the benchmarks?

Karpathy tested AI agents (Claude Code, Codex) as automated researchers iterating on nanochat. Result: agents implement well-scoped ideas perfectly but can't design experiments. They show no ablation discipline and no baseline control, and they report spurious findings (e.g. "discovered" that bigger networks have lower loss without controlling for FLOPs).
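To make "controlling for FLOPs" concrete, here is a minimal sketch of a compute-matched comparison. It assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens, which ignores attention FLOPs. The function name, the token count, and the model sizes are illustrative and do not come from nanochat.

```python
def tokens_for_budget(n_params: int, flop_budget: float) -> int:
    """Tokens to train on so that total compute is about flop_budget.

    Assumes C ~= 6 * N * D (forward plus backward pass); this is an
    approximation, not an exact FLOP count.
    """
    return int(flop_budget / (6 * n_params))


# Hypothetical budget: what a 124M-parameter baseline spends on 2.5B tokens.
baseline_params = 124_000_000
baseline_tokens = 2_500_000_000
budget = 6 * baseline_params * baseline_tokens

# Larger models get proportionally fewer tokens under the same budget,
# so a lower loss can no longer be explained by extra compute alone.
for n_params in (124_000_000, 350_000_000, 774_000_000):
    tokens = tokens_for_budget(n_params, budget)
    print(f"{n_params / 1e6:.0f}M params -> {tokens / 1e9:.2f}B tokens")
```

Comparing losses only between runs that share the same budget is one simple form of the baseline control the agents skipped.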