ML//model//GPT//nanoGPT

Andrej Karpathy's minimal pre-training codebase for reproducing GPT-2 from scratch. ~600 lines of core training code — a research instrument, not a product.


nanoGPT (2023) → nanochat (2025): the successor added fp8 mixed precision, better data loading, and fused kernels.

nanochat trains GPT-2 (d12, 124M params) on a single 8×H100 node in ~2 hours. At that speed a full run fits many times into one day, which enables rapid ablation: 8–12 experiments per day.
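As a sanity check on the 124M figure, here is a minimal sketch that counts parameters for a GPT-2-small-shaped model. It assumes the classic GPT-2 layout (12 layers, width 768, vocabulary 50257, context 1024, learned position embeddings, biases everywhere, output head tied to the token embedding). The names `GPTConfig` and `count_params` are illustrative, and nanochat's own d12 architecture may differ in details, so its exact count need not match.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Assumed GPT-2 small hyperparameters (not read from any repo config).
    n_layer: int = 12
    n_embd: int = 768
    vocab_size: int = 50257
    block_size: int = 1024


def count_params(cfg: GPTConfig) -> int:
    d = cfg.n_embd
    tok_emb = cfg.vocab_size * d          # token embedding, shared with the output head
    pos_emb = cfg.block_size * d          # learned position embedding
    layer_norm = 2 * d                    # scale + bias
    attn = (d * 3 * d + 3 * d) + (d * d + d)        # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # up-projection + down-projection
    block = 2 * layer_norm + attn + mlp
    return tok_emb + pos_emb + cfg.n_layer * block + layer_norm  # final LayerNorm


total = count_params(GPTConfig())
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total comes to roughly 124M, with about 85M in the transformer blocks and most of the rest in the token embedding.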

Dataset quality dominates: NVIDIA ClimbMix beat FineWeb-edu, DCLM, and OLMo data out of the box. That raises a Goodhart's-law concern: is ClimbMix itself optimized for the benchmarks used to compare the datasets?

Karpathy tested AI agents (Claude Code, Codex) as automated researchers iterating on nanochat. Result: agents implement well-scoped ideas perfectly but can't design experiments — no ablation discipline, no baseline control, spurious findings (e.g. "discovered" that bigger networks have lower loss without controlling for FLOPs).
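The FLOPs confound can be made concrete with a small sketch. It uses the common rule of thumb that training compute is about 6 × parameters × tokens, which ignores attention FLOPs and is only an approximation. The function names and the token counts are hypothetical, chosen for illustration; this is not a description of how Karpathy's experiments were set up.

```python
def train_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens


def tokens_for_budget(flop_budget: int, n_params: int) -> int:
    """Tokens a model of n_params can see within a fixed FLOP budget."""
    return flop_budget // (6 * n_params)


# Baseline run: hypothetical numbers, for illustration only.
baseline_params = 124_000_000
baseline_tokens = 2_500_000_000
budget = train_flops(baseline_params, baseline_tokens)

# A model twice as large must see half the tokens to stay compute-matched.
candidate_params = 248_000_000
candidate_tokens = tokens_for_budget(budget, candidate_params)

print(f"budget:    {budget:.3e} FLOPs")
print(f"baseline:  {baseline_params:,} params x {baseline_tokens:,} tokens")
print(f"candidate: {candidate_params:,} params x {candidate_tokens:,} tokens")
```

Comparing the two runs at the same token count instead would hand the larger model twice the compute, so a lower loss there says nothing about whether the change is an improvement.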