ML//Inference//quantization

- Reduce weight precision: FP16 → INT8 → INT4.


Reduce weight precision: FP16 → INT8 → INT4.

Model gets 2-4× smaller, runs faster. Accuracy loss is surprisingly small.

GPTQ, AWQ, GGUF — different methods, same idea: most weights don't need 16 bits.

Why you can run a 7B model on a laptop with 8GB RAM.