ML//Training//distributed training

- Training across multiple GPUs/nodes.


Training across multiple GPUs/nodes.

Data parallelism: same model on every GPU, split the batch. Model parallelism: split the model across GPUs. Pipeline parallelism: split by layer.

Communication overhead is the bottleneck — NVLink and InfiniBand interconnects matter as much as the GPUs themselves.