Notes on The Ultra-Scale Playbook

high-level overview

first steps: training on one gpu

memory usage in transformers
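To make the memory heading concrete: a rough helper (my own, not from the playbook) for the usual mixed-precision Adam accounting of about 16 bytes per parameter (bf16 weights and grads, fp32 master weights, two fp32 Adam moments). Activations come on top and depend on batch size, sequence length, and recomputation.

```python
def static_training_memory_gb(n_params: float) -> float:
    # bf16 params (2 B) + bf16 grads (2 B) + fp32 master copy (4 B)
    # + Adam first moment (4 B) + Adam second moment (4 B) = 16 B per parameter
    return 16 * n_params / 1e9

# e.g. an 8B-parameter model needs roughly 128 GB before counting activations
print(static_training_memory_gb(8e9))
```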

activation recomputation (checkpointing)

Step‑by‑step improvements
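A minimal sketch of activation recomputation using PyTorch's built-in torch.utils.checkpoint; the Block module is an illustrative feed-forward layer, not playbook code. The checkpointed region stores only its input and reruns the forward pass during backward, trading extra compute for activation memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # activations inside self.ff are not stored; they are recomputed in backward
        return x + checkpoint(self.ff, x, use_reentrant=False)
```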

gradient accumulation

Step‑by‑step improvements
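A minimal gradient-accumulation loop in plain PyTorch, assuming model, loss_fn, optimizer, and loader already exist; dividing each micro-batch loss by the accumulation count keeps the effective gradient an average rather than a sum.

```python
accum_steps = 8  # micro-batches per optimizer step

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average out
    loss.backward()                            # gradients keep adding up in param.grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```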

profiling gpu compute and communication

Step‑by‑step improvements
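A minimal torch.profiler sketch (model and batch assumed to exist): the table gives per-op CPU and CUDA time, and the exported Chrome trace shows compute and communication kernels on a timeline, which is where overlap gaps become visible.

```python
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(5):              # a few training-like iterations
        loss = model(batch).sum()
        loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto
```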

data parallelism (dp)

Step‑by‑step improvements

Key formulas and tips
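To make the DP section concrete: a minimal DistributedDataParallel sketch, assuming a placeholder MyModel and a launch via torchrun (which sets the RANK/LOCAL_RANK/WORLD_SIZE environment variables). DDP keeps a full model replica per GPU and all-reduces gradients during backward, overlapping communication with the remaining computation.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # torchrun supplies rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda()                       # MyModel is a placeholder module
model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced during backward

# train as usual; give each rank a different shard of the data,
# e.g. via torch.utils.data.distributed.DistributedSampler
```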

zero: zero redundancy optimizer (deepspeed/fsdp)

Step‑by‑step improvements

Analogy
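To make the ZeRO section concrete: a minimal sketch using PyTorch FSDP, one way to get ZeRO-style sharding (DeepSpeed configures its ZeRO stages differently). FULL_SHARD shards parameters, gradients, and optimizer state across data-parallel ranks, roughly ZeRO-3; SHARD_GRAD_OP is roughly ZeRO-2. MyModel is a placeholder.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")

model = FSDP(
    MyModel().cuda(),                                 # MyModel is a placeholder module
    sharding_strategy=ShardingStrategy.FULL_SHARD,    # shard params + grads + optimizer state
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # built on the sharded parameters
```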

tensor parallelism (tp)

Step‑by‑step improvements

Analogy
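To show the TP sharding pattern, a forward-only sketch of a Megatron-style column-parallel linear layer: each rank owns a slice of the output columns and an all-gather reassembles the full activation. Real TP implementations use autograd-aware collectives and usually defer the gather until a matching row-parallel layer; dist.all_gather here is not differentiable, so this is illustration only.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank owns out_features / world_size output columns."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        self.local = nn.Linear(in_features, out_features // world, bias=False)

    def forward(self, x):
        y_local = self.local(x)                       # [..., out / world]
        shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, y_local)              # collect every rank's columns
        return torch.cat(shards, dim=-1)              # [..., out]
```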

sequence parallelism (sp)

Step‑by‑step improvements

Analogy

context parallelism (cp)

discovering ring attention (and zig‑zag variants)

Step‑by‑step improvements

Analogy

pipeline parallelism (pp)

Step‑by‑step improvements

Analogy

expert parallelism (ep)

Step‑by‑step improvements

Analogy

5d parallelism in a nutshell (how parts fit together)

finding the best training configuration

step 1: fit a training step in memory

step 2: reach the target global batch size

step 3: optimize throughput

Decision checklist (quick rules)
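A tiny arithmetic check tying steps 1 to 3 together (illustrative numbers, not recommendations): the global batch size is micro-batch size times gradient-accumulation steps times the data-parallel degree, and multiplying by sequence length gives it in tokens.

```python
mbs, grad_acc, dp, seq_len = 2, 4, 128, 4096   # illustrative values

gbs_samples = mbs * grad_acc * dp              # 1024 sequences per optimizer step
gbs_tokens = gbs_samples * seq_len             # ~4.2M tokens per optimizer step
print(gbs_samples, gbs_tokens)
```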

benchmarking at scale — lessons

diving into gpus: fusing, threading, mixing

gpu primer (very short)

writing faster kernels without writing cuda
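A minimal Triton kernel (the standard vector-add example, not playbook code) to show the workflow: you write block-level Python, and Triton handles thread-level details such as coalesced loads and masking the tail block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)               # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```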

memory coalescing, shared memory, tiling, thread coarsening

fused kernels

flash attention (v1–v3)
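A minimal sketch via PyTorch's scaled_dot_product_attention, which can dispatch to a FlashAttention kernel when dtype, shapes, and device allow (the flash-attn package exposes the kernels directly if more control is needed); the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# [batch, heads, seq_len, head_dim]; half precision on GPU so the flash kernel is eligible
q = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# no full seq x seq attention matrix is materialized, so memory scales with seq_len
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```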

mixed precision training

Step‑by‑step improvements

Analogy
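To make the mixed-precision section concrete: a minimal fp16 loop with autocast and a gradient scaler (model, loss_fn, optimizer, and loader assumed). With bf16 the scaler is usually unnecessary because bf16 keeps the fp32 exponent range.

```python
import torch

scaler = torch.cuda.amp.GradScaler()               # needed for fp16, usually not for bf16

for x, y in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)                # matmuls in fp16, sensitive ops stay fp32
    scaler.scale(loss).backward()                  # scale the loss so small grads don't underflow
    scaler.step(optimizer)                         # unscales grads; skips the step on inf/nan
    scaler.update()
```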

conclusion

appendix highlights

a0: parallel programming crash course

a1: distributed training profiling

a2: typical scales

a3: math for overlap

references