Introduction

Mathematical formulation of training:

  • Activation: the forward pass computes each layer's activations from the previous layer's output.
  • Gradient: backpropagation applies the chain rule to the loss to obtain the gradient with respect to every parameter.
  • Optimizer: an update rule (e.g. SGD or Adam) uses the gradient to adjust the parameters.
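A minimal worked form of one training step, in generic notation (the symbols below are standard textbook conventions, not taken from [1] or [2]):

\begin{aligned}
a_0 &= x, \qquad a_l = \sigma(W_l a_{l-1} + b_l) && \text{(activations, forward pass)} \\
g &= \nabla_\theta \, L(a_L, y) && \text{(gradient, via backpropagation)} \\
\theta &\leftarrow \theta - \eta \, g && \text{(optimizer step, here plain SGD with learning rate } \eta\text{)}
\end{aligned}

where \theta = \{W_l, b_l\} are the parameters, \sigma an elementwise nonlinearity, and L the loss.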

Parallelism in training (see [1], [2]):

  • Data parallelism: replicate the model on every device, shard the batch, and average the resulting gradients (a sketch follows this list).
  • Model parallelism: shard the model itself across devices; often used as an umbrella term for tensor and pipeline parallelism.
  • Pipeline parallelism: split the layers into sequential stages on different devices and stream micro-batches through them.
  • Tensor parallelism: split individual weight matrices across devices, so each device computes a slice of every layer.
  • Context parallelism: shard the sequence (context) dimension across devices, used for long-context attention.
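A minimal NumPy sketch of the data-parallel update, the simplest of the schemes above, assuming a linear model with a squared-error loss. The devices are simulated serially, and the names here (grad, data_parallel_step) are illustrative, not from [1] or [2]:

import numpy as np

def grad(w, X, y):
    # Gradient of mean squared error for the linear model y_hat = X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, X, y, n_devices, lr=0.1):
    # Data parallelism: replicate w, shard the batch across devices.
    X_shards = np.array_split(X, n_devices)
    y_shards = np.array_split(y, n_devices)
    # Each "device" computes the gradient on its own shard (simulated serially here).
    local_grads = [grad(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # All-reduce: average the shard gradients so every replica applies the same update.
    g = np.mean(local_grads, axis=0)
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
w = np.zeros(4)
for _ in range(200):
    w = data_parallel_step(w, X, y, n_devices=4)
print(np.round(w, 3))  # approaches w_true

The key design point: averaging the per-shard gradients (the all-reduce step) reproduces the full-batch gradient exactly when the shards are equal-sized, so every replica stays in sync. Tensor parallelism would instead split the components of w across devices and combine their partial outputs.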

References

[1] How to Scale Your Model: https://jax-ml.github.io/scaling-book
[2] The Ultra-Scale Playbook: Training LLMs on GPU Clusters: https://huggingface.co/spaces/nanotron/ultrascale-playbook