Introduction
Mathematical formulation of training: TBD (a sketch of the standard step follows the list below)
- Activation
- Gradient
- Optimizer
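
A minimal sketch of the step these three bullets refer to, in standard notation (an assumption of mine, not taken verbatim from [1] or [2], which develop this in full): the forward pass produces activations, backpropagation produces gradients, and the optimizer consumes gradients to update the parameters.

```latex
% Forward pass: activations, layer by layer.
a^{(0)} = x, \qquad a^{(l)} = f_l\!\left(W^{(l)} a^{(l-1)}\right), \quad l = 1, \dots, L
% Loss over a batch of B examples.
\mathcal{L}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \ell\!\left(a^{(L)}_i,\, y_i\right)
% Backward pass: gradient of the loss w.r.t. all parameters (backpropagation).
g_t = \nabla_{\theta}\, \mathcal{L}(\theta_t)
% Optimizer step, e.g. plain SGD with learning rate \eta.
\theta_{t+1} = \theta_t - \eta\, g_t
```

Adaptive optimizers such as Adam replace the last line with updates built from running moments of g_t, but the activation / gradient / update structure is unchanged.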
Parallelism in training (data- and tensor-parallel sketches follow this list):
- Data parallelism
- Model parallelism
  - Pipeline parallelism
  - Tensor parallelism
- Context parallelism
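
To make the first strategy concrete, here is a minimal data-parallel training step in JAX (chosen to match [1]; the toy model, loss function, and learning rate are hypothetical, not taken from either source). Every device holds a full replica of the parameters, computes gradients on its own shard of the global batch, and the gradients are averaged across devices before the update so the replicas stay identical.

```python
# Minimal data-parallelism sketch in JAX. Assumptions: a toy linear model,
# MSE loss, plain SGD; `loss_fn` and `train_step` are hypothetical names.
from functools import partial

import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model with a mean-squared-error loss.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="batch")  # one replica per local device
def train_step(params, x, y):
    # Each device computes gradients on its local shard of the batch.
    grads = jax.grad(loss_fn)(params, x, y)
    # All-reduce: average gradients across devices so replicas stay in sync.
    grads = jax.lax.pmean(grads, axis_name="batch")
    # Plain SGD update with a hypothetical learning rate.
    return jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)

n_dev = jax.local_device_count()
params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
# Replicate the parameters onto every device.
params = jax.device_put_replicated(params, jax.local_devices())
# Shard the global batch: the leading axis indexes devices.
x = jnp.ones((n_dev, 8, 4))
y = jnp.ones((n_dev, 8, 1))
params = train_step(params, x, y)
```

Newer JAX code often expresses this with `jit` plus explicit sharding instead of `pmap`, but the cross-device gradient average (`pmean` here) is the core mechanism of data parallelism either way.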
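
Tensor parallelism, by contrast, shards individual weight matrices rather than the batch. A toy sketch under the same hedges (hypothetical names, a single matmul, not the implementation from [1] or [2]): each device owns one column block of W, computes its slice of the output, and an all-gather reassembles the full result.

```python
# Toy tensor-parallelism sketch in JAX: W is split column-wise across
# devices; each device computes a slice of x @ W, and the slices are
# concatenated with an all-gather. Names here are hypothetical.
from functools import partial

import jax
import jax.numpy as jnp

@partial(jax.pmap, axis_name="tp", in_axes=(0, None))
def sharded_matmul(w_shard, x):
    # Each device multiplies the full input by its column shard of W.
    out_shard = x @ w_shard
    # Concatenate the output slices along the feature axis (tiled=True).
    return jax.lax.all_gather(out_shard, axis_name="tp", axis=1, tiled=True)

n_dev = jax.local_device_count()
x = jnp.ones((8, 4))
w = jnp.arange(4 * 2 * n_dev, dtype=jnp.float32).reshape(4, 2 * n_dev)
# Split W into n_dev column blocks, one per device.
w_shards = jnp.stack(jnp.split(w, n_dev, axis=1))
out = sharded_matmul(w_shards, x)  # each device holds the full (8, 2*n_dev) result
```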
References
[1] How to Scale Your Model. https://jax-ml.github.io/scaling-book
[2] The Ultra-Scale Playbook: Training LLMs on GPU Clusters. https://huggingface.co/spaces/nanotron/ultrascale-playbook