FP16 training needs a loss-scaling factor: without it, small gradients fall below FP16's representable range (normal min ~6e-5, subnormal min ~6e-8) and flush to zero
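A rough numpy sketch of the loss-scaling idea (the `fp16_step` helper and the starting scale 2^15 are illustrative, not any framework's API): multiply gradients up before the FP16 cast so they don't underflow, unscale in FP32, and back off the scale when an overflow appears.

```python
import numpy as np

def fp16_step(grads_fp32, scale=2.0**15):
    # Scale up before casting to FP16 so tiny gradients don't flush to zero,
    # then unscale back in FP32 for the optimizer update.
    scaled = (grads_fp32 * scale).astype(np.float16)
    if not np.all(np.isfinite(scaled)):
        # Overflow: skip this step and halve the scale (dynamic loss scaling).
        return None, scale / 2
    return scaled.astype(np.float32) / scale, scale

g = np.array([1e-8, 3e-6], dtype=np.float32)
naive = g.astype(np.float16)        # 1e-8 underflows: rounds to 0.0 in FP16
unscaled, scale = fp16_step(g)      # survives the round-trip with scaling
```

Real implementations (e.g. mixed-precision trainers) also grow the scale back up after a run of overflow-free steps; this sketch only shows the underflow/overflow mechanics.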
BF16 training
FP8 training??
FP4 training?
- MXFP4:
- 32-element blocks; shared E8M0 (power-of-two) scale per block; FP4 elements are E2M1
- NVFP4:
- 16-element blocks; shared FP8 (E4M3) scale per block; FP4 elements are E2M1
Non-determinism in LLM inference: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Defeating the Training-Inference Mismatch via FP16 https://arxiv.org/pdf/2510.26788
SGLang FP8 training for RL: https://lmsys.org/blog/2025-11-25-fp8-rl/