LLM decoding is memory-bound. Diffusion models are compute-bound. (Why?)
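
One way to approach the "why" is arithmetic intensity: FLOPs performed per byte moved. Decoding pushes a single new token through each weight matrix, so every weight is read for roughly one multiply-add. A denoising step pushes all latent tokens through the same weights at once. The sketch below is a rough estimate only; the layer sizes, the 2-byte element size, and the `RIDGE_FLOPS_PER_BYTE` threshold are assumptions picked for illustration, not measurements of any particular GPU.

```python
def matmul_intensity(tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out  # one multiply + one add per term
    bytes_moved = bytes_per_elem * (
        tokens * d_in      # read activations
        + d_in * d_out     # read weights
        + tokens * d_out   # write outputs
    )
    return flops / bytes_moved


# Hypothetical ridge point of the hardware roofline (peak FLOP/s divided by
# peak bytes/s). The value is an assumption, not a spec of a real device.
RIDGE_FLOPS_PER_BYTE = 150.0


def is_memory_bound(intensity: float) -> bool:
    """Below the ridge point, memory bandwidth limits throughput."""
    return intensity < RIDGE_FLOPS_PER_BYTE


# LLM decoding: one new token per step through a 4096 x 4096 projection.
decode = matmul_intensity(tokens=1, d_in=4096, d_out=4096)

# Diffusion denoising: all 4096 latent tokens processed in the same step.
denoise = matmul_intensity(tokens=4096, d_in=4096, d_out=4096)

print(f"decode:  {decode:.2f} FLOPs/byte, memory-bound={is_memory_bound(decode)}")
print(f"denoise: {denoise:.2f} FLOPs/byte, memory-bound={is_memory_bound(denoise)}")
```

Under these assumptions decoding sits near 1 FLOP per byte, while the denoising step is above 1000, which matches the memory-bound vs. compute-bound split.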

Deep Compression:

  • deep compression VAE
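
A quick way to see what a deeper-compression autoencoder buys is to count the latent tokens the diffusion backbone has to process. This is a back-of-the-envelope sketch; the function name and the downsample factors (8 with 2x2 patches vs. 32 with no patching) are illustrative assumptions, not values taken from these notes.

```python
def latent_tokens(height: int, width: int, downsample: int, patch_size: int = 1) -> int:
    """Number of tokens after VAE downsampling followed by patchification."""
    stride = downsample * patch_size
    return (height // stride) * (width // stride)


# Assumed baseline: 8x spatial downsampling, then 2x2 patches.
baseline = latent_tokens(1024, 1024, downsample=8, patch_size=2)

# Assumed deep-compression setting: 32x spatial downsampling, no patching.
deep = latent_tokens(1024, 1024, downsample=32, patch_size=1)

token_ratio = baseline / deep
# Self-attention cost grows with the square of the token count.
attention_ratio = token_ratio ** 2

print(baseline, deep, token_ratio, attention_ratio)
```

Fewer tokens cuts the compute per denoising step directly, which is where a compute-bound model benefits.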

Distillation:

Quantization:

  • SVDQuant
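
The core idea behind SVDQuant, as I understand it, is to let a small high-precision low-rank branch absorb the outliers so that the remaining residual is easy to quantize to 4 bits. The sketch below is a toy, weight-only version under loud assumptions: it uses rank 1 via power iteration instead of a real SVD, per-tensor symmetric quantization, a hand-built 4x4 matrix, and it skips the activation-smoothing step entirely. All function names are made up for illustration.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_singular_triplet(w, iters=50):
    """Power iteration for the largest singular value and its vectors."""
    wt = [list(col) for col in zip(*w)]
    v = [1.0] * len(w[0])
    for _ in range(iters):
        v = matvec(wt, matvec(w, v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = matvec(w, v)
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v


def fake_quant_int4(m):
    """Symmetric per-tensor 4-bit quantization, returned already dequantized."""
    scale = (max(abs(x) for row in m for x in row) / 7.0) or 1.0
    return [[max(-8, min(7, round(x / scale))) * scale for x in row] for row in m]


def max_abs_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy weight: a large rank-1 "outlier" component plus small structured noise.
a, b = [4.0, -3.0, 2.0, 1.0], [2.0, 1.0, -2.0, 3.0]
W = [[a[i] * b[j] + 0.05 * ((i * 4 + j) % 5 - 2) for j in range(4)] for i in range(4)]

# Baseline: quantize W directly.
direct_err = max_abs_error(W, fake_quant_int4(W))

# Low-rank branch kept in full precision, residual quantized to 4 bits.
sigma, u, v = top_singular_triplet(W)
low_rank = [[sigma * u[i] * v[j] for j in range(4)] for i in range(4)]
residual = [[W[i][j] - low_rank[i][j] for j in range(4)] for i in range(4)]
residual_q = fake_quant_int4(residual)
approx = [[low_rank[i][j] + residual_q[i][j] for j in range(4)] for i in range(4)]
lowrank_err = max_abs_error(W, approx)

print(f"direct 4-bit error:      {direct_err:.4f}")
print(f"low-rank + 4-bit error:  {lowrank_err:.4f}")
```

On this toy matrix the residual has a much smaller dynamic range than `W`, so the same 4 bits give a far finer step size.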

Parallelism:

  • tensor parallelism
  • data parallelism
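
The two schemes split different things: tensor parallelism shards the weights of a layer, data parallelism shards the batch and replicates the weights. The pure-Python sketch below simulates both on lists to show they reproduce the single-device result. The "devices" are just loop iterations, the helper names are hypothetical, and the sizes are assumed to divide evenly.

```python
def matmul(x, w):
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def tensor_parallel(x, w, n_devices):
    """Each simulated device holds a column slice of w and sees the full batch."""
    step = len(w[0]) // n_devices  # assumes columns divide evenly
    shards = [[row[d * step:(d + 1) * step] for row in w] for d in range(n_devices)]
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for an all-gather: concatenate output slices along columns.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def data_parallel(x, w, n_devices):
    """Each simulated device holds all of w and sees a slice of the batch."""
    step = len(x) // n_devices  # assumes batch divides evenly
    out = []
    for d in range(n_devices):
        out.extend(matmul(x[d * step:(d + 1) * step], w))
    return out


x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # batch of 4, hidden size 2
w = [[1, 0, 2, -1], [0, 1, -2, 3]]     # 2 x 4 weight

reference = matmul(x, w)
tp_out = tensor_parallel(x, w, n_devices=2)
dp_out = data_parallel(x, w, n_devices=2)

print(reference == tp_out == dp_out)
```

Tensor parallelism lowers per-device weight memory and per-sample latency at the cost of communication inside every layer; data parallelism raises throughput but does not speed up a single sample.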