Abstract

  • Consistency models support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality
  • Can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether
  • They build on top of the probability flow (PF) ordinary differential equation (ODE) in continuous-time diffusion models, whose trajectories smoothly transition the data distribution into a tractable noise distribution. We propose to learn a model that maps any point at any time step to the trajectory’s starting point.
    • self-consistency property: Points on the same trajectory map to the same initial point
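
In symbols (following the consistency models paper, where trajectories run from a small time $\epsilon$ to $T$), the consistency function $f_\theta$ and the self-consistency property read:

```latex
% Consistency function: map any point on a PF-ODE trajectory to its start
f_\theta(x_t, t) \approx x_\epsilon \qquad \text{for all } t \in [\epsilon, T]
% Self-consistency: any two points on the same trajectory share one image
f_\theta(x_t, t) = f_\theta(x_{t'}, t') \qquad \text{for all } t, t' \in [\epsilon, T]
% anchored by the boundary condition
f_\theta(x_\epsilon, \epsilon) = x_\epsilon
```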

Main idea

  • A trained diffusion model, one way or another, estimates the score function $\nabla_x \log p_t(x)$ of the probability distribution
    • whether that’s done directly through score matching
    • or through a denoising objective, where in practice the network predicts the noise and the score is recovered as $\nabla_{x_t} \log p_t(x_t) \approx -\epsilon_\theta(x_t, t) / \sigma_t$
      • where $\epsilon \sim \mathcal{N}(0, I)$ is the source noise, i.e. $x_t = \alpha_t x_0 + \sigma_t \epsilon$
  • As soon as you have the score function, assuming it’s a Gaussian diffusion model, you can sample trajectories from pure noise $x_T \sim \mathcal{N}(0, I)$ to new “clean” samples $x_0$, using an ODE solver (e.g. Euler) on the PF ODE; see the first sketch after this list.
  • Then you can enforce the objective that any sample $x_t$ on a trajectory should be mapped back to its starting point $x_0$, i.e. $f_\theta(x_t, t) \approx x_0$ for all $t$; see the second sketch after this list.
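
A minimal sketch of that pipeline, assuming a variance-exploding-style parameterization $x_t = x_0 + \sigma(t)\,\epsilon$ with $\sigma(t) = t$ and a hypothetical noise-prediction network `eps_model(x, t)`; the function names and the schedule are illustrative, not from the paper:

```python
import torch

def score_from_eps(eps_model, x_t, t, sigma_t):
    # For a Gaussian diffusion model, noise prediction and score are
    # related by: score(x_t, t) = -eps_theta(x_t, t) / sigma_t.
    return -eps_model(x_t, t) / sigma_t

@torch.no_grad()
def euler_pf_ode_sample(eps_model, shape, n_steps=64, device="cpu"):
    """Integrate the probability-flow ODE from t=1 (noise) to t=0 (data)
    with the Euler method, under sigma(t) = t."""
    x = torch.randn(shape, device=device)    # pure noise x_T
    ts = torch.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        sigma = t                            # sigma(t) = t, so sigma'(t) = 1
        score = score_from_eps(eps_model, x, t, sigma)
        dx_dt = -sigma * score               # PF ODE: dx/dt = -sigma(t) sigma'(t) score
        x = x + (t_next - t) * dx_dt         # Euler step toward t = 0
    return x                                 # approximate clean sample x_0
```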
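
And a sketch of the training signal implied by the last bullet, written in the distillation setting: the model’s outputs at two adjacent points of the same PF-ODE trajectory are pulled together, using an EMA copy as the frozen target. The callables `f_theta`, `f_ema`, and `solver_step` (one Euler step of the PF ODE, as above) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def consistency_distillation_loss(f_theta, f_ema, solver_step, x0, t, t_prev):
    """One term of a consistency-distillation-style loss.

    f_theta:     consistency model, maps (x_t, t) -> estimate of the start x_0
    f_ema:       EMA copy of f_theta, used as a frozen target
    solver_step: one ODE-solver step along the trajectory, (x_t, t, t_prev) -> x_{t_prev}
    """
    eps = torch.randn_like(x0)
    x_t = x0 + t * eps                        # trajectory point at time t (sigma(t) = t)
    with torch.no_grad():
        x_prev = solver_step(x_t, t, t_prev)  # one step back along the same trajectory
        target = f_ema(x_prev, t_prev)        # both points should map to the same x_0
    return F.mse_loss(f_theta(x_t, t), target)
```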

Shortcut models

  • Building on top of the flow matching training objective (with noise $x_0$, data $x_1$, and the linear path $x_t = (1 - t)\,x_0 + t\,x_1$), one can define a shortcut model $s_\theta(x_t, t, d)$ that predicts the jump to take, so that $x_{t+d} = x_t + d\, s_\theta(x_t, t, d)$

  • Shortcut models condition the neural network not only on the signal level $t$ but also on the requested step size $d$.

  • This allows them to choose the step size at inference time and generate data points using only a few sampling steps and forward passes of the neural network.

  • For the finest step size $d \to 0$, shortcut models are trained using the flow matching loss $\lVert s_\theta(x_t, t, 0) - (x_1 - x_0) \rVert^2$. For larger step sizes, shortcut models are trained using a bootstrap loss that distills two steps of size $d$ into one step of size $2d$, with the target defined as the average of the two predicted velocities, where $\operatorname{sg}(\cdot)$ stops the gradient:

    $\lVert s_\theta(x_t, t, 2d) - \operatorname{sg}\big(\tfrac{1}{2}(s_\theta(x_t, t, d) + s_\theta(x_{t+d}', t + d, d))\big) \rVert^2, \qquad x_{t+d}' = x_t + d\, s_\theta(x_t, t, d)$
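
A sketch of both loss terms, under the linear-path convention above; the model `s(x, t, d)` is a hypothetical callable, and for brevity both terms share one draw of $(x_t, t, d)$ (the paper splits each batch between the two terms). The sampling of $d$ and $t$ follows the scheme in the next two bullets:

```python
import math
import random

import torch
import torch.nn.functional as F

def shortcut_loss(s, x1, M=128):
    """Combined flow-matching + bootstrap loss for a shortcut model s(x, t, d).

    s:  network mapping (x_t, signal level t, step size d) -> velocity
    x1: batch of data samples; the source noise x0 is drawn fresh
    M:  maximum number of sampling steps, so the finest step size is 1/M
    """
    x0 = torch.randn_like(x1)                    # source noise
    # Step size: a power of two, d in {1/M, ..., 1/2}, so the bootstrapped
    # step 2d stays within {2/M, ..., 1}.
    d = 2 ** random.randrange(int(math.log2(M))) / M
    # Signal level: uniform over the grid reached by the bootstrapped step,
    # t in {0, 2d, 4d, ..., 1 - 2d}.
    t = random.randrange(int(1 / (2 * d))) * 2 * d
    x_t = (1 - t) * x0 + t * x1                  # linear interpolation path

    # (1) Flow matching at the finest step size (d -> 0): the target
    #     velocity along the path is x1 - x0.
    fm_loss = F.mse_loss(s(x_t, t, 0.0), x1 - x0)

    # (2) Bootstrap: one step of size 2d should match the average of two
    #     consecutive steps of size d; the gradient is stopped on the target.
    with torch.no_grad():
        v1 = s(x_t, t, d)
        v2 = s(x_t + d * v1, t + d, d)           # second step, from the midpoint
        target = (v1 + v2) / 2
    boot_loss = F.mse_loss(s(x_t, t, 2 * d), target)
    return fm_loss + boot_loss
```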

  • The step size $d$ is sampled uniformly as a power of two, based on the maximum number of sampling steps $M$, which defines the finest step size $1/M$.
  • The signal level $t$ is sampled uniformly over the grid that is reached by the current step size: $t \in \{0, d, 2d, \dots, 1 - d\}$.
  • At inference time, one can condition the model on a step size $d = 1/n$ to target $n$ sampling steps, without suffering from discretization error, because the model has learned to predict the end point of each step; see the sketch below.
  • In practice, shortcut models generate high-quality samples with 2 or 4 sampling steps, compared to 64 or more steps for typical diffusion models.
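
A sketch of that inference loop, reusing the hypothetical `s(x, t, d)` from above:

```python
import torch

@torch.no_grad()
def shortcut_sample(s, shape, n_steps=4, device="cpu"):
    """Generate samples with a shortcut model in n_steps forward passes."""
    d = 1.0 / n_steps                      # condition directly on the step size
    x = torch.randn(shape, device=device)  # start from pure noise x_0
    t = 0.0
    for _ in range(n_steps):
        x = x + d * s(x, t, d)             # one learned shortcut step of size d
        t += d
    return x                               # approximate data sample x_1
```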