Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

  • Uses two diffusion models:
    • one (Stage C) is responsible for the semantic content; it acts on a very small, highly compressed semantic latent space and takes the text conditioning as input
    • one (Stage B) is responsible for image reconstruction from the noised latent, which lives in another (larger) latent space
    • the output of Stage C is fed into Stage B as cross-attention conditioning; a minimal sketch follows this list
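
A minimal sketch of the two-stage idea, assuming PyTorch; the module names (StageC, StageB), the attention blocks, and the tensor shapes are illustrative placeholders rather than the paper's actual architecture. It only shows how the small semantic latent produced by the first model is consumed by the second through cross-attention:

```python
# Illustrative two-stage setup (assumed names/shapes, not the official Würstchen code).
import torch
import torch.nn as nn

class StageC(nn.Module):
    """Text-conditional diffusion prior acting on a tiny semantic latent."""
    def __init__(self, dim=1024, text_dim=1024, heads=8):
        super().__init__()
        self.to_kv = nn.Linear(text_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_semantic, text_tokens):
        # noisy_semantic: (B, N_small, dim)   flattened small semantic latent
        # text_tokens:    (B, T, text_dim)    frozen text-encoder embeddings
        kv = self.to_kv(text_tokens)
        h, _ = self.attn(noisy_semantic, kv, kv)      # cross-attention on the text
        return self.out(noisy_semantic + h)           # denoising prediction

class StageB(nn.Module):
    """Diffusion model reconstructing the (larger) image latent."""
    def __init__(self, dim=512, cond_dim=1024, heads=8):
        super().__init__()
        self.to_kv = nn.Linear(cond_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_image_latent, semantic_latent):
        # noisy_image_latent: (B, N_large, dim)       latent being reconstructed
        # semantic_latent:    (B, N_small, cond_dim)  output of Stage C
        kv = self.to_kv(semantic_latent)
        h, _ = self.attn(noisy_image_latent, kv, kv)  # cross-attention on Stage C's output
        return self.out(noisy_image_latent + h)

# Shape check only; the real stages are full U-Nets/transformers with timestep embeddings.
stage_c, stage_b = StageC(), StageB()
text_tokens = torch.randn(2, 77, 1024)
semantic = stage_c(torch.randn(2, 144, 1024), text_tokens)     # very small semantic space
image_latent = stage_b(torch.randn(2, 1024, 512), semantic)    # larger reconstruction space
print(semantic.shape, image_latent.shape)
```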

Protein Discovery with Discrete Walk-Jump Sampling

  • walk = sampling

    • As explained in Score-based Generative Models, one can learn an arbitrary distribution and sample from it by using the score function together with Langevin dynamics or another MCMC scheme.
  • jumping = denoising

    • Additionally, what the paper refers to as Neural Empirical Bayes (NEB) is essentially Tweedie’s formula, which gives the posterior mean of the clean sample given a noisy observation (for Gaussian, and more generally exponential-family, noise): correct the noisy sample by the score of the smoothed density scaled by the noise variance, x̂(y) = y + σ² ∇_y log p_σ(y).
  • The walk-jump sampling scheme samples noisy data with Langevin MCMC (the walk) and then obtains clean samples by applying the Tweedie correction (the jump); see the toy sketch after this list.

    • One network is responsible for learning the score function
    • A second network is trained as a denoiser and performs the jump directly, mapping the noisy sample back to a clean one; the smoothing noise level σ is a single fixed hyperparameter rather than a learned variance.
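
A self-contained toy sketch of the walk-jump loop (NumPy), assuming a 1-D Gaussian mixture whose smoothed score can be written down analytically so the example runs without a trained network; the mixture, constants, and function names are made up for illustration:

```python
# Toy walk-jump sampling on a 1-D Gaussian mixture. The smoothed score is
# analytic here; in the paper it comes from a model trained on noisy data.
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([-4.0, 4.0])     # clean data: two-component Gaussian mixture
tau = 0.5                       # per-component std of the clean data
w = np.array([0.5, 0.5])        # mixture weights
sigma = 1.5                     # smoothing noise level (single fixed hyperparameter)

def smoothed_score(y):
    """grad_y log p_sigma(y), the score of the mixture convolved with N(0, sigma^2)."""
    var = tau**2 + sigma**2
    log_resp = -0.5 * (y[:, None] - mus) ** 2 / var + np.log(w)
    resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp * (mus - y[:, None]) / var).sum(axis=1)

# Walk: unadjusted Langevin dynamics on the smoothed (noisy) density.
n, step = 2000, 0.05
y = rng.normal(0.0, 3.0, size=n)
for _ in range(500):
    y = y + step * smoothed_score(y) + np.sqrt(2 * step) * rng.normal(size=n)

# Jump: one-shot denoising via Tweedie's formula, x_hat = y + sigma^2 * score(y).
x_hat = y + sigma**2 * smoothed_score(y)

print("denoised cluster means (expect ~ -4 and +4):",
      x_hat[x_hat < 0].mean().round(2), x_hat[x_hat > 0].mean().round(2))
```

In the actual method the walk runs on a learned score/energy model over noisy discrete sequences and the jump uses the trained denoiser, but the two-step structure is the same as in this toy version.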

Unified Generative Modeling of 3D Molecules with Bayesian Flow Networks