Machine Learning / 2020 / arXiv

Denoising Diffusion Probabilistic Models

Ho, Jain, Abbeel

Asking a network to draw a face from nothing is asking for a miracle. Asking it to remove a little noise from a slightly noisy face is just regression. DDPM is what happens when you turn the first problem into a thousand copies of the second.

Generative modeling has a credibility problem when you state it honestly. You want a network that takes a random vector and outputs a photograph of a Labrador. The space of pixel arrays is astronomically large, almost all of it is noise, and the thin manifold of plausible Labrador photos is some incomprehensibly twisted shape inside it. Asking one forward pass to land on that manifold is asking for an oracle. Two thousand years of mathematics tells you that locating an exponentially small set in a high-dimensional space is hard. One thousand years of mathematics tells you that picking up the pieces of a smooth signal that has been corrupted by a little Gaussian noise is easy.

DDPM is the paper that took that observation seriously. Generation is hard. Denoising is easy. So decompose generation into a sequence of denoising steps, each one a problem you already know how to solve with a regression network and an MSE loss.

GANs tried to brute-force the one-shot version with adversarial training and produced beautiful samples and miserable training dynamics. VAEs tried to make the geometry tractable and produced stable training and blurry samples. Normalizing flows tried to make the likelihood exact and ended up architecturally hamstrung. All three were trying to learn the whole jump from noise to data in one step, or one tightly coupled encoder-decoder pair, or one invertible map. DDPM gave that up and got something that worked the first time you trained it, on every dataset, without tuning a discriminator schedule, on hardware that already existed.

DDPM's move is almost embarrassingly simple in hindsight. Don't try to jump. Decompose the jump into a thousand tiny denoising steps, each one easy enough that a regular network can solve it with mean-squared error. The network never has to imagine a Labrador from scratch. It only ever has to look at something that is almost-a-Labrador-plus-a-little-Gaussian-noise and guess what the noise was. Train the same network on a thousand different noise levels, then sample by removing one shell of noise at a time.

The result of that move was the most consequential generative-modeling paper of the decade. Stable Diffusion, Imagen, DALL-E 2, Midjourney, Sora, AlphaFold 3, RFdiffusion: all of them are this paper, plus engineering. If you understand exactly what Ho, Jain, and Abbeel did and why each piece is there, you understand the bones of every modern image, video, audio, and structure-prediction model worth talking about.

The forward process is free

Start with a real image x_0. Define a fixed schedule of small variances β_1, β_2, ..., β_T (typical T is 1000, typical β values rise linearly from about 1e-4 to 2e-2). At each step t, you do x_t = sqrt(1 - β_t) * x_{t-1} + sqrt(β_t) * ε with ε drawn from a standard Gaussian. This is the forward process. After enough steps, x_T is essentially pure noise — all the original signal has been washed out, and the marginal q(x_T) is approximately N(0, I).

Three things are worth noticing here. First, this forward process has no learned parameters. It's a fixed corruption process you apply to data. You are not training anything when you noise an image; you are just adding noise. Second — this is the trick that makes training work — because each step is a linear Gaussian, you can collapse the whole chain analytically. There's a closed form for x_t given x_0 directly: x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε, where ᾱ_t is the cumulative product of the (1 - β) terms. Third, the marginal q(x_t | x_0) is itself a Gaussian, which makes everything in the variational analysis tractable.

So at training time you don't have to simulate a thousand steps. You sample a clean image, sample a random t uniformly from {1, ..., T}, sample a single ε ~ N(0, I), and compute x_t in one shot. That's the data point you train on. One forward pass. One backward pass. No autoregressive unrolling, no adversarial inner loop, no discriminator to balance. The training step is the same shape as a regression training step on any other dataset.
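A minimal NumPy sketch of that one-shot sampling step, assuming the linear schedule quoted above (1e-4 to 2e-2, T = 1000). The function names `linear_beta_schedule` and `q_sample` and the random 8×8 stand-in batch are illustrative choices, not names from the paper.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear variances beta_1..beta_T and their cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot. t holds 1-based timesteps, one per batch row."""
    eps = rng.standard_normal(x0.shape)
    # Reshape alpha_bar_t so it broadcasts over the non-batch axes.
    a = alpha_bars[t - 1].reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alpha_bars = linear_beta_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))   # stand-in batch of 8x8 "images"
t = rng.integers(1, 1001, size=4)             # uniform over {1, ..., T}
x_t, eps = q_sample(x0, t, alpha_bars, rng)   # (x_t, t) is the input, eps the target
```

The pair `(x_t, t)` is what the network sees and `eps` is the regression target, so the cost of building a training example does not depend on t.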

It is worth pausing on how unusual this is. In a GAN, training a generator requires a discriminator that itself needs training, and the two are locked in a non-stationary game. In a normalizing flow, the architecture is constrained to be invertible, which costs you expressiveness. In a VAE, you have to balance a reconstruction term and a KL term and pick a prior. In DDPM, you have a single network, a single MSE loss, and Adam. Anyone who has trained an image classifier can train a DDPM.

forward process q(x_t | x_0)

A fixed schedule erases the image

The forward process has no learnable parameters. Slide t and watch x_t = √ᾱ_t x_0 + √(1−ᾱ_t) ε carry an 8×8 pattern into Gaussian static. Below: the per-step variance β_t and the signal-to-noise ratio that drops by ~five orders of magnitude across the chain.

[Figure panels: x_t at the selected t; β_t (per-step variance); SNR(t) = ᾱ_t / (1 − ᾱ_t) on a log scale; horizontal axis t.]

Notice how SNR collapses long before the image looks like noise to your eye. By t≈40 the signal is already a thousand times weaker than the noise. The rest of the chain is the network learning to recover the last few bits of structure from what looks, to us, like pure static. Reference: β_t rises linearly, SNR falls geometrically.

The fixed forward process. Slide t and watch the closed-form q(x_t | x_0) grind a small grayscale pattern into Gaussian static while β_t and the signal-to-noise ratio sketch the schedule that drives it.

Why a closed-form marginal matters

If you couldn't write q(x_t | x_0) in closed form, you would have to simulate the entire forward chain to construct training data, and the cost of training would scale with T. With the closed form, the cost is constant in T. Doubling the number of timesteps in your schedule does nothing to wall-clock training time. This decoupling — T as a hyperparameter for sampling, not for training — is one of the unsung reasons diffusion models scale so well.

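It also means the schedule is a cheap knob to turn. Want a smoother SNR curve? Switch from linear β to cosine β (the trick from improved DDPM). That changes which noise levels the network sees during training, so in practice it means retraining or fine-tuning, but nothing else in the recipe has to change. Want fewer effective steps? Use a sub-sampled schedule at inference, which requires no retraining at all. The forward process is free, and free things are flexible.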
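To make the linear-versus-cosine comparison concrete, here is a small sketch that builds both ᾱ_t curves and their log-SNR. The cosine form, with offset s = 0.008 and β clipped at 0.999, is written from memory of the improved-DDPM paper, so treat those constants as assumptions. The function names are illustrative.

```python
import numpy as np

def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=2e-2):
    """alpha_bar_t for the linear beta schedule."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))

def cosine_alpha_bars(T=1000, s=0.008, max_beta=0.999):
    """alpha_bar_t for a cosine schedule: alpha_bar follows a squared cosine in t/T."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    betas = np.clip(1.0 - f[1:] / f[:-1], 0.0, max_beta)  # clip avoids beta_T = 1
    return np.cumprod(1.0 - betas)

def log10_snr(alpha_bars):
    """log10 of SNR(t) = alpha_bar_t / (1 - alpha_bar_t)."""
    return np.log10(alpha_bars / (1.0 - alpha_bars))

lin, cos = linear_alpha_bars(), cosine_alpha_bars()
for t in (1, 250, 500, 750, 1000):
    print(t, round(log10_snr(lin)[t - 1], 2), round(log10_snr(cos)[t - 1], 2))
```

Under these assumptions the cosine curve keeps more signal through the middle of the chain than the linear one.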

What is the network actually trying to predict?

Here's where DDPM made a small choice that turned out to matter a lot. The reverse process, in principle, asks the network to take x_t and the timestep t and predict the distribution of x_{t-1} — the slightly less noisy version. Because the forward process was Gaussian, the true reverse posterior q(x_{t-1} | x_t, x_0) is also Gaussian, and you can write down its mean and variance explicitly. So the network's job is to predict that Gaussian's parameters.

There are three natural parameterizations. You could have the network predict the mean of p(x_{t-1} | x_t) directly. You could have it predict x_0 (the fully clean image, with the network's job being "un-noise this all the way"). Or you could have it predict ε — the specific noise vector that was added. All three are mathematically equivalent under the linear Gaussian forward process. You can convert between them with one line of algebra.

Empirically they are not equivalent. Predicting the mean directly is fine but ties the network's output to the noise schedule. Predicting x_0 is fine but is hard at high noise — the network is being asked to hallucinate the entire image from almost-pure-static, which is the original problem you were trying to avoid. Predicting ε turns out to be the sweet spot: it gives the network the same shape of target at every t (a unit Gaussian sample), and it produces a loss that simplifies dramatically.
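The "one line of algebra" between the three parameterizations can be checked numerically. This sketch assumes the linear schedule from earlier and uses the standard Gaussian posterior mean of q(x_{t-1} | x_t, x_0). The function names are illustrative, and `i` is a 0-based index for timestep t = i + 1.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])   # alpha_bar_0 = 1

def x0_from_eps(x_t, eps, i):
    return (x_t - np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alpha_bars[i])

def eps_from_x0(x_t, x0, i):
    return (x_t - np.sqrt(alpha_bars[i]) * x0) / np.sqrt(1.0 - alpha_bars[i])

def mean_from_x0(x_t, x0, i):
    """Mean of q(x_{t-1} | x_t, x_0), written in terms of x_0."""
    c0 = np.sqrt(alpha_bars_prev[i]) * betas[i] / (1.0 - alpha_bars[i])
    ct = np.sqrt(alphas[i]) * (1.0 - alpha_bars_prev[i]) / (1.0 - alpha_bars[i])
    return c0 * x0 + ct * x_t

def mean_from_eps(x_t, eps, i):
    """The same mean, written in terms of the noise eps."""
    return (x_t - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])

rng = np.random.default_rng(0)
i = 499                                   # timestep t = 500
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
x_t = np.sqrt(alpha_bars[i]) * x0 + np.sqrt(1.0 - alpha_bars[i]) * eps
print(np.allclose(mean_from_x0(x_t, x0, i), mean_from_eps(x_t, eps, i)))
```

Any one of the three quantities, together with x_t and t, determines the other two, which is why the choice is about optimization behavior and not about expressiveness.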

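When you push the variational lower bound through the math under ε-parameterization, each per-timestep KL term reduces to a weighted squared error between the true noise and the predicted noise. Drop the weight and you are left with:

L_simple = E_{t, x_0, ε} [ ‖ε − ε_θ(√ᾱ_t x_0 + √(1−ᾱ_t) ε, t)‖² ]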

The original variational bound has KL divergences and Gaussian log-likelihoods all over it. The Ho et al. derivation walks through them and shows that if you drop the time-dependent reweighting (the L_simple form), training is more stable and sample quality goes up. The objective collapses to plain MSE on noise, with timestep just used as a conditioning input to the network. This is the kind of result that, when you derive it, you stare at the page for a minute and wonder if you made an algebra mistake. You didn't. The structure of the forward process is just that nice.
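As a sketch of what "plain MSE on noise" looks like in code, the function below estimates the objective for any callable `eps_model(x_t, t)`. The callable interface, the per-element averaging (which differs from a summed squared norm only by a constant factor), and the all-zeros placeholder model are assumptions for illustration. A real setup would plug in a trained network and an optimizer.

```python
import numpy as np

def l_simple(eps_model, x0, alpha_bars, rng):
    """Monte Carlo estimate of E[(eps - eps_model(x_t, t))^2], averaged per element."""
    batch = x0.shape[0]
    t = rng.integers(1, len(alpha_bars) + 1, size=batch)      # uniform timestep per example
    eps = rng.standard_normal(x0.shape)                       # the regression target
    a = alpha_bars[t - 1].reshape(batch, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps            # closed-form forward sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4096, 16))                  # stand-in data
zero_model = lambda x_t, t: np.zeros_like(x_t)                # placeholder, not a trained net
baseline = l_simple(zero_model, x0, alpha_bars, rng)
print(baseline)
```

A model that always predicts zero scores about 1 per element, because ε has unit variance. That gives a sanity baseline: a network that is learning anything should push the loss below it.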

It's also worth noting how unusual a clean training signal this is. In language modeling, the loss is cross-entropy on next-token prediction — every batch contributes a clean, low-variance gradient. In image classification, the loss is cross-entropy on a single label — same story. Most generative models do not have this property: GANs have noisy adversarial gradients, VAEs have a posterior approximation introducing bias, autoregressive image models have to deal with mode imbalance across pixels. DDPM gives you a clean per-sample MSE gradient, like fitting a regression. That is a large part of why it trains so reliably.

Why predicting noise is secretly the score function

There's a deeper reason ε-prediction is the right move, and it's worth a section because it tells you where the entire field went next. The score function of a distribution is ∇_x log p(x) — the gradient of the log-density at a point, the direction in input space in which the data is more probable. If you knew the score, you could sample from the distribution by Langevin dynamics: take a noisy point, walk uphill on the score, add a little noise to avoid getting stuck in modes, repeat.

The score is normally impossible to estimate directly, because log p(x) is intractable for any interesting distribution. Score matching (Hyvärinen 2005) gave a way to fit a score model without ever evaluating the density, but it had practical issues: scores estimated near low-density regions are unreliable, and Langevin dynamics has trouble mixing between disconnected modes. Song & Ermon's 2019 NCSN paper fixed this by training a score network on multiple noise levels — a smoothed family of distributions, where the high-noise versions connect the modes.

The connection to DDPM is exact. Under the DDPM forward process, the score of the conditional q(x_t | x_0) is -ε / sqrt(1 - ᾱ_t), and the score of the noised marginal is its posterior average: ∇_x log q_t(x_t) = -E[ε | x_t] / sqrt(1 - ᾱ_t). The MSE-optimal noise prediction is exactly E[ε | x_t], so predicting noise is predicting the score, up to a known scaling factor. When DDPM trains a network to estimate ε, it is training a score network across all noise levels, exactly the setup NCSN converged on. And the reverse-process sampler is, at heart, Langevin dynamics on a learned probability landscape.
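The scaling between noise and score can be verified on a toy case where everything is analytic. The sketch assumes 1D data that is +2 or -2 with equal probability (an illustrative choice), so the noised marginal is a two-component Gaussian mixture and the posterior mean of ε has a closed form.

```python
import numpy as np

M = 2.0  # toy data: x_0 = +M or -M with equal probability (assumption for illustration)

def log_q_t(x, alpha_bar):
    """Log-density of the noised marginal: 0.5 N(sM, v) + 0.5 N(-sM, v)."""
    s, v = np.sqrt(alpha_bar), 1.0 - alpha_bar
    lp = -(x - s * M) ** 2 / (2 * v)
    lm = -(x + s * M) ** 2 / (2 * v)
    return np.logaddexp(lp, lm) + np.log(0.5) - 0.5 * np.log(2 * np.pi * v)

def expected_eps(x, alpha_bar):
    """E[eps | x_t = x], the MSE-optimal noise prediction for this toy data."""
    s, v = np.sqrt(alpha_bar), 1.0 - alpha_bar
    x0_hat = M * np.tanh(s * M * x / v)            # posterior mean of x_0
    return (x - s * x0_hat) / np.sqrt(v)

alpha_bar = 0.5
x = np.linspace(-3.0, 3.0, 13)
h = 1e-5
score_fd = (log_q_t(x + h, alpha_bar) - log_q_t(x - h, alpha_bar)) / (2 * h)  # numeric score
score_from_eps = -expected_eps(x, alpha_bar) / np.sqrt(1.0 - alpha_bar)
print(np.max(np.abs(score_fd - score_from_eps)))
```

The finite-difference score and the rescaled noise prediction agree to numerical precision, which is the identity the ε-parameterization relies on.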

This is why DDPM and the score-based generative models from Song & Ermon converged: they are the same algorithm in two different costumes. Song et al.'s 2021 paper made the equivalence explicit by formulating both as discretizations of a continuous-time stochastic differential equation (a forward noising SDE paired with a reverse-time SDE), which then admits a corresponding deterministic ODE (the probability flow ODE). The realization unified the field around 2020–2021 and is the reason later papers (rectified flow, flow matching, consistency models) could be derived as variations on a shared theme. If you had written down ε-prediction as a score-matching objective from the start, you would already be 80% of the way to flow matching.

score field ∇ log p_t(x)

The vector field a diffusion model is learning

Three-mode target in 2D. At each t we draw the analytic score of the noised marginal, the same quantity ε_θ approximates up to a scale. Particles drift along the arrows and concentrate on the modes. Watch the field reorganise as you scrub t.

[Figure: score field ∇ log p_t(x) plotted over axes x₁ and x₂.]

At large t the field is nearly radial: every point gets pulled toward the origin, because the noised distribution is almost a single unit Gaussian. As t shrinks, structure emerges: arrows curve toward the nearest mode, and particles route into the three pockets of probability mass. Reverse diffusion is just following these arrows uphill in log-density, with a little stochastic kick to keep mixing.

The vector field a diffusion model is approximating. ∇ log p_t(x) for a 3-mode mixture in 2D. Particles released at high t drift along the arrows into the modes — reverse diffusion is exactly this drift, plus a small stochastic kick.

The reverse process: walk back through the chain

At inference time you start from pure Gaussian noise x_T. You ask the network: given this noisy thing and the fact that we're at timestep T, what's your best guess for the noise component? You subtract a scaled version of that prediction (this is the score step), add a little fresh noise (Langevin needs the kick to mix), and now you have x_{T-1}. Repeat with t = T-1, T-2, all the way down to t = 1, whose update produces x_0. The full update under DDPM is x_{t-1} = (1/√α_t)(x_t - (β_t/√(1-ᾱ_t)) ε_θ(x_t, t)) + σ_t z, where α_t = 1 - β_t, z ~ N(0, I) (with z set to 0 on the final step), and σ_t is typically chosen as √β_t.

By the time you reach x_0, the network has nudged you, step by step, from a random vector toward something that lies on the data manifold. Each individual nudge was small and locally well-posed. The cumulative effect is generation.
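Here is a sketch of that ancestral sampling loop with σ_t = √β_t. To keep it self-contained, the trained network is replaced by the exact optimal noise predictor for toy 1D data that is +2 or -2 with equal probability. That substitution, and the names `exact_eps` and `ddpm_sample`, are assumptions for illustration; a real sampler would call ε_θ instead.

```python
import numpy as np

M = 2.0  # toy data: x_0 = +M or -M with equal probability

def exact_eps(x_t, alpha_bar):
    """Stand-in for eps_theta: the optimal prediction E[eps | x_t] for the toy data."""
    s, v = np.sqrt(alpha_bar), 1.0 - alpha_bar
    return (x_t - s * M * np.tanh(s * M * x_t / v)) / np.sqrt(v)

def ddpm_sample(n, rng, T=1000):
    betas = np.linspace(1e-4, 2e-2, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(n)                       # x_T ~ N(0, 1)
    for i in range(T - 1, -1, -1):                   # i is the 0-based index of timestep t = i + 1
        eps = exact_eps(x, alpha_bars[i])
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])
        z = rng.standard_normal(n) if i > 0 else 0.0 # no fresh noise on the final step
        x = mean + np.sqrt(betas[i]) * z             # sigma_t = sqrt(beta_t)
    return x

samples = ddpm_sample(2000, np.random.default_rng(0))
print(np.mean(samples > 0), np.mean(np.abs(np.abs(samples) - M) < 0.05))
```

With an exact predictor the particles split between the two modes and end up on them. With a learned ε_θ the loop is identical and only the predictor changes.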

Coarse-to-fine, ordered by noise scale

This is the key intuition for why this works when single-shot generation doesn't: at high noise (early sampling steps), the network only needs to identify coarse structure ("there should be a face roughly here, with skin-toned pixels in the middle"). At low noise (late steps), it only needs to refine textures and edges. The hard global problem of "what is a plausible image" is decomposed into a curriculum of local denoising problems, ordered by scale. The network never has to solve the whole problem at once.

There's a useful analogy to drawing. An artist doesn't sketch a portrait by deciding the final color of pixel (372, 415) and committing. They block out shapes first, then refine, then add detail. Diffusion samplers do the same in noise space. The early steps decide the gist (where the face is, what species the dog is); the late steps decide whether the eyelashes are crisp.

This curriculum also has a nice information-theoretic reading. At step t the network has access to a signal whose SNR is fixed by the schedule. At high t the SNR is so low that no fine detail is recoverable from this single sample, so the optimal thing to predict is something close to the mean of the data — coarse structure. At low t the SNR is high and the optimal prediction has high-frequency content. The network is implicitly doing multiscale denoising, and the noise schedule is the thing that tells it which scale matters when.

reverse diffusion on 1D mixture

Walk noise back into structure, one score-step at a time

Two hundred particles start as Gaussian noise at t=T. At each step we move them along the analytic score of the noised mixture density, which is exactly what ε_θ would learn to predict. Watch a smooth bell collapse into two clean modes.

[Figure: target p₀(x) and current p_t(x), plotted against x from −4 to 4.]

The dashed gold curve is the target mixture; the teal curve is p_t(x), the data convolved with Gaussian noise of variance 1 − ᾱ_t. The reverse step moves each particle along ∇ log p_t(x), the score, which is identical (up to scale) to predicting ε. Push guidance positive to bias the score toward the +2 mode; the population shifts even though the dynamics are the same.

Slide t from T to 0 and watch noise reorganize into two clean modes. The dynamics are real — every step uses the analytic score of the noised mixture, the very quantity ε_θ approximates.

Architecture: U-Nets, time embeddings, and self-attention

DDPM is a recipe; you still have to choose a network. Ho et al. used a U-Net, the same architecture from medical image segmentation. The reasons hold up: the downsampling path aggregates global context, the upsampling path restores full resolution, and skip connections carry fine spatial detail across. The output is the same shape as the input, which is what ε-prediction wants. Ho et al. also placed self-attention at the 16×16 feature-map resolution, and later models kept attention at the coarser resolutions, where global mixing is cheapest.

The timestep is fed in as an embedding. The standard trick is to compute sinusoidal features of t (the same Fourier features you'd use for transformer position embeddings), pass them through a small MLP, and then add or modulate ResNet block activations with the result. This is essentially a FiLM layer — feature-wise linear modulation conditioned on t. The network now has a way to know which noise level it is denoising, which it has to know because the right answer at t=10 (almost-clean) is very different from the right answer at t=900 (almost-pure-noise).
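A minimal sketch of the timestep pathway, assuming the transformer-style frequency spacing with a maximum period of 10000. The small MLP is collapsed to a single random linear map for brevity, and `timestep_embedding`, `film`, and the weight shapes are all illustrative, not the paper's code.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal features of timesteps t, shape (len(t), dim). dim is assumed even."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)   # geometric frequency ladder
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def film(h, emb, W_scale, W_shift):
    """FiLM-style modulation: per-channel scale and shift computed from the embedding."""
    scale = emb @ W_scale                      # (batch, channels)
    shift = emb @ W_shift
    return h * (1.0 + scale[:, :, None, None]) + shift[:, :, None, None]

rng = np.random.default_rng(0)
emb = timestep_embedding([10, 900], dim=32)    # one almost-clean and one almost-noise timestep
h = rng.standard_normal((2, 8, 4, 4))          # (batch, channels, height, width) activations
W_scale = 0.1 * rng.standard_normal((32, 8))   # random stand-ins for learned weights
W_shift = 0.1 * rng.standard_normal((32, 8))
out = film(h, emb, W_scale, W_shift)
print(emb.shape, out.shape)
```

Adding the projected embedding to the activations (shift only) is the simpler variant. The scale-and-shift form shown here is the one that matches FiLM.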

None of these choices are sacred. The latent diffusion follow-ups switched to transformer backbones (DiT) and matched or beat U-Net quality. AlphaFold 3 uses a fully different architecture with structure-aware modules. The DDPM training recipe is architecture-agnostic; what matters is that your network takes in (noisy_input, t, conditioning) and outputs a same-shaped tensor, and that you train it with MSE on noise.

Conditioning and classifier-free guidance

Unconditional generation is a parlor trick. The interesting version is conditional generation: "draw a corgi wearing sunglasses." You pass a text embedding (or a class label, or a depth map, or whatever) into the denoiser alongside x_t and t. The network learns to denoise toward samples consistent with the condition, which means estimating the conditional score ∇_x log p_t(x | y) instead of the unconditional one.

The first version of this conditioning was classifier guidance (Dhariwal & Nichol). You train a separate classifier on noisy images, then at sampling time you bias the score by the gradient of the classifier's log-probability on the desired class. Algebraically: ∇_x log p_t(x | y) = ∇_x log p_t(x) + ∇_x log p_t(y | x). The first term is your score network. The second is the classifier gradient. Add them, optionally with a weight, and you get conditional sampling.
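The identity above is just Bayes' rule followed by a gradient in x. Spelled out, with γ as an illustrative symbol for the optional guidance weight:

```latex
\begin{align*}
p_t(x \mid y) &= \frac{p_t(y \mid x)\, p_t(x)}{p_t(y)} \\
\log p_t(x \mid y) &= \log p_t(x) + \log p_t(y \mid x) - \log p_t(y) \\
\nabla_x \log p_t(x \mid y) &= \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)
  && \text{since } \nabla_x \log p_t(y) = 0 \\
\text{weighted form:}\quad
  & \nabla_x \log p_t(x) + \gamma\, \nabla_x \log p_t(y \mid x)
\end{align*}
```

The normalizer p_t(y) does not depend on x, which is why it never has to be computed.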

Classifier guidance works but is annoying. You need a separate classifier trained on noisy inputs. The classifier has to be reasonably accurate at all noise levels. And the trick doesn't generalize cleanly to text conditioning, where there is no fixed class set.

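Classifier-free guidance (Ho & Salimans 2022) is the version that won. During training, you randomly drop the conditioning input, for example by replacing it with a special null embedding 10% of the time. The same network thus learns both the conditional score (when y is provided) and the unconditional score (when y is null). At sampling time, you query both: ε_cond and ε_uncond. Then you take a guided prediction:

ε_guided = (1 − w) ε_uncond + w ε_cond = ε_uncond + w (ε_cond − ε_uncond)

Here w is the guidance scale: w = 0 gives the unconditional prediction, w = 1 the conditional one, and w > 1 extrapolates past the conditional.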

This trick is one of the most important sampling-time interventions in modern generative modeling. It costs nothing extra in training (you were already training the network; just zero out the condition sometimes) and gives you a single dial at inference time that trades off how-much-it-listens-to-you against how-diverse-it-is. Every text-to-image model you have ever used cranks this dial up to somewhere between 5 and 12 and tells you to lower it if you want surprises.

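There is a Bayesian story for why w>1 helps. Consider the distribution proportional to p(x | y)^w · p(x)^{1-w}, which equals p(x) · p(y | x)^w up to normalization: at w=1 this is the true conditional, and at w>1 it is a sharpened conditional that down-weights samples consistent with many other conditions. The guided score is the score of this combination built from the noised conditional and unconditional densities at each noise level. That is not quite the same as noising the sharpened distribution itself, so CFG samples from it only approximately. The intuition: in a world where most images are not-corgi-with-sunglasses, the model's marginal pull toward typicality is a problem; CFG lets you push back against it.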
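In code, the guidance step is a one-line blend of two network outputs. The sketch below uses the convention in which w = 0 is unconditional and w = 1 is conditional, and it substitutes exact predictors for a toy setup (conditional data is a point at +2, unconditional data is +2 or -2 with equal probability). Those predictors and all function names are assumptions for illustration.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Guided noise prediction: (1 - w) * eps_uncond + w * eps_cond."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def eps_cond_exact(x_t, alpha_bar, m=2.0):
    """Optimal eps when the conditional data is a single point at +m."""
    return (x_t - np.sqrt(alpha_bar) * m) / np.sqrt(1.0 - alpha_bar)

def eps_uncond_exact(x_t, alpha_bar, m=2.0):
    """Optimal eps when the unconditional data is +m or -m with equal probability."""
    s, v = np.sqrt(alpha_bar), 1.0 - alpha_bar
    return (x_t - s * m * np.tanh(s * m * x_t / v)) / np.sqrt(v)

x_t, alpha_bar = np.array([-1.0, 0.0, 1.0]), 0.5
e_c = eps_cond_exact(x_t, alpha_bar)
e_u = eps_uncond_exact(x_t, alpha_bar)
for w in (0.0, 1.0, 3.0):
    print(w, cfg_eps(e_c, e_u, w))
```

In a real sampler the two predictions come from the same network, called once with the condition and once with the null embedding, so guidance doubles the per-step network cost unless the two calls are batched.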

classifier-free guidance scale w

A sampling-time knob with a cost

Conditional target: a single peak at +2. Unconditional target: a 50/50 mixture at ±2. Guided score = (1−w) s_uncond + w s_cond. With w=0 you sample the unconditional mixture. With w=1 you sample the true conditional. With w>1 you over-extrapolate away from the unconditional, sharpening the conditional peak and squeezing diversity.

[Figure: histogram of samples at w = 3 over x from −4 to 4; fidelity (fraction on +mode) versus σ (diversity) for w from −1 to 8.]

At w=0, the chain ignores the condition and lands half on each mode. At w=1 the conditional is recovered; samples cluster on +2 with the natural σ ≈ 0.4. Push w higher and the bell sharpens further at the cost of variance — at w=8 the samples are over-saturated, hugging the mode tighter than the data ever did. This is the diffusion-image equivalent of cranking CFG to 12 and getting cartoonish, prompt-fixated output.

Sweep w. At 0 you sample the unconditional mixture. At 1 you sample the conditional. At 8 you over-saturate to the conditional mode and lose all diversity — the same dynamic that gives you cartoonish, hyper-on-prompt images when you crank text-to-image guidance too high.

Sampling speed: DDIM, ODEs, and the road to one step

DDPM samples are slow because you have to run the network a thousand times per image. For unconditional CIFAR-10 in 2020, this was a curiosity. For text-to-image at 1024×1024 on a phone, it is a problem. Almost everything that came after DDPM was an attempt to make the chain shorter or smarter.

DDIM (Song et al. 2021) was the first major breakthrough. The observation: the same trained ε-network can be re-interpreted as a deterministic mapping from noise to data. Specifically, the forward process can be expressed as a marginal of a non-Markovian process whose corresponding reverse update is deterministic given the network output. That deterministic update is the DDIM sampler. It admits arbitrary sub-sampling of the timestep schedule (use 50 steps instead of 1000) without retraining. Quality stays high even at K=20, where DDPM stochastic sampling has visibly degraded.

The reason DDIM works at low K is geometric. DDPM sampling adds fresh stochasticity at every step, so each step's error is independent and the variance accumulates linearly with the number of steps you skip. DDIM sampling is deterministic, which means it is integrating an ODE — the probability flow ODE — and ODE integrators get to use the trained vector field's structure to take longer steps. With K=50 DDIM steps you are running a 50-step Euler scheme on a smooth ODE; with K=50 DDPM steps you are running a noisy Euler-Maruyama scheme on the corresponding SDE, and the SDE is harder.
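A sketch of the deterministic DDIM update on a sub-sampled schedule: predict x_0 from the current noise estimate, then re-noise to the previous level using the same ε. As before, the trained network is replaced by the exact predictor for toy 1D data at ±2. That substitution and the evenly spaced timestep choice are assumptions for illustration.

```python
import numpy as np

M = 2.0  # toy data: x_0 = +M or -M with equal probability

def exact_eps(x_t, alpha_bar):
    """Stand-in for eps_theta: the optimal prediction E[eps | x_t] for the toy data."""
    s, v = np.sqrt(alpha_bar), 1.0 - alpha_bar
    return (x_t - s * M * np.tanh(s * M * x_t / v)) / np.sqrt(v)

def ddim_sample(n, K, rng, T=1000):
    alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, T))
    idx = np.linspace(T - 1, 0, K).round().astype(int)        # K timesteps, high noise to low
    x = rng.standard_normal(n)
    for j, i in enumerate(idx):
        a_t = alpha_bars[i]
        a_prev = alpha_bars[idx[j + 1]] if j + 1 < K else 1.0  # last step lands on clean data
        eps = exact_eps(x, a_t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t) # implied clean sample
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps  # deterministic, no fresh noise
    return x

samples = ddim_sample(2000, K=20, rng=np.random.default_rng(0))
print(np.mean(samples > 0), np.mean(np.abs(np.abs(samples) - M) < 0.05))
```

The only randomness is the initial x_T, so the same seed always maps to the same sample. That determinism is what lets the update be read as a solver step on an ODE.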

Once people saw the ODE, the field reached for the standard ODE-solver toolbox. DPM-Solver uses higher-order methods to get DDPM-quality samples in 10–20 steps. PNDM uses linear multistep methods. Heun's method with carefully tuned schedules gets you to 5–10 step generation. None of these require retraining; they are inference-time accelerators on top of the same ε-network.

Then a different question: can we train a model that generates in one step? Consistency models (Song et al. 2023) distill a pre-trained diffusion model into a one-step generator (or train one from scratch) by enforcing that the network's prediction is consistent across all timesteps along the same ODE trajectory. Rectified flows (Liu et al.) train a model to fit straight paths from noise to data; once the paths are straight, a single Euler step is exact. Flow matching (Lipman et al.) gives a more general framework for regressing a velocity field along a chosen probability path, with diffusion paths and straight-line paths as special cases. The line of work is pointed at one-step generation as the destination, with consistency-distilled latent diffusion already approaching the quality of multi-step samplers in production text-to-image systems.

sampling steps vs quality

Why DDIM exists

Same trained score, two samplers. DDPM is the original stochastic chain — collapses if you try to short-circuit it. DDIM is a deterministic ODE-style sampler that interprets the same network as a velocity field; it gives clean samples even with K=5.

[Figure: histogram of DDPM samples at K = 25 over x from −4 to 4; fraction of samples on a mode (±1) versus K from 1 to 1000, for DDPM and DDIM.]

At K=1000 both samplers reproduce the bimodal target. At K=25 DDPM samples are still roughly right; DDIM is essentially perfect. At K=5 DDPM noticeably degrades while DDIM still nails both modes. At K=1 neither works — there is no free lunch — but the deterministic sampler degrades much more gracefully. This curve is the entire reason consistency models, rectified flow, and one-step distillation became a research direction: shrinking K is worth a lot.

DDPM and DDIM, same trained score, different samplers. At K=1000 both reproduce the target. At K=5 DDIM still nails it, DDPM is degraded. At K=1 neither works — there is no free lunch — but DDIM degrades much more gracefully, which is why ODE samplers and consistency distillation became a research direction.

Latent diffusion: don't denoise pixels, denoise codes

There is a second axis of speedup, and it is independent of the sampler. Latent diffusion (Rombach et al. 2022) noticed that you don't need to run the diffusion process in pixel space at all. Train an autoencoder that compresses 512×512 RGB images to a 64×64×4 latent, then run diffusion in that latent space. The denoising network is much smaller (because the input is much smaller), the chain is much cheaper per step, and the autoencoder absorbs the high-frequency reconstruction work that diffusion is bad at anyway.
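The size reduction is easy to quantify from the shapes quoted above. This is back-of-envelope arithmetic on element counts only. It says nothing about the per-element cost of the denoiser, which depends on the architecture.

```python
pixel_shape = (512, 512, 3)    # RGB image
latent_shape = (64, 64, 4)     # autoencoder latent

def numel(shape):
    """Total number of elements in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

compression = numel(pixel_shape) / numel(latent_shape)
print(numel(pixel_shape), numel(latent_shape), compression)   # 786432 16384 48.0
```

The denoiser therefore processes 48 times fewer values per step, and its spatial grid is 8 times smaller along each side.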

Stable Diffusion is this. Almost all production text-to-image is this. Almost all production text-to-video is some variant of this. The autoencoder is typically a VQ-VAE or KL-regularized continuous autoencoder; the diffusion model in the middle is a U-Net or DiT trained with the standard ε-MSE loss. Conditioning (text, image-to-image, ControlNet, depth, etc.) gets added as cross-attention onto the U-Net, with classifier-free guidance on top.

The key insight is that perceptually-relevant generation and high-frequency reconstruction are different problems. The autoencoder solves the high-frequency reconstruction problem with a deterministic, low-capacity network. Diffusion solves the perceptual generation problem in a smaller space where every dimension matters. Splitting these problems was worth maybe a 10× speedup at fixed quality, and it's the architectural reason text-to-image got cheap enough to put on consumer GPUs.

Where the field went after that

The DDPM playbook escaped images quickly. AlphaFold 3 (Abramson et al. 2024) replaced AlphaFold 2's structure module with a diffusion-style sampler over atomic coordinates, which let the same model predict protein-ligand-DNA-RNA complexes within a single architecture. The diffusion isn't quite the same as image DDPM — the noise schedule is over 3D coordinates, the network is structure-aware — but the core algorithm is recognizable: noise the structure, learn to denoise it, sample by reverse diffusion. The 'generation is hard, denoising is easy' principle generalized cleanly from RGB to coordinates.

RFdiffusion (Watson et al. 2023) does protein backbone design by denoising. Given a constraint (a binding site, a fold class, a symmetry), it samples plausible protein backbones from noise. The network is a structure-aware variant of RoseTTAFold; the training signal is a mean-squared-error denoising loss on noised residue frames (translations and rotations), the same regression idea as ε-MSE carried over to rigid-body coordinates. The successes, from de novo binders to enzymes to symmetric assemblies, read like the image-generation results from 2021, transplanted into structural biology.

Audio (AudioLDM and other diffusion-based audio generators), video (Sora and its open-source descendants), 3D shape generation (DreamFusion, Shap-E), molecular conformer generation (GeoDiff, MiDi), even decision-making (Diffuser for trajectory generation in RL): all of them are running the DDPM playbook with domain-specific architectures and noise distributions. The set of fields that diffusion has not yet touched keeps shrinking.

What it doesn't solve

Sampling speed is still the headline cost, even after all the speedups. Generating a high-quality 1024×1024 image with a state-of-the-art latent diffusion model is more expensive than running a discriminative model once, often by a factor of 20–100×. For real-time applications this matters. Consistency models and few-step distillations are closing the gap, but the trade-off between sample quality and step count is still real, and the few-step models tend to lose some of the multi-step model's compositional precision.

Likelihood is awkward. Diffusion models do define a likelihood (you can compute it via the probability flow ODE), but it's not as easy to evaluate or compare as autoregressive likelihood. This means model selection and uncertainty quantification have to use proxy metrics (FID, CLIP score, human eval) rather than held-out log-likelihood, which introduces all the well-known problems with sample-quality metrics. FID in particular is known to disagree with human preferences in systematic ways once the underlying models get good enough.

Compositional generation is unsolved. Ask a diffusion model for "a red cube on top of a blue sphere" and you get something like that, sometimes. Ask it for "five apples, three oranges, and two pears" and counting fails. The model has learned the marginal distribution of natural images, not the compositional grammar of the world. This is the same failure mode that classifier-free guidance papers over with brute-force prompt adherence and that ControlNet-style conditioning sidesteps by providing explicit structural inputs.

And there's still the question of what diffusion is learning in any deep sense. The score function is a low-level object; the high-level structure ("this is a corgi", "this protein binds this ligand") is implicit in how the network composes denoising decisions across scales and steps. We don't have great tools to read out conceptual structure from a denoiser the way mechanistic interpretability is starting to read out structure from transformers. Diffusion is a powerful machine that we don't fully understand. The interpretability gap is one of the things that makes safety arguments about diffusion-based image and video generation harder than they need to be.

If you read the original

The Ho/Jain/Abbeel paper is exceptionally well-written. The key derivation is in Section 3 — work through equation 14 (the simplified L_simple objective) by hand at least once, paying attention to which weighting terms cancel and which are dropped by hand. You'll want to also read Sohl-Dickstein 2015, the paper that originally proposed diffusion and was largely ignored for five years until DDPM made it work. And Song & Ermon's score-matching paper from 2019 (NCSN) is the other half of the unification — once you've read both, you understand why diffusion is just one face of a more general object.

If you want the full modern framing, Song et al.'s 2021 Score-Based Generative Modeling through Stochastic Differential Equations paper is the right next step. It rederives DDPM and NCSN as discretizations of the same SDE family and introduces the probability flow ODE. After that, Lipman et al.'s flow-matching paper and the Song et al. consistency models paper give you the post-DDPM reading list.

The exercise: train a tiny diffusion model on MNIST. A small MLP with sinusoidal timestep embedding is enough to get started. Use the linear β schedule, T=1000, the L_simple objective, and Adam at 1e-3. Sample with the standard DDPM update. Watch the reverse chain pull recognizable digits out of static. The first time it works (and it should, with about 200 lines of code) you stop being mystified by Stable Diffusion. Then swap in a DDIM sampler at K=50 and watch the same model produce comparable digits in 1/20th the time. Then add a class label as a one-hot conditioning input, drop it 10% of the time during training, and play with classifier-free guidance. By the time you have done all three you have implemented, in miniature, the entire diffusion stack that powers modern generative AI.