
CMX Lunch Seminar

Tuesday, February 17, 2026
12:00pm to 1:00pm
Annenberg 213
A Principled Framework for Discrete Diffusion Models via Denoising
Ricardo Baptista, Assistant Professor, Statistical Sciences, University of Toronto

Discrete generative models provide a probabilistic framework for representing and sampling discrete data such as text sequences. In the continuous setting, score-based diffusion models have rapidly become the state of the art for tasks involving images, video, and other continuous-valued data. A key reason for their success is that estimating the score function, the gradient of the log-density of a perturbed data distribution, can be linked to a denoising problem through Tweedie's formula, which enables the use of well-established supervised methods for learning the score. In the discrete setting, diffusion models offer a promising alternative to autoregressive models for generating entire text sequences at once in large language models. However, they have not yet achieved performance comparable to autoregressive models and often require specialized loss functions and architectures to approximate quantities analogous to the score function. We begin by reviewing the mathematical formulation of discrete diffusion models and then introduce a framework that parallels continuous flow-based generative modeling. Specifically, we propose Binomial flows for non-negative ordinal data. We show that this approach provides a simple recipe for training, sampling, and computing exact likelihoods in discrete diffusion models via a discrete version of Tweedie's formula. Finally, we demonstrate that sampling can be performed using a Poisson–Föllmer process, which has desirable theoretical properties and yields competitive performance on real-world image generation tasks.
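
For context (a sketch not drawn from the abstract itself): in the standard continuous Gaussian setting, where noisy data are formed as $x_t = x_0 + \sigma_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, Tweedie's formula reads

\[
\nabla_{x_t} \log p_t(x_t) \;=\; \frac{\mathbb{E}[x_0 \mid x_t] - x_t}{\sigma_t^2},
\]

so the score of the perturbed distribution $p_t$ is recovered from the posterior mean $\mathbb{E}[x_0 \mid x_t]$, i.e., from the solution of a supervised denoising problem. The discrete version of Tweedie's formula described in the talk is presented as playing the analogous role for Binomial flows.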

For more information, please contact Jolene Brink by phone at (626) 395-2813 or by email at [email protected], or visit the CMX Website.