MARKUP-TO-IMAGE DIFFUSION MODELS
WITH SCHEDULED SAMPLING
Yuntian Deng1, Noriyuki Kojima2, Alexander M. Rush2
1Harvard University dengyuntian@seas.harvard.edu
2Cornell University {nk654,arush}@cornell.edu
ABSTRACT
Building on recent advances in image generation, we present a fully data-driven
approach to rendering markup into images. The approach is based on diffusion
models, which parameterize the distribution of data using a sequence of denoising
operations on top of a Gaussian noise distribution. We view the diffusion
denoising process as a sequential decision making process, and show that it
exhibits compounding errors similar to exposure bias issues in imitation learning
problems. To mitigate these issues, we adapt the scheduled sampling algorithm
to diffusion training. We conduct experiments on four markup datasets:
mathematical formulas (LaTeX), table layouts (HTML), sheet music (LilyPond), and
molecular images (SMILES). These experiments each verify the effectiveness of
the diffusion process and the use of scheduled sampling to fix generation issues.
These results also show that the markup-to-image task presents a useful controlled
compositional setting for diagnosing and analyzing generative image models.
1 INTRODUCTION
Recent years have witnessed rapid progress in text-to-image generation with the development and
deployment of pretrained image/text encoders (Radford et al., 2021; Raffel et al., 2020) and
powerful generative processes such as denoising diffusion probabilistic models (Sohl-Dickstein et al.,
2015; Ho et al., 2020). Most existing image generation research focuses on generating realistic
images conditioned on possibly ambiguous natural language (Nichol et al., 2021; Saharia et al., 2022;
Ramesh et al., 2022). In this work, we instead study the task of markup-to-image generation, where
the presentational markup describes exactly one-to-one what the final image should look like.
While the task of markup-to-image generation can be accomplished with standard renderers, we
argue that this task has several nice properties for acting as a benchmark for evaluating and analyzing
text-to-image generation models. First, the deterministic nature of the problem enables exposing and
analyzing generation issues in a setting with known ground truth. Second, the compositional nature
of markup language is nontrivial for neural models to capture, making it a challenging benchmark
for relational properties. Finally, developing a model-based markup renderer enables interesting
applications such as markup compilers that are resilient to typos, or that even mix natural and
structured commands (Glennie, 1960; Teitelman, 1972).
We build a collection of markup-to-image datasets shown in Figure 1: mathematical formulas, table
layouts, sheet music, and molecules (Nienhuys & Nieuwenhuizen, 2003; Weininger, 1988). These
datasets can be used to assess the ability of generation models to produce coherent outputs in a
structured environment. We then experiment with applying diffusion models, which represent the
current state of the art in conditional generation of realistic images, to these tasks.
The markup-to-image challenge exposes a new class of generation issues. For example, when
generating formulas, current models generate perfectly formed output, but often duplicate or
misplace symbols (see Figure 2). This type of error is similar to the widely studied exposure bias
issue in autoregressive text generation (Ranzato et al., 2015). To help the model fix this class of
errors during the generation process, we propose to adapt scheduled sampling (Bengio et al., 2015).
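As a rough illustration of the idea, the sketch below adds a scheduled-sampling branch to a standard DDPM training step: with some probability, the noisy input x_t is produced by rolling the model back one step from x_{t+1} rather than by the forward process, so the denoiser is trained on inputs that resemble its own intermediate samples. The model signature, noise schedule, single-step rollout, and re-derived noise target are illustrative assumptions, not the exact recipe proposed in this paper.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    # Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def p_sample(model, x, t, cond):
    # One reverse (denoising) step x_t -> x_{t-1} using the model's own prediction.
    eps_hat = model(x, t, cond)  # hypothetical signature: (image, timestep, markup condition)
    a = alphas[t].view(-1, 1, 1, 1)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    mean = (x - (1.0 - a) / (1.0 - a_bar).sqrt() * eps_hat) / a.sqrt()
    return mean + betas[t].view(-1, 1, 1, 1).sqrt() * torch.randn_like(x)

def training_step(model, x0, cond, rollout_prob=0.5):
    # x0: clean images (b, c, h, w); cond: encoded markup; rollout_prob: scheduled-sampling rate.
    b = x0.size(0)
    t = torch.randint(0, T - 1, (b,))  # leave room for t + 1
    if torch.rand(()) < rollout_prob:
        # Scheduled-sampling branch: obtain x_t by denoising the model's own input x_{t+1}.
        x_tp1 = q_sample(x0, t + 1, torch.randn_like(x0))
        with torch.no_grad():
            x_t = p_sample(model, x_tp1, t + 1, cond)
        # Noise target implied by treating x_t as if it came from the forward process.
        a_bar = alpha_bars[t].view(-1, 1, 1, 1)
        noise = (x_t - a_bar.sqrt() * x0) / (1.0 - a_bar).sqrt()
    else:
        # Standard DDPM branch: x_t drawn from the forward process.
        noise = torch.randn_like(x0)
        x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t, cond), noise)
```

In this sketch the rollout is wrapped in torch.no_grad(), so gradients flow only through the final denoising prediction rather than through the sampling step, which keeps the training cost close to that of the standard objective.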