MARKUP-TO-IMAGE DIFFUSION MODELS
WITH SCHEDULED SAMPLING
Yuntian Deng1, Noriyuki Kojima2, Alexander M. Rush2
1Harvard University dengyuntian@seas.harvard.edu
2Cornell University {nk654,arush}@cornell.edu
ABSTRACT
Building on recent advances in image generation, we present a fully data-driven
approach to rendering markup into images. The approach is based on diffusion
models, which parameterize the distribution of data using a sequence of denois-
ing operations on top of a Gaussian noise distribution. We view the diffusion
denoising process as a sequential decision making process, and show that it ex-
hibits compounding errors similar to exposure bias issues in imitation learning
problems. To mitigate these issues, we adapt the scheduled sampling algorithm
to diffusion training. We conduct experiments on four markup datasets: math-
ematical formulas (LaTeX), table layouts (HTML), sheet music (LilyPond), and
molecular images (SMILES). These experiments each verify the effectiveness of
the diffusion process and the use of scheduled sampling to fix generation issues.
These results also show that the markup-to-image task presents a useful controlled
compositional setting for diagnosing and analyzing generative image models.
1 INTRODUCTION
Recent years have witnessed rapid progress in text-to-image generation with the development and
deployment of pretrained image/text encoders (Radford et al., 2021; Raffel et al., 2020) and pow-
erful generative processes such as denoising diffusion probabilistic models (Sohl-Dickstein et al.,
2015; Ho et al., 2020). Most existing image generation research focuses on generating realistic im-
ages conditioned on possibly ambiguous natural language (Nichol et al., 2021; Saharia et al., 2022;
Ramesh et al., 2022). In this work, we instead study the task of markup-to-image generation, where
the presentational markup specifies, one-to-one, exactly what the final image should look like.
While the task of markup-to-image generation can be accomplished with standard renderers, we
argue that this task has several nice properties for acting as a benchmark for evaluating and analyzing
text-to-image generation models. First, the deterministic nature of the problem enables exposing and
analyzing generation issues in a setting with known ground truth. Second, the compositional nature
of markup language is nontrivial for neural models to capture, making it a challenging benchmark
for relational properties. Finally, developing a model-based markup renderer enables interesting
applications such as markup compilers that are resilient to typos, or even enable mixing natural and
structured commands (Glennie, 1960; Teitelman, 1972).
We build a collection of markup-to-image datasets shown in Figure 1: mathematical formulas, table
layouts, sheet music, and molecules (Nienhuys & Nieuwenhuizen, 2003; Weininger, 1988). These
datasets can be used to assess the ability of generation models to produce coherent outputs in a
structured environment. We then experiment with utilizing diffusion models, which represent the
current state-of-the-art in conditional generation of realistic images, on these tasks.
The markup-to-image challenge exposes a new class of generation issues. For example, when gen-
erating formulas, current models produce perfectly formed symbols, but often duplicate or misplace
them (see Figure 2). This type of error is similar to the widely studied exposure bias
issue in autoregressive text generation (Ranzato et al., 2015). To help the model fix this class of
errors during the generation process, we propose to adapt scheduled sampling (Bengio et al., 2015).
arXiv:2210.05147v1 [cs.LG] 11 Oct 2022
Math
\widetilde \gamma _{\mathrm {hopf}}\simeq \sum _{n>0}\widetilde {G}_{n}{\frac {(-a)^{n}}{2^{2n-1}}}
Table Layouts
... <span style=" font-weight:bold; text-align:center; font-size:150%; " > f j </span> </div> ...
Sheet Music
\relative c'' { \time 4/4 d4 | r2b4 b2|ces4bg2f4|a4d8|e4g16 g2f2 r4|des2d8d8f8e4d8a16 b16|d4e2 d2. ag4r16~ e16. d2f4b4e2|f4.|b16 a16 e4. rc4r4b4d8b2|d4|r8. e8e2|re2 }
Molecules
COc1ccc(cc1N)C(=O)Nc2ccccc2
Figure 1: Markup-to-Image suite with generated images. Tasks include mathematical formulas
(LaTeX), table layouts (HTML), sheet music (LilyPond), and molecular images (SMILES). Each
example is conditioned on a markup (bottom) and produces a rendered image (top). Evaluation
directly compares the rendered image with the ground truth image.
Specifically, we train diffusion models by using the model’s own generations as input such that the
model learns to correct its own mistakes.
Experiments on all four datasets show that the proposed scheduled sampling approach improves
the generation quality compared to baselines, and generates images of surprisingly good quality for
these tasks. Models produce clearly recognizable images for all domains, and often do very well at
representing the semantics of the task. Still, there is more to be done to ensure faithful and consistent
generation in these difficult deterministic settings. All models, data, and code are publicly available
at https://github.com/da03/markup2im.
2 MOTIVATION: DIFFUSION MODELS FOR MARKUP-TO-IMAGE
GENERATION
Task We define the task of markup-to-image generation as converting a source in a markup
language describing an image to that target image. The input is a sequence of M tokens x =
x_1, · · · , x_M ∈ X, and the target is an image y ∈ Y ⊆ R^{H×W} of height H and width W (for
simplicity we only consider grayscale images here). The task of rendering is defined as a mapping
f : X → Y. Our goal is to approximate the rendering function using a model f_θ : X → Y,
parameterized by θ, trained on supervised examples {(x_i, y_i) : i ∈ {1, 2, · · · , N}}. To make the task
tangible, we show several examples of (x, y) pairs in Figure 1.
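As a concrete, if toy, illustration of this setup, the sketch below defines a hypothetical deterministic renderer f and builds supervised (x, y) pairs from it. The renderer and corpus here are invented purely for illustration; they are not the actual LaTeX/HTML/LilyPond/SMILES renderers used in the paper.

```python
# Toy sketch of the markup-to-image setup: a deterministic renderer f
# maps a token sequence x to a grayscale image y (an H x W grid of
# floats in [0, 1]). The renderer below is hypothetical.
H, W = 4, 8

def render(tokens):
    """Deterministic f: X -> Y. Token i darkens column i (toy rule)."""
    image = [[1.0] * W for _ in range(H)]  # white background
    for i, tok in enumerate(tokens[:W]):
        shade = 0.1 * (len(tok) % 5)       # toy token-dependent intensity
        for row in range(H):
            image[row][i] = shade
    return image

# Supervised pairs {(x_i, y_i)}: a model f_theta would be trained to
# approximate f from these examples alone.
corpus = [["\\frac", "{", "a", "}", "{", "b", "}"], ["x", "^", "2"]]
dataset = [(x, render(x)) for x in corpus]
```

Because f is deterministic, every generated image can be compared pixel-by-pixel against a known ground truth, which is what makes the task useful as a diagnostic benchmark.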
Challenge The markup-to-image task has several challenging properties that are not present
in other image generation benchmarks. While the images are much simpler, they behave more dis-
cretely than typical natural images. Layout mistakes by the model can lead to propagating errors
throughout the image. For example, including an extra mathematical symbol can push everything
one line further down. Some datasets also have long-term symbolic dependencies, which may be
difficult for non-sequential models to handle, analogous to some of the challenges observed in non-
autoregressive machine translation (Gu et al., 2018).
Figure 2: The generation process of diffusion (left) versus diffusion+scheduled sampling (right).
The numbers on the y-axis are the number of diffusion steps (T − t). The ground truth LaTeX is
\gamma_{n}^{\mu}=\alpha_{n}^{\mu}+\tilde{\alpha}_{n}^{\mu},~~~n\neq0.
Generation with Diffusion Models Denoising diffusion probabilistic models (DDPM) (Ho et al.,
2020) parameterize a probabilistic distribution P(y_0|x) as a Markov chain P(y_{t−1}|y_t) with an initial
distribution P(y_T). These models conditionally generate an image by sampling iteratively from the
following distributions (we omit the dependence on x for simplicity):
P(y_T) = N(0, I)
P(y_{t−1}|y_t) = N(μ_θ(y_t, t); σ_t^2 I)
where y_1, y_2, · · · , y_T are latent variables of the same size as y_0 ∈ Y, and μ_θ(·, t) is a neural network
parameterizing a map Y → Y.
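The sampling procedure above can be sketched in a few lines. Here μ_θ is replaced by a stand-in function (in the paper it is a learned neural network), the "image" is a flat vector, and the step count and noise scales are illustrative values rather than the paper's settings.

```python
import random

random.seed(0)

T = 50                        # number of diffusion steps (illustrative)
D = 4                         # toy "image": a flat vector, not an H x W grid
sigma = [0.0] + [0.02] * T    # per-step noise scales (illustrative)

def mu_theta(y, t):
    """Stand-in for the learned denoiser mu_theta(y_t, t): shrinks y
    toward zero. The real model is a neural network mapping Y -> Y."""
    return [0.9 * v for v in y]

# Ancestral sampling: draw y_T ~ N(0, I), then repeatedly sample
# y_{t-1} ~ N(mu_theta(y_t, t), sigma_t^2 I) until y_0 is reached.
y = [random.gauss(0.0, 1.0) for _ in range(D)]   # y_T
for t in range(T, 0, -1):
    mean = mu_theta(y, t)
    y = [m + sigma[t] * random.gauss(0.0, 1.0) for m in mean]
```

Note that each y_{t−1} depends only on y_t, which is exactly the sequential structure that lets small early mistakes compound, as discussed below.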
Diffusion models have proven to be effective for generating realistic images (Nichol et al., 2021;
Saharia et al., 2022; Ramesh et al., 2022) and are more stable to train than alternative approaches
for image generation such as Generative Adversarial Networks (Goodfellow et al., 2014). Diffusion
models are surprisingly effective on the markup-to-image datasets as well. However, despite gener-
ating realistic images, they make major mistakes in the layout and positioning of the symbols. For
an example of these mistakes see Figure 2 (left).
We attribute these mistakes to error propagation in the sequential Markov chain. Small mistakes
early in the sampling process can lead to intermediate states y_t that have diverged significantly
from the model's observed distribution during training. This issue has been widely studied in the
inverse RL and autoregressive token generation literature, where it is referred to as exposure bias
(Ross et al., 2011; Ranzato et al., 2015).
3 SCHEDULED SAMPLING FOR DIFFUSION MODELS
In this work, we adapt scheduled sampling, a simple and effective method based on DAgger (Ross
et al., 2011; Bengio et al., 2015) from discrete autoregressive models to the training procedure of
diffusion models. The core idea is to replace the standard training procedure with a biased sampling
approach that mimics the test-time model inference based on its own predictions. Before describing
this approach, we first give a short background on training diffusion models.
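Before the formal background, the core idea can be sketched as follows. The corruption process, reverse step, and mixing probability below are toy stand-ins chosen for illustration; they are not the paper's actual training procedure.

```python
import random

random.seed(1)

def forward_noise(y0, t):
    """Toy stand-in for the forward corruption q(y_t | y_0)."""
    return [v + 0.1 * t * random.gauss(0.0, 1.0) for v in y0]

def model_denoise_step(y, t):
    """Toy stand-in for one reverse step with the current model."""
    return [0.9 * v for v in y]

def training_input(y0, t, p_sched):
    """Scheduled sampling: with probability p_sched, obtain the training
    input y_t by corrupting to step t+1 and applying the model's own
    reverse step, so training inputs resemble what the model actually
    sees at test time; otherwise use the standard forward sample."""
    if random.random() < p_sched:
        y_next = forward_noise(y0, t + 1)
        return model_denoise_step(y_next, t + 1)  # model's own y_t
    return forward_noise(y0, t)                   # standard DDPM input

y0 = [1.0, -0.5]
y_t = training_input(y0, t=3, p_sched=0.5)
```

The probability p_sched plays the role of the schedule: starting near zero (standard training) and increasing it exposes the model to its own predictions gradually, so it learns to correct its own mistakes.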
Background: Training Diffusion Models Diffusion models are trained by maximizing an evidence
lower bound (ELBO) on the likelihood of the above Markov chain. We introduce an auxiliary Markov
chain Q(y_1, · · · , y_T | y_0) =