MARKUP-TO-IMAGE DIFFUSION MODELS
WITH SCHEDULED SAMPLING
Yuntian Deng1, Noriyuki Kojima2, Alexander M. Rush2
1Harvard University dengyuntian@seas.harvard.edu
2Cornell University {nk654,arush}@cornell.edu
ABSTRACT
Building on recent advances in image generation, we present a fully data-driven
approach to rendering markup into images. The approach is based on diffusion
models, which parameterize the distribution of data using a sequence of denoising
operations on top of a Gaussian noise distribution. We view the diffusion
denoising process as a sequential decision making process, and show that it
exhibits compounding errors similar to exposure bias issues in imitation learning
problems. To mitigate these issues, we adapt the scheduled sampling algorithm
to diffusion training. We conduct experiments on four markup datasets:
mathematical formulas (LaTeX), table layouts (HTML), sheet music (LilyPond), and
molecular images (SMILES). These experiments each verify the effectiveness of
the diffusion process and the use of scheduled sampling to fix generation issues.
These results also show that the markup-to-image task presents a useful controlled
compositional setting for diagnosing and analyzing generative image models.
1 INTRODUCTION
Recent years have witnessed rapid progress in text-to-image generation with the development and
deployment of pretrained image/text encoders (Radford et al., 2021; Raffel et al., 2020) and
powerful generative processes such as denoising diffusion probabilistic models (Sohl-Dickstein et al.,
2015; Ho et al., 2020). Most existing image generation research focuses on generating realistic
images conditioned on possibly ambiguous natural language (Nichol et al., 2021; Saharia et al., 2022;
Ramesh et al., 2022). In this work, we instead study the task of markup-to-image generation, where
the presentational markup describes exactly one-to-one what the final image should look like.
While the task of markup-to-image generation can be accomplished with standard renderers, we
argue that this task has several nice properties for acting as a benchmark for evaluating and analyzing
text-to-image generation models. First, the deterministic nature of the problem enables exposing and
analyzing generation issues in a setting with known ground truth. Second, the compositional nature
of markup language is nontrivial for neural models to capture, making it a challenging benchmark
for relational properties. Finally, developing a model-based markup renderer enables interesting
applications such as markup compilers that are resilient to typos, or that even mix natural and
structured commands (Glennie, 1960; Teitelman, 1972).
We build a collection of markup-to-image datasets shown in Figure 1: mathematical formulas, table
layouts, sheet music, and molecules (Nienhuys & Nieuwenhuizen, 2003; Weininger, 1988). These
datasets can be used to assess the ability of generation models to produce coherent outputs in a
structured environment. We then experiment with applying diffusion models, which represent the
current state of the art in conditional generation of realistic images, to these tasks.
The markup-to-image challenge exposes a new class of generation issues. For example, when
generating formulas, current models generate perfectly formed output, but often duplicate or
misplace symbols (see Figure 2). This type of error is similar to the widely studied exposure bias
issue in autoregressive text generation (Ranzato et al., 2015). To help the model fix this class of
errors during the generation process, we propose to adapt scheduled sampling (Bengio et al., 2015).
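As a rough illustration of the idea, the sketch below adds a scheduled-sampling branch to a standard DDPM training step: with some probability, the noisy input x_t is produced by rolling the model back one step from x_{t+1} rather than by the forward process, so the denoiser is trained on inputs that resemble its own intermediate samples. The model signature, noise schedule, single-step rollout, and re-derived noise target are illustrative assumptions, not the exact recipe proposed in this paper.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    # Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def p_sample(model, x, t, cond):
    # One reverse (denoising) step x_t -> x_{t-1} using the model's own prediction.
    eps_hat = model(x, t, cond)  # hypothetical signature: (image, timestep, markup condition)
    a = alphas[t].view(-1, 1, 1, 1)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    mean = (x - (1.0 - a) / (1.0 - a_bar).sqrt() * eps_hat) / a.sqrt()
    return mean + betas[t].view(-1, 1, 1, 1).sqrt() * torch.randn_like(x)

def training_step(model, x0, cond, rollout_prob=0.5):
    # x0: clean images (b, c, h, w); cond: encoded markup; rollout_prob: scheduled-sampling rate.
    b = x0.size(0)
    t = torch.randint(0, T - 1, (b,))  # leave room for t + 1
    if torch.rand(()) < rollout_prob:
        # Scheduled-sampling branch: obtain x_t by denoising the model's own input x_{t+1}.
        x_tp1 = q_sample(x0, t + 1, torch.randn_like(x0))
        with torch.no_grad():
            x_t = p_sample(model, x_tp1, t + 1, cond)
        # Noise target implied by treating x_t as if it came from the forward process.
        a_bar = alpha_bars[t].view(-1, 1, 1, 1)
        noise = (x_t - a_bar.sqrt() * x0) / (1.0 - a_bar).sqrt()
    else:
        # Standard DDPM branch: x_t drawn from the forward process.
        noise = torch.randn_like(x0)
        x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t, cond), noise)
```

In this sketch the rollout is wrapped in torch.no_grad(), so gradients flow only through the final denoising prediction rather than through the sampling step, which keeps the training cost close to that of the standard objective.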