Visualize Before You Write:
Imagination-Guided Open-Ended Text Generation
Wanrong Zhu¶, An Yan†, Yujie Lu¶, Wenda Xu¶,
Xin Eric Wang§, Miguel Eckstein¶, William Yang Wang¶
¶UC Santa Barbara, †UC San Diego, §UC Santa Cruz
{wanrongzhu,yujielu,wendaxu,william}@cs.ucsb.edu, ayan@ucsd.edu
xwang366@ucsc.edu, miguel.eckstein@psych.ucsb.edu
Abstract
Recent advances in text-to-image synthesis make it possible to visualize machine imaginations for a given context. On the other hand, when generating text, human writers are gifted at creative visualization, which enhances their writings by forming imaginations as blueprints before putting down the stories in words. Inspired by such a cognitive process, we ask the natural question of whether we can endow machines with the same ability to utilize visual information and construct a general picture of the context to guide text generation. In this work, we propose iNLG, which uses machine-generated images to guide language models (LMs) in open-ended text generation. The experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation in both few-shot and full-data scenarios. Both automatic metrics and human evaluations verify that the text snippets generated by our iNLG are coherent and informative while displaying minor degeneration.¹
1 Introduction
One great resource human writers cherish is the ability of imagination, with which they render mental images about an actual or vicarious experience and link knowledge that later makes the writing more concrete, sensible, and intriguing. Cognitive studies show that visual imagery improves comprehension during language processing (Gambrell and Bales, 1986; Joffe et al., 2007; Sadoski and Paivio, 2000), and that mental imagery facilitates humans' written language expression at young ages (Gambrell and Koskinen, 2002).
When it comes to the study of Artificial Intelligence (AI), one classic challenge for AI systems is to generate informative and coherent text snippets. Open-ended text generation is such a task: it provides an input context and asks the model to generate a piece of text that is consistent with the context. This is the cornerstone of a wide range of downstream tasks such as text completion (Guan et al., 2019; Radford et al., 2019), story generation (Fan et al., 2018; Goldfarb-Tarrant et al., 2020; Swanson et al., 2021; Su et al., 2022b), and dialogue systems (Schatzmann et al., 2007; Wen et al., 2015, 2017; Wei et al., 2018; Wu et al., 2021), and it has received much attention throughout the years.

[Figure 1: When performing open-ended text generation, LMs prompted with text-only input may generate repetitive or unilluminating content, which is also known as degeneration. We propose to use machine-generated images as additional visual supervision to guide the LMs in generating text that is more informative and coherent with the given context. In the example, the text-only continuation of a cooking context ("... and the individual adds them to the pan.") is repetitive and uninformative, while the continuation guided by machine imagination ("... and stirs them into the soup.") is not; two further contexts (beach volleyball; a boy drinking mouthwash) illustrate retrieved vs. generated images.]

¹Our code & data: https://github.com/VegB/iNLG.
Inspired by human writers' common practice of creative visualization, we ask the following question: Can we endow machines with the same ability to construct a general picture of the context and use it as a blueprint to guide text generation?
Recent advances in text-to-image generation make it possible to visualize machine imaginations for a given context (Ramesh et al., 2021; Rombach et al., 2022; Crowson et al., 2022; Wang et al., 2022b; Saharia et al., 2022). Moreover, this line of work shows great potential in utilizing textual information to guide image synthesis. It is therefore natural to attempt to complete the loop by using visual supervision to guide text generation.
In this work, we propose using machine-generated images to guide the language model (LM) in open-ended text generation. More specifically, we visualize the machine imagination for the input context by rendering images with StableDiffusion (Rombach et al., 2022), a state-of-the-art text-to-image generator. The machine imagination acts as additional visual supervision to guide LMs in generating informative and coherent text in two ways. First, the machine-generated images are introduced as input to the LM in the form of a visual prefix. Second, we design a contrastive training objective that enforces the generated text to be semantically similar to the visual supervision.
We conduct experiments on three open-ended text generation tasks, namely text completion, story generation, and concept-to-text generation. Extensive experiments in few-shot settings show performance better than or competitive with state-of-the-art baselines on both automatic metrics and human evaluation. Experiments in full-data settings show that introducing machine-generated visual supervision with our iNLG yields consistent improvements across various LMs, including GPT-2 (Radford et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020).
Our main contributions are as follows:

• We introduce a novel paradigm that leverages machine-generated images to guide open-ended text generation. This endows machines with the ability of creative visualization that human writers often demonstrate.

• We distill visual information from pre-trained multimodal models and construct visual prefixes to guide language models in text generation, trained with teacher-forcing and contrastive objectives.

• Extensive experiments show the effectiveness of iNLG as a model-agnostic framework on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation, in both few-shot and full-data settings.
2 Related Work
Open-ended Conditional Text Generation
This is the task of generating a coherent portion of text based on the given context. Recent advances in pre-trained models have pushed the frontier of open-ended conditional text generation, such as text completion (See et al., 2019; Ippolito et al., 2020), story generation (Guan et al., 2020; Fan et al., 2018; Yao et al., 2019), and concept-to-text generation (Zhou et al., 2021; Liu et al., 2021). Despite the success of large language models, text degeneration and semantic coverage remain two core technical challenges in few-shot open-ended text generation. To improve text coverage, StoryEndGen (Guan et al., 2019) leverages a knowledge graph to encode the context sequentially. Fan et al. (2018) and Yao et al. (2019) plan the content (premise or keywords) first and then generate conditioned on the planned content. To mitigate text degeneration, SimCTG (Su et al., 2022b) uses a contrastive training strategy to encourage the model to learn isotropic token embeddings. Similar to our approach, Wang et al. (2022a) generate a scene graph for each concept and combine them with the text as model input. Previous work has proposed to add visual information to LMs by retrieving images from the Internet or from large-scale image sets (Yang et al., 2020; Cho et al., 2021; Su et al., 2022a). However, the retrieved images may fail to fully match the context, which can misguide the LM away from contextually consistent predictions.² Unlike prior work, our approach leverages images generated conditioned on the context to assist the text generation process.
Visually-aided NLP
Recent work shows the power of visual guidance in natural language processing, spanning language representation learning (Lu et al., 2019; Li et al., 2019; Sun et al., 2019; Luo et al., 2020; Chen et al., 2020; Li et al., 2020; Tan and Bansal, 2020; Lu et al., 2022), downstream tasks (Grubinger et al., 2006; Elliott et al., 2016; Xie et al., 2019; Christie et al., 2016; Shi et al., 2019; Lu et al., 2022), and evaluation (Zhu et al., 2021). These methods either leverage visual information from an external vision-and-language corpus or obtain such visual knowledge from large pre-trained models. In this line of work, imagination achieves promising performance in various NLP domains (Long et al., 2021; Zhu et al., 2021; Wang et al., 2022a; Lu et al., 2022). Previous imagination-based work in NLP either studies non-generation problems (Zhu et al., 2021; Lu et al., 2022) or utilizes non-visual information (Long et al., 2021; Wang et al., 2022a). Our work explores the potential of generating visual imaginations to improve open-ended text generation tasks.

²Figure 8 shows examples where the image retrieved from the search engine is irrelevant to the input context.
[Figure 2: An overview of our iNLG. The pipeline involves a diffusion model, an autoencoder decoder, a CLIP visual encoder, a mapping network, a projection layer, and the language model. Given an input context $x$ (e.g., "A man is seen skiing behind a boat. He holds on tight as he is pulled through the water. The man ..."), we first visualize the context with the text-to-image generation model. Then we use the machine-generated image $I$ as additional visual supervision to guide the language model in open-ended text generation. The visual feature is provided as a source of input to the LM in the form of a visual prefix. Aside from the teacher-forcing objective $\mathcal{L}_{\mathrm{teacher}}$, we also enforce the LM to generate text $\hat{y}$ that is semantically similar to the machine imagination with a contrastive training objective $\mathcal{L}_{\mathrm{contrastive}}$.]
3 Method
3.1 Overview
Open-ended text generation is a task that provides an input context and asks the model to generate a piece of text that is consistent with the context.

This work focuses on introducing machine-rendered images to assist the LM in performing open-ended text generation. More specifically, given the context $x_i$, we first use a text-to-image generator to illustrate an image $I_i$ that depicts the input context. The LM is prompted with the image $I_i$ as a visual prefix along with the text context $x_i$, and incorporates this multimodal input to generate the output text $\hat{y}_i$.
Figure 2 provides an overview of our iNLG framework, which mainly involves two modules. The first module is a text-to-image generator that takes in the input context and illustrates a descriptive image, which we also refer to as the machine imagination. The second module is a visually-guided language model that utilizes the machine imagination both as a source of input and as a supervision signal that encourages the LM to generate text semantically similar to the visual information.
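For concreteness, the sketch below summarizes this two-module pipeline in Python. It is not the authors' implementation; the three callables are hypothetical stand-ins for the components detailed in Sections 3.2 and 3.3.

```python
# A minimal sketch of the two-module iNLG inference pipeline (not the authors' code).
from typing import Any, Callable

def inlg_generate(
    context: str,
    render_image: Callable[[str], Any],               # Section 3.2: text-to-image rendering
    encode_visual_prefix: Callable[[Any], Any],        # Section 3.3: CLIP encoder + mapping network
    generate_with_prefix: Callable[[Any, str], str],   # LM conditioned on [visual prefix; context]
) -> str:
    image = render_image(context)                # machine imagination for the input context
    visual_prefix = encode_visual_prefix(image)  # project visual features into the LM space
    return generate_with_prefix(visual_prefix, context)
```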
3.2 Text-to-Image Rendering
In this work, we propose to use images generated by the machine, conditioned on the context, as additional visual information for the LM. The text-to-image generation backbone is StableDiffusion (Rombach et al., 2022), which mainly consists of a text encoder, a diffusion model, and an autoencoder. The text encoder is taken from the frozen CLIP ViT-L/14 (Radford et al., 2021) and encodes the input text into textual embeddings. The diffusion model uses a UNet (Ronneberger et al., 2015) to provide noise estimation; the UNet is modified so as to attend to the input textual embeddings. The encoder of the pretrained autoencoder encodes images into lower-resolution latent maps $z_T$. At each step $t$, the diffusion model provides a noise estimate and modifies $z_t$ correspondingly. The decoder of the pretrained autoencoder takes the final noise-free latent map $z$ and generates the image prediction. StableDiffusion is trained on LAION-5B (Schuhmann et al., 2022).
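As an illustration of this rendering step, the snippet below generates an image for a context with Stable Diffusion through the Hugging Face diffusers library. The specific checkpoint, number of sampling steps, and guidance scale are assumptions for demonstration, not the paper's exact configuration.

```python
# Illustrative text-to-image rendering with Stable Diffusion (diffusers library).
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; the paper does not tie iNLG to this particular weight release.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

context = "A man is seen skiing behind a boat. He holds on tight as he is pulled through the water."
image = pipe(context, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("machine_imagination.png")  # later used as visual supervision for the LM
```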
3.3 Visually Guided Text Generation
Visual Prefix Construction
One can encode the visual information with pre-trained visual models. However, such visual embeddings may lie in a representation space different from that of the LM due to the discrepancy between models. One way of introducing features extracted by another network into the current model is through feature mapping (Mokady et al., 2021). With a dataset of image-text pairs $(I', x')$, we can pre-train a mapping network $F$ for a given LM in an image-captioning formulation. More specifically, we encode $I'$ with the visual encoder $\mathrm{Enc}_{\mathrm{visual}}$ and receive its visual features $v'$. Then we apply the mapping network $F$ over $v'$ and receive a sequence of $l$ visual prefixes:

$$c'_1, c'_2, \ldots, c'_l = F(v') = F(\mathrm{Enc}_{\mathrm{visual}}(I')) \tag{1}$$
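The sketch below shows one plausible shape for such a mapping network in PyTorch, following the feature-mapping idea of Eq. (1). The two-layer MLP, the prefix length $l$, and the feature dimensions are illustrative assumptions rather than the paper's reported architecture or hyperparameters.

```python
# Visual prefix construction: map CLIP visual features v' to l prefix embeddings
# c'_1 ... c'_l in the LM's embedding space (illustrative architecture).
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, clip_dim: int = 768, lm_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = lm_dim * prefix_len // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, clip_dim) features from Enc_visual; output: (batch, l, lm_dim)
        return self.mlp(v).view(-1, self.prefix_len, self.lm_dim)

# Usage: prepend the visual prefix to the embedded text context before feeding the LM.
F_map = MappingNetwork()
v = torch.randn(1, 768)       # placeholder CLIP image feature
visual_prefix = F_map(v)      # shape: (1, 10, 768)
```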