Visualize Before You Write:
Imagination-Guided Open-Ended Text Generation
Wanrong Zhu¶, An Yan†, Yujie Lu¶, Wenda Xu¶,
Xin Eric Wang§, Miguel Eckstein¶, William Yang Wang¶
¶UC Santa Barbara, †UC San Diego, §UC Santa Cruz
{wanrongzhu,yujielu,wendaxu,william}@cs.ucsb.edu, ayan@ucsd.edu
xwang366@ucsc.edu, miguel.eckstein@psych.ucsb.edu
Abstract
Recent advances in text-to-image synthesis make it possible to visualize machine imaginations for a given context. On the other hand, when generating text, human writers are gifted at creative visualization, which enhances their writings by forming imaginations as blueprints before putting down the stories in words. Inspired by such a cognitive process, we ask the natural question of whether we can endow machines with the same ability to utilize visual information and construct a general picture of the context to guide text generation. In this work, we propose iNLG, which uses machine-generated images to guide language models (LMs) in open-ended text generation. The experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation in both few-shot and full-data scenarios. Both automatic metrics and human evaluations verify that the text snippets generated by our iNLG are coherent and informative while displaying minor degeneration.¹
1 Introduction
One great resource human writers cherish is the ability of imagination, with which they render mental images about an actual or vicarious experience and link knowledge that later makes the writing more concrete, sensible, and intriguing. Cognitive studies show that visual imagery improves comprehension during language processing (Gambrell and Bales, 1986; Joffe et al., 2007; Sadoski and Paivio, 2000), and that mental imagery facilitates humans' written language expression at young ages (Gambrell and Koskinen, 2002).
When it comes to the study of Artificial Intelligence (AI), one classic challenge for AI systems is to generate informative and coherent text snippets. Open-ended text generation is such a task: it provides an input context and asks the model to generate a piece of text that is consistent with the context. This is the cornerstone of a wide range of downstream tasks such as text completion (Guan et al., 2019; Radford et al., 2019), story generation (Fan et al., 2018; Goldfarb-Tarrant et al., 2020; Swanson et al., 2021; Su et al., 2022b), and dialogue systems (Schatzmann et al., 2007; Wen et al., 2015, 2017; Wei et al., 2018; Wu et al., 2021), and it has received much attention throughout the years.

[Figure 1: When performing open-ended text generation, LMs prompted with text-only input may generate repetitive or unilluminating content, which is also known as degeneration. We propose to use machine-generated images as additional visual supervision to guide the LMs in generating text that is more informative and coherent with the given context. In the example, the text-only continuation of a cooking context ("... and the individual adds them to the pan.") is repetitive and uninformative, while the continuation guided by machine imagination ("... and stirs them into the soup.") is not; two further contexts (beach volleyball; a boy drinking mouthwash) illustrate retrieved vs. generated images.]

¹Our code & data: https://github.com/VegB/iNLG.
Inspired by human writers' common practice of creative visualization, we ask the following question: Can we endow machines with the same ability to construct a general picture of the context and use it as a blueprint to guide text generation?
Recent advances in text-to-image generation make it possible to visualize machine imaginations for a given context (Ramesh et al., 2021; Rombach et al., 2022; Crowson et al., 2022; Wang et al., 2022b; Saharia et al., 2022). Moreover, this line of work shows great potential in utilizing textual information to guide image synthesis. It is therefore natural to attempt to complete the loop by using visual supervision to guide text generation.
In this work, we propose using machine-generated images to guide the language model (LM) in open-ended text generation. More specifically, we visualize the machine imagination for the input context by rendering images with StableDiffusion (Rombach et al., 2022), a state-of-the-art text-to-image generator. The machine imagination acts as additional visual supervision to guide LMs in generating informative and coherent text in two ways. First, the machine-generated images are introduced as input to the LM in the form of a visual prefix. Second, we design a contrastive training objective that enforces the generated text to be semantically similar to the visual supervision.
We conduct experiments on three open-ended text generation tasks, namely text completion, story generation, and concept-to-text generation. Extensive experiments in few-shot settings show performance better than or competitive with state-of-the-art baselines on both automatic metrics and human evaluation. Experiments in full-data settings show that introducing machine-generated visual supervision with our iNLG yields consistent improvements across various LMs, including GPT-2 (Radford et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020).
Our main contributions are as follows:

• We introduce a novel paradigm that leverages machine-generated images to guide open-ended text generation. This endows machines with the ability of creative visualization that human writers often demonstrate.

• We distill visual information from pre-trained multimodal models and construct visual prefixes to guide language models in text generation, trained with teacher-forcing and contrastive objectives.

• Extensive experiments show the effectiveness of iNLG as a model-agnostic framework on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation, in both few-shot and full-data settings.
2 Related Work
Open-ended Conditional Text Generation
This is the task of generating a coherent portion of text based on the given context. Recent advances in pre-trained models have pushed the frontier of open-ended conditional text generation, such as text completion (See et al., 2019; Ippolito et al., 2020), story generation (Guan et al., 2020; Fan et al., 2018; Yao et al., 2019), and concept-to-text generation (Zhou et al., 2021; Liu et al., 2021). Despite the success of large language models, text degeneration and semantic coverage remain two core technical challenges in few-shot open-ended text generation. To improve text coverage, StoryEndGen (Guan et al., 2019) leverages a knowledge graph to encode the context sequentially. Fan et al. (2018) and Yao et al. (2019) plan the content (premise or keywords) first and then generate conditioned on the planned content. To mitigate text degeneration, SimCTG (Su et al., 2022b) uses a contrastive training strategy to encourage the model to learn isotropic token embeddings. Similar to our approach, Wang et al. (2022a) generate a scene graph for each concept and combine them with the text as model input. Previous work has proposed to add visual information to LMs by retrieving images from the Internet or from large-scale image sets (Yang et al., 2020; Cho et al., 2021; Su et al., 2022a). However, the retrieved images may fail to fully match the context, which can misguide the LM away from contextually consistent predictions.² Unlike prior work, our approach leverages images generated conditioned on the context to assist the text generation process.
Visually-aided NLP
Recent work shows the power of visual guidance in natural language processing, spanning language representation learning (Lu et al., 2019; Li et al., 2019; Sun et al., 2019; Luo et al., 2020; Chen et al., 2020; Li et al., 2020; Tan and Bansal, 2020; Lu et al., 2022), downstream tasks (Grubinger et al., 2006; Elliott et al., 2016; Xie et al., 2019; Christie et al., 2016; Shi et al., 2019; Lu et al., 2022), and evaluation (Zhu et al., 2021). These methods either leverage visual information from an external vision-and-language corpus or obtain such visual knowledge from large pre-trained models. In this line of work, imagination achieves promising performance in various NLP domains (Long et al., 2021; Zhu et al., 2021; Wang et al., 2022a; Lu et al., 2022). Previous imagination-based work in NLP either studies non-generation problems (Zhu et al., 2021; Lu et al., 2022) or utilizes non-visual information (Long et al., 2021; Wang et al., 2022a). Our work explores the potential of generating visual imaginations to improve open-ended text generation tasks.

²Figure 8 shows examples where the image retrieved from the search engine is irrelevant to the input context.
[Figure 2: An overview of our iNLG. The pipeline involves a diffusion model, an autoencoder decoder, a CLIP visual encoder, a mapping network, a projection layer, and the language model. Given an input context $x$ (e.g., "A man is seen skiing behind a boat. He holds on tight as he is pulled through the water. The man ..."), we first visualize the context with the text-to-image generation model. Then we use the machine-generated image $I$ as additional visual supervision to guide the language model in open-ended text generation. The visual feature is provided as a source of input to the LM in the form of a visual prefix. Aside from the teacher-forcing objective $\mathcal{L}_{\mathrm{teacher}}$, we also enforce the LM to generate text $\hat{y}$ that is semantically similar to the machine imagination with a contrastive training objective $\mathcal{L}_{\mathrm{contrastive}}$.]
3 Method
3.1 Overview
Open-ended text generation is a task that provides an input context and asks the model to generate a piece of text that is consistent with the context.

This work focuses on introducing machine-rendered images to assist the LM in performing open-ended text generation. More specifically, given the context $x_i$, we first use a text-to-image generator to illustrate an image $I_i$ that depicts the input context. The LM is prompted with the image $I_i$ as a visual prefix along with the text context $x_i$, and incorporates this multimodal input to generate the output text $\hat{y}_i$.
Figure 2 provides an overview of our iNLG framework, which mainly involves two modules. The first module is a text-to-image generator that takes in the input context and illustrates a descriptive image, which we also refer to as the machine imagination. The second module is a visually-guided language model that utilizes the machine imagination both as a source of input and as a supervision signal that encourages the LM to generate text semantically similar to the visual information.
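For concreteness, the sketch below summarizes this two-module pipeline in Python. It is not the authors' implementation; the three callables are hypothetical stand-ins for the components detailed in Sections 3.2 and 3.3.

```python
# A minimal sketch of the two-module iNLG inference pipeline (not the authors' code).
from typing import Any, Callable

def inlg_generate(
    context: str,
    render_image: Callable[[str], Any],               # Section 3.2: text-to-image rendering
    encode_visual_prefix: Callable[[Any], Any],        # Section 3.3: CLIP encoder + mapping network
    generate_with_prefix: Callable[[Any, str], str],   # LM conditioned on [visual prefix; context]
) -> str:
    image = render_image(context)                # machine imagination for the input context
    visual_prefix = encode_visual_prefix(image)  # project visual features into the LM space
    return generate_with_prefix(visual_prefix, context)
```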
3.2 Text-to-Image Rendering
In this work, we propose to use images generated by the machine, conditioned on the context, as additional visual information for the LM. The text-to-image generation backbone is StableDiffusion (Rombach et al., 2022), which mainly consists of a text encoder, a diffusion model, and an autoencoder. The text encoder is taken from the frozen CLIP ViT-L/14 (Radford et al., 2021) and encodes the input text into textual embeddings. The diffusion model uses a UNet (Ronneberger et al., 2015) to provide noise estimation; the UNet is modified so as to attend to the input textual embeddings. The encoder of the pretrained autoencoder encodes images into lower-resolution latent maps $z_T$. At each step $t$, the diffusion model provides a noise estimate and modifies $z_t$ correspondingly. The decoder of the pretrained autoencoder takes the final noise-free latent map $z$ and generates the image prediction. StableDiffusion is trained on LAION-5B (Schuhmann et al., 2022).
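As an illustration of this rendering step, the snippet below generates an image for a context with Stable Diffusion through the Hugging Face diffusers library. The specific checkpoint, number of sampling steps, and guidance scale are assumptions for demonstration, not the paper's exact configuration.

```python
# Illustrative text-to-image rendering with Stable Diffusion (diffusers library).
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; the paper does not tie iNLG to this particular weight release.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

context = "A man is seen skiing behind a boat. He holds on tight as he is pulled through the water."
image = pipe(context, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("machine_imagination.png")  # later used as visual supervision for the LM
```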
3.3 Visually Guided Text Generation
Visual Prefix Construction
One can encode the visual information with pre-trained visual models. However, such visual embeddings may lie in a representation space different from that of the LM due to the discrepancy between models. One way of introducing features extracted by another network into the current model is through feature mapping (Mokady et al., 2021). With a dataset of image-text pairs $(I', x')$, we can pre-train a mapping network $F$ for a given LM in an image-captioning formulation. More specifically, we encode $I'$ with the visual encoder $\mathrm{Enc}_{\mathrm{visual}}$ and receive its visual features $v'$. Then we apply the mapping network $F$ over $v'$ and receive a sequence of $l$ visual prefixes:

$$c'_1, c'_2, \ldots, c'_l = F(v') = F(\mathrm{Enc}_{\mathrm{visual}}(I')) \tag{1}$$
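The sketch below shows one plausible shape for such a mapping network in PyTorch, following the feature-mapping idea of Eq. (1). The two-layer MLP, the prefix length $l$, and the feature dimensions are illustrative assumptions rather than the paper's reported architecture or hyperparameters.

```python
# Visual prefix construction: map CLIP visual features v' to l prefix embeddings
# c'_1 ... c'_l in the LM's embedding space (illustrative architecture).
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, clip_dim: int = 768, lm_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = lm_dim * prefix_len // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, clip_dim) features from Enc_visual; output: (batch, l, lm_dim)
        return self.mlp(v).view(-1, self.prefix_len, self.lm_dim)

# Usage: prepend the visual prefix to the embedded text context before feeding the LM.
F_map = MappingNetwork()
v = torch.randn(1, 768)       # placeholder CLIP image feature
visual_prefix = F_map(v)      # shape: (1, 10, 768)
```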