
generated images to guide the language model (LM) in open-ended text generation. More specifically, we visualize machine imagination for the input context by rendering images with Stable Diffusion (Rombach et al., 2022), a state-of-the-art text-to-image generator. The machine imagination acts as additional visual supervision that guides the LM toward generating informative and coherent text in two ways. First, the machine-generated images are introduced as input to the LM in the form of a visual prefix. Second, we design a contrastive training objective that enforces the generated text to be semantically similar to the visual supervision.
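To make these two mechanisms concrete, the following PyTorch-style sketch illustrates one plausible realization: a small mapping network that projects an image feature (e.g., from a frozen CLIP image encoder) into a visual prefix prepended to the LM input embeddings, and an InfoNCE-style contrastive loss aligning the generated-text representation with the image feature. All module names, dimensions, prefix length, and the loss weighting are illustrative assumptions, not the exact implementation described in this paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualPrefixMapper(nn.Module):
        """Hypothetical mapper: projects a frozen image feature into a short
        sequence of 'visual prefix' embeddings in the LM's embedding space."""
        def __init__(self, image_dim=512, lm_dim=768, prefix_len=5):
            super().__init__()
            self.prefix_len = prefix_len
            self.proj = nn.Linear(image_dim, lm_dim * prefix_len)

        def forward(self, image_feat):                 # (B, image_dim)
            prefix = self.proj(image_feat)             # (B, lm_dim * prefix_len)
            return prefix.view(image_feat.size(0), self.prefix_len, -1)

    def contrastive_loss(text_feat, image_feat, temperature=0.07):
        """InfoNCE-style objective: pull each text representation toward the
        feature of its own machine-generated image, push it away from the
        other images in the batch."""
        text_feat = F.normalize(text_feat, dim=-1)
        image_feat = F.normalize(image_feat, dim=-1)
        logits = text_feat @ image_feat.t() / temperature   # (B, B)
        targets = torch.arange(text_feat.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)

    # Illustrative training step (names such as lambda_c are assumptions):
    # prefix = mapper(image_feat)                                  # (B, L_p, lm_dim)
    # inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
    # lm_loss = language_model(inputs_embeds=inputs_embeds, labels=labels).loss
    # loss = lm_loss + lambda_c * contrastive_loss(text_feat, image_feat)

In this sketch the LM loss corresponds to standard teacher forcing over the target tokens, while the contrastive term supplies the image-grounded supervision; the total loss is a weighted sum of the two.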
We conduct experiments on three open-ended text generation tasks: text completion, story generation, and concept-to-text generation. Extensive experiments in the few-shot setting show better or competitive performance compared with state-of-the-art baselines on both automatic metrics and human evaluation. Experiments in the full-data setting show that introducing machine-generated visual supervision with our iNLG yields consistent improvements over various LMs, including GPT-2 (Radford et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020).
Our main contributions are as follows:
• We introduce a novel paradigm that leverages machine-generated images to guide open-ended text generation. This endows machines with the ability of creative visualization that human writers often demonstrate.
• We distill vision information from pre-trained multimodal models and construct visual prefixes to guide language models in text generation with teacher-forcing and contrastive objectives.
• Extensive experiments show the effectiveness of iNLG as a model-agnostic framework on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation, in both few-shot and full-data settings.
2 Related Work
Open-ended Conditional Text Generation is the task of generating a coherent portion of text based on the given context. Recent advances in pre-trained models have pushed the frontier of open-ended conditional text generation, such as text completion (See et al., 2019; Ippolito et al., 2020), story generation (Fan et al., 2018; Yao et al., 2019; Guan et al., 2020), and concept-to-text generation (Zhou et al., 2021; Liu et al., 2021). Despite the success of large language models, text degeneration and semantic coverage remain two core technical challenges in few-shot open-ended text generation. To improve text coverage, StoryEndGen (Guan et al., 2019) leverages a knowledge graph to encode the context sequentially. Fan et al. (2018) and Yao et al. (2019) plan the content (premise or keywords) first and then generate based on the planned content. To mitigate text degeneration, SimCTG (Su et al., 2022b) uses a contrastive training strategy that encourages the model to learn isotropic token embeddings. Similar to our approach, Wang et al. (2022a) generate a scene graph for each concept and combine it with the text as model input. Previous work has also proposed adding visual information to the LM by retrieving images from the Internet or from large-scale image sets (Yang et al., 2020; Cho et al., 2021; Su et al., 2022a). However, the retrieved images may fail to fully reflect the context, which can misguide the LM away from contextually consistent predictions.² Unlike prior work, our approach leverages images generated conditioned on the context to assist the text generation process.
Visually-aided NLP Recent work shows the power of visual guidance in natural language processing, spanning language representation learning (Lu et al., 2019; Li et al., 2019; Sun et al., 2019; Luo et al., 2020; Chen et al., 2020; Li et al., 2020; Tan and Bansal, 2020; Lu et al., 2022), downstream tasks (Grubinger et al., 2006; Elliott et al., 2016; Christie et al., 2016; Shi et al., 2019; Xie et al., 2019; Lu et al., 2022), and evaluation (Zhu et al., 2021). These methods either leverage visual information from an external vision-and-language corpus or obtain such visual knowledge from large pre-trained models. In this line of work, imagination achieves promising performance in various NLP domains (Long et al., 2021; Zhu et al., 2021; Wang et al., 2022a; Lu et al., 2022). Previous imagination-based work in NLP either studies non-generation problems (Zhu et al., 2021; Lu et al., 2022) or utilizes non-visual information (Long et al., 2021; Wang et al., 2022a). Our work explores the potential of generating visual imagination to improve open-ended text generation tasks.
² Figure 8 shows examples where the image retrieved from the search engine is irrelevant to the input context.