Re3: Generating Longer Stories With Recursive Reprompting and
Revision
Kevin Yang1Yuandong Tian2Nanyun Peng3Dan Klein1
1UC Berkeley, 2Meta AI, 3UCLA
{yangk,klein}@berkeley.edu, yuandong@meta.com, violetpeng@cs.ucla.edu
Abstract
We consider the problem of automatically generating longer stories of over two thousand words. Compared to prior work on shorter stories, long-range plot coherence and relevance are more central challenges here. We propose the Recursive Reprompting and Revision framework (Re3) to address these challenges by (a) prompting a general-purpose language model to construct a structured overarching plan, and (b) generating story passages by repeatedly injecting contextual information from both the plan and current story state into a language model prompt. We then revise by (c) reranking different continuations for plot coherence and premise relevance, and finally (d) editing the best continuation for factual consistency. Compared to similar-length stories generated directly from the same base model, human evaluators judged substantially more of Re3's stories as having a coherent overarching plot (by 14% absolute increase), and relevant to the given initial premise (by 20%).
1 Introduction
Generating long-term coherent stories is a long-standing challenge for artificial intelligence, requiring a comprehensive grasp of linguistic, world, and commonsense knowledge (Charniak, 1972; Turner, 1994). Recently, many works have automatically generated short stories ranging in length from five sentences to one or two paragraphs (Fan et al., 2018; Yao et al., 2019; Goldfarb-Tarrant et al., 2020; Rashkin et al., 2020; Han et al., 2022). While stories of such length serve as a good test bed for text generation, they are much shorter than typical short stories meant for human consumption, which are often several pages in length.
In this work, we aim to bridge some of this gap by generating much longer “short” stories: the final generated stories in our experiments are 2000-2500 words. We are the first to automatically generate plot-coherent stories of such length, with further length increases limited primarily by evaluation rather than technical issues.[1]

[Figure 1: High-level overview of Re3. From a premise (e.g., “A new law grad returns home to start her career, but struggles with the broken justice system.”), the Plan module generates a setting, characters, and outline by prompting a language model; the Draft module writes story continuations by prompting based on the plan and the previous story; the Rewrite module reranks story continuations for plot coherence and premise relevance; and the Edit module edits the selected continuation to maintain long-range factual consistency.]

Generating stories of such length faces qualitatively new challenges
compared to prior work on shorter stories. First, the
system must maintain a coherent overarching plot
over thousands of words. Given an initial premise,
it should maintain relevance to this premise over
thousands of words as well. Additional challenges
include preservation of narration style and avoiding
factual contradictions over a very long horizon.
Of course, recent years have also witnessed a dramatic rise in the capabilities of general-purpose (non-finetuned) large pretrained language models. Of particular note are their strong zero-shot capabilities, especially when given clever prompts (Brown et al., 2020; Kojima et al., 2022). Yet despite recent improvements, even the best models to date may still struggle with complex long-form generation, such as in our story generation task (Section 4).
In contrast, human writers successfully navigate
the myriad challenges of long-form generation on
a regular basis. We observe that a human writer
does not simply write a long document in one shot.
[1] We generate a 7500-word story in Appendix M.
Rather, he or she may (a) create a detailed plan,
then (b) draft each next passage of the document
according to that plan. He or she may then revise
by (c) rewriting passages entirely, and/or (d) post-
editing for finer details.
Motivated by this observation, we propose the Recursive Reprompting and Revision framework (Re3, Figure 1) to generate longer stories. While based on the human writing process, Re3 is a fully automatic system with no human intervention, unlike prior approaches which model the human writing process with a human in the loop (Goldfarb-Tarrant et al., 2019; Coenen et al., 2021; Lee et al., 2022). First, (a) Re3's Plan module generates a plan by prompting GPT3 (Brown et al., 2020) to augment a given premise with a setting, characters, and outline. (b) Re3's Draft module then generates each next story continuation by recursively reprompting GPT3 using a strategically crafted prompt, in a procedure which can be viewed as a generalization of chain-of-thought prompting (Kojima et al., 2022). Specifically, our prompt is dynamically reconstructed at each step by selectively manifesting contextually relevant information from the initial plan—itself generated by prompting—and the story thus far. We then divide the revision process into (c) a Rewrite module which emulates a full rewrite by reranking alternate continuations, and (d) an Edit module which makes smaller local edits to improve factual consistency with previous passages.
As an additional contribution, our Plan and Draft modules are fully zero-shot rather than trained on existing story datasets. Thus not only does Re3 generate stories an order of magnitude longer than those of prior work, but it is not limited to any particular training domain.
To evaluate Re3 for longer story generation, we compare its generated stories to similar-length stories from two GPT3-based “rolling-window” baselines (Section 4). In pairwise comparisons, human evaluators rated stories from Re3 as significantly and substantially more coherent in overarching plot (up to 14% absolute increase in the fraction deemed coherent), as well as relevant to the initial premise (up to 20%). In fact, evaluators predicted up to 83% of stories written by Re3 to be written by humans. The results indicate that Re3 can be highly effective at improving long-range coherence and premise relevance in longer story generation.[2]

[2] All code and data available at https://github.com/yangkevin2/emnlp22-re3-story-generation.
2 Related Work
Automatic Story Generation. Several previous works have modeled parts of our proposed writing process, usually one part at a time.
Most similar to our Plan module are approaches using an outline or structured schema to maintain plot coherence (Li et al., 2013; Fan et al., 2018; Yao et al., 2019; Goldfarb-Tarrant et al., 2020; Rashkin et al., 2020; Tian and Peng, 2022). Other methods for high-level planning include latent variables (Miao and Blunsom, 2016; Wang and Wan, 2019; Wang et al., 2022), coarse-to-fine slot-filling (Fan et al., 2019), and keywords and/or control codes (Peng et al., 2018; Ippolito et al., 2019; Xu et al., 2020; Lin and Riedl, 2021).
Meanwhile, our Rewrite module uses rerankers
similar to Guan et al. (2020) and Wang et al. (2020),
although we model both coherence and premise
relevance. Yu et al. (2020) iteratively edits and
improves the output like our Edit module, but we
additionally detect when edits are required.
We emphasize again the length of stories we aim to generate. In prior studies, out-of-the-box language models struggled to generate even very short stories (Holtzman et al., 2019; See et al., 2019). Although there exist datasets of relatively longer stories, such as WritingPrompts (Fan et al., 2018) and STORIUM (Akoury et al., 2020), many works still focus only on stories of about five sentences (Wang and Wan, 2019; Yao et al., 2019; Qin et al., 2019; Wang et al., 2022), even when using language models with hundreds of billions of parameters (Xu et al., 2020). Some challenges of generating longer stories are apparent in Wang et al. (2022): their method generates high-quality few-sentence stories, but their forced long text generations, while judged better than baselines’, remain confusing and repetitive. Moreover, maintaining long-range plot coherence, premise relevance, and factual consistency is substantially harder over multiple-thousand-word horizons.
Human-In-The-Loop Story Generation. In contrast to fully automatic approaches like Re3, several recent works have proposed human-interactive methods to maintain quality in longer stories (Coenen et al., 2021; Lee et al., 2022; Chung et al., 2022). Such works commonly combine both planning and revision systems (Goldfarb-Tarrant et al., 2019; Coenen et al., 2021). In principle, Re3 is also highly controllable via human interaction, as both our planning and revision systems operate nearly entirely in natural language space; however, we focus on fully automatic generation in this work.
Prompting. Numerous works have demonstrated general-purpose language models’ strong zero-shot ability on a wide variety of tasks via prompting (Brown et al., 2020; Zhong et al., 2021; Sanh et al., 2021; Ouyang et al., 2022; Wu et al., 2022). Careful prompt design can yield further gains (Lee et al., 2021; Liu et al., 2021; Kojima et al., 2022). However, most prompting methods focus on shorter-answer tasks rather than long-form generation. Instead of generating the output in one shot, our recursive reprompting procedure treats prompting as a subroutine to generate the final output in conjunction with our planning and revision infrastructure. Compared to chain-of-thought prompting approaches like Kojima et al. (2022), Re3 goes a step further by repeatedly re-composing the prompt in modular fashion, dynamically recombining the most contextually relevant parts of both the high-level plan and the story thus far.
3 Recursive Reprompting and Revision
We now describe our Recursive Reprompting and Revision framework (Re3), which decomposes the human writing process into our Plan, Draft, Rewrite, and Edit modules. See Appendix K for concrete examples of each component in practice.
3.1 Plan Module
[Figure 2: Illustration of Re3's Plan module, which prompts a language model to generate a setting, characters, and outline based on the premise; highlighting in the original figure indicates generated text. From the premise “A new law grad returns home to start her career, but struggles with the broken justice system,” the module generates a setting (“The story is set in a small town in the United States.”), character portraits (“Liza Turner is a 22-year-old woman.”; “Peyton Turner is Liza's older sister.”), and an outline (“1. Liza Turner graduates from law school. 2. She moves back to her hometown to start her career. 3. She struggles with the reality of the broken justice system.”).]
The Plan module augments a story premise with a setting, characters, and outline (Figure 2). The setting is a simple one-sentence extension of the premise, obtained by using “The story is set in” to prompt GPT3-Instruct-175B (Ouyang et al., 2022), a version of GPT3 finetuned to better follow human instructions. Next, we use GPT3-Instruct-175B to generate up to three character names and then descriptions, conditioned on the premise and setting. For names, we do rejection sampling using simple heuristics to filter out malformed outputs (Appendix A). Finally, we prompt GPT3-Instruct-175B to write a numbered outline of the story and parse the output into a list of outline points, resampling until the list is well-formed.

These plan components, themselves generated by prompting, will be repeatedly reused to compose prompts for generating story passages in the Draft module; hence recursive reprompting.
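As a rough illustration, a minimal sketch of this prompting and rejection-sampling loop might look as follows, assuming the legacy (pre-1.0) openai Python client with a GPT3-family completion model; the prompt wording, the parsing heuristic, and the simplified character-generation step are illustrative stand-ins rather than Re3's exact implementation.

```python
import re
import openai

def generate(prompt: str, model: str = "text-davinci-002", max_tokens: int = 128) -> str:
    # Thin wrapper over the legacy completion endpoint (model name is an assumption).
    resp = openai.Completion.create(model=model, prompt=prompt,
                                    max_tokens=max_tokens, temperature=0.8)
    return resp["choices"][0]["text"].strip()

def plan(premise: str, max_attempts: int = 10) -> dict:
    # (a) Setting: a one-sentence extension of the premise.
    setting = "The story is set in " + generate(f"Premise: {premise}\n\nThe story is set in")
    # (b) Characters: names then portraits (the paper's name rejection sampling omitted here).
    characters = generate(f"Premise: {premise}\nSetting: {setting}\n\n"
                          "List up to three characters with one-sentence portraits.\n\n1.")
    # (c) Outline: resample until the numbered list parses cleanly.
    for _ in range(max_attempts):
        raw = "1." + generate(f"Premise: {premise}\nSetting: {setting}\n\n"
                              "Outline the main plot points of the story.\n\n1.")
        points = [p.strip() for p in re.findall(r"\d+\.\s*([^\n]+)", raw)]
        if len(points) >= 2:  # stand-in for the paper's well-formedness heuristics
            return {"premise": premise, "setting": setting,
                    "characters": characters, "outline": points}
    raise RuntimeError("could not obtain a well-formed outline")
```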
3.2 Draft Module
[Figure 3: Illustration of the prompt constructed in Re3's Draft module to generate each next story continuation. Our recursive reprompting approach combines pieces of the plan and the previously generated story into a single prompt by concatenating the following components in order: Relevant Context (“Relevant context:” followed by character descriptions, e.g., “Liza Turner is a 22-year-old woman.”), Previous Sections’ Outlines (“Previous story summary:”), a Recent Story Summary (“Immediately before current passage:”), the Upcoming Section Outline (“In the upcoming passage,”), and Autoregressive Context (“Full text below:” followed by the immediately preceding passage).]
For each point of the outline, we will generate
several story passages before moving on to the
next outline point. Each passage is generated as a
fixed-length continuation from a structured prompt,
which is composed by our recursive reprompting
procedure as shown in Figure 3.
The prompt begins with a selection of “Relevant Context” shown at the top of Figure 3. As the story progresses, we dynamically update the list of character descriptions using a named-entity-recognition-based pipeline, which identifies new entities from each new story passage using Flair (Akbik et al., 2018) and writes descriptions using GPT3-Instruct-175B. Thus “Relevant Context” initially contains all of the premise, setting, and characters shown in Figure 2, but subsequently selects only what is most relevant to the most recent story passage using a pretrained Dense Passage Retrieval (DPR) model (Karpukhin et al., 2020).
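A rough sketch of how this pipeline might look is below, detecting new person entities with the public Flair NER tagger and scoring plan components against the most recent passage with the public DPR checkpoints; the specific model names and the top-k selection rule are assumptions, since the paper does not specify them here.

```python
import torch
from flair.data import Sentence
from flair.models import SequenceTagger
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

tagger = SequenceTagger.load("ner")  # Flair NER (Akbik et al., 2018)
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def new_characters(passage: str, known_names: set) -> list:
    # Detect person entities we have not seen before; descriptions for them
    # would then be written by prompting GPT3-Instruct-175B.
    sent = Sentence(passage)
    tagger.predict(sent)
    return [span.text for span in sent.get_spans("ner")
            if span.get_label("ner").value == "PER" and span.text not in known_names]

def select_relevant_context(components: list, recent_passage: str, k: int = 3) -> list:
    # Keep only the k plan components scoring highest against the recent passage.
    with torch.no_grad():
        q = q_enc(**q_tok(recent_passage, return_tensors="pt",
                          truncation=True)).pooler_output             # (1, d)
        c = c_enc(**c_tok(components, return_tensors="pt", padding=True,
                          truncation=True)).pooler_output             # (n, d)
    scores = (q @ c.T).squeeze(0)
    top = torch.topk(scores, min(k, len(components))).indices.tolist()
    return [components[i] for i in top]
```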
The remainder of the prompt can be viewed as a coarse-to-fine description of the previous story, following the intuition that an author needs detailed information about the most recent passage but perhaps only higher-level information about much earlier passages. As shown in Figure 3, we include “Previous Sections’ Outlines” as a very high-level summary of previous larger story sections, followed by a “Recent Story Summary” of a few penultimate passages, written by GPT3-Instruct-13B.[3] At the end we repeat verbatim the immediately preceding passage as “Autoregressive Context” from which point the story should continue. Finally, to enforce relevance to the current outline point, we include the “Current Section Outline” in the prompt just before “Autoregressive Context.”

The full prompt is then fed to GPT3-175B to generate the next story passage.[4]

[3] As economical usage of large language models is becoming increasingly important (Strubell et al., 2019), we use the 13B model where we observe it is not substantially worse.

[4] This step does not use GPT3-Instruct-175B, as we observed in preliminary experiments that an earlier version of GPT3-Instruct-175B would frequently repeat sections of the prompt. Generators other than GPT3-175B are also possible in principle: for example, retrieval-augmented architectures like RAG (Lewis et al., 2020) or architectures designed for long-range dependencies like S4 (Gu et al., 2021). However, it is critical to use a sufficiently high-quality language model: even scaling down to GPT3-13B resulted in noticeably less coherent outputs in our preliminary experiments.
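Prompt assembly itself is then just ordered concatenation. The sketch below mirrors the component order of Figure 3; the field headers are paraphrased from the figure and may differ from Re3's exact strings.

```python
def build_draft_prompt(relevant_context, previous_outline_points,
                       recent_summary, current_outline_point,
                       preceding_passage):
    # Concatenate the Figure 3 components in order, coarse-to-fine:
    # plan-derived context first, verbatim autoregressive context last.
    parts = [
        "Relevant context:\n" + "\n".join(relevant_context),
        "Previous story summary:\n" + " ".join(previous_outline_points),
        "Immediately before current passage:\n" + recent_summary,
        "In the upcoming passage, " + current_outline_point,
        "Full text below:\n" + preceding_passage,
    ]
    return "\n\n".join(parts)
```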
3.3 Rewrite Module
[Figure 4: Re3's Rewrite module reranks the Draft module's continuations for coherence and relevance. In the depicted example, a draft continuation that abruptly relocates the story to New York City receives a low combined coherence-plus-relevance score (-1.7), while a continuation consistent with the plan (“She knew Peyton was probably working late at his restaurant so he wouldn't come home early to see her, but she wouldn't put it past him to do it anyway.”) scores higher (2.0).]
The generator’s first output continuation is often low-quality, even with the planning and recursive reprompting in the Plan and Draft modules. Humans may encounter a similar problem after a first draft, particularly upon receiving feedback from others, and be forced to rewrite a passage altogether. Our Rewrite module models this rewriting process by reranking Draft module outputs based on coherence with the previous passage and relevance to the current outline point (Figure 4).
We note that this Rewrite module is the only part of Re3 which uses prior story data. All of the modules which actually generate text (Plan, Draft, and to some extent Edit) do not require prior data.
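A minimal sketch of the selection step is below, assuming each reranker exposes a score() method returning a scalar and that the two scores are simply summed; the combination rule and the candidate filters shown here are assumptions rather than Re3's stated configuration.

```python
def rewrite(continuations, previous_story, outline_point,
            coherence_model, relevance_model, filters=()):
    # Drop continuations caught by the rule-based heuristics (see "Additional
    # Heuristics"), then keep the candidate with the best combined score.
    candidates = [c for c in continuations if all(ok(c) for ok in filters)]
    return max(candidates,
               key=lambda c: coherence_model.score(previous_story, c)
                           + relevance_model.score(outline_point, c))
```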
Coherence Reranker. We train a discriminative model to predict whether a continuation is coherent with the previous story. As data, we split stories from the WritingPrompts dataset (Fan et al., 2018) into passages up to 1000 tokens long, labeling the ending up to 200 tokens as the gold continuation. Inspired by the contrastive learning setup of Wang et al. (2020) and Guan et al. (2020), we obtain negative examples by replacing the gold continuation with a random other continuation from either the same story or a different one. We then finetune a pretrained Longformer-Base (Beltagy et al., 2020) to classify whether a continuation is the true continuation for a given passage.
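The sketch below shows the contrastive data construction and the classifier, using the public allenai/longformer-base-4096 checkpoint; training hyperparameters are omitted and the pairing details are simplified relative to the paper's setup.

```python
import random
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2)

def coherence_examples(passages):
    # passages: (prefix, gold_continuation) pairs split from WritingPrompts
    # stories (prefix up to 1000 tokens, continuation up to 200 tokens).
    examples = []
    for i, (prefix, gold) in enumerate(passages):
        examples.append((prefix, gold, 1))                    # true continuation
        j = random.choice([k for k in range(len(passages)) if k != i])
        examples.append((prefix, passages[j][1], 0))          # swapped-in negative
    return examples
```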
Relevance Reranker. We train a relevance model with the same architecture as our coherence model to predict whether a continuation is relevant to the current outline point. We construct a dataset of 2000 training examples, where each example consists of a 200-token story passage from WritingPrompts and a brief summary written by GPT3-Instruct-13B. Negative examples are constructed by selecting the summary of a different passage, whether from the same story or a different one.
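Example construction follows the same contrastive recipe as above, pairing passages with summaries instead of prefixes with continuations; summarize() below is a hypothetical wrapper around the GPT3-Instruct-13B summarization prompt.

```python
import random

def relevance_examples(passages, summarize):
    # Each positive pairs a 200-token passage with its own summary; each
    # negative swaps in the summary of a different passage.
    summaries = [summarize(p) for p in passages]
    examples = []
    for i, passage in enumerate(passages):
        examples.append((summaries[i], passage, 1))
        j = random.choice([k for k in range(len(passages)) if k != i])
        examples.append((summaries[j], passage, 0))
    return examples
```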
Additional Heuristics. Finally, we filter out continuations with some writing problems which are easy to detect via rule-based heuristics. For example, we check for repetition issues, e.g., repeating chunks of the structured prompt. Similarly, to maintain consistent narration, we filter out first-person continuations to enforce a consistent third-person perspective. Full details are in Appendix B.
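For illustration, the sketch below implements two such filters with simple string heuristics; the actual rules in Appendix B are more involved, so treat these as assumptions about their shape.

```python
import re

def repeats_prompt(continuation: str, prompt: str, chunk: int = 50) -> bool:
    # Flag continuations that copy any 50-character chunk of the prompt verbatim.
    return any(prompt[i:i + chunk] in continuation
               for i in range(0, max(1, len(prompt) - chunk), chunk))

def is_first_person(continuation: str) -> bool:
    # Crude surface check for first-person narration.
    return bool(re.search(r"\b(I|my|me|we|our)\b", continuation))

def passes_filters(continuation: str, prompt: str) -> bool:
    return (not repeats_prompt(continuation, prompt)
            and not is_first_person(continuation))
```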
3.4 Edit Module
In contrast to the Rewrite module which reranks complete alternate continuations, the Edit module makes local edits to further refine a passage produced by careful planning, drafting, and rewriting.

Specifically, we aim to remove long-range factual inconsistencies. When a human detects a small factual discontinuity upon proofreading, he or she might simply edit the offending detail, rather than making major changes to the high-level plan or doing substantial rewriting. Our Edit module mimics this process in two steps: detecting factual inconsistencies, and correcting them.

[Figure 5: Illustration of Re3's Edit module. Starting from the Rewrite module's best continuation, we infer natural language facts about each character (e.g., “Peyton Turner is male.”; “Peyton works at a restaurant.”) and convert them to attribute-value pairs (e.g., Younger sister: Liza Turner; Gender: male; Workplace: restaurant). New values are added to the attribute dictionary, and contradictory values (here, Gender: female vs. male) are corrected via an editing instruction (“Edit so that Peyton Turner is female.”), yielding a final edited continuation with the offending pronouns fixed.]
Detecting Factual Inconsistencies. An inconsistency involves two statements. As the number of statement pairs scales quadratically with story length, naively comparing all pairs can result in a sea of false positive “contradictions” (Section 5.2). Flagging inconsistencies while avoiding false positives requires overwhelming precision.
Task Framing. To make the task more tractable, we focus on factual inconsistencies in character attributes (e.g., age, occupation, relationship to another character). At a high level, our detection system maintains a compact knowledge base in the form of Figure 5's “Attribute Dictionary” for each character. With each new story passage, we check for contradictions against only these attribute-value dictionaries instead of all previous text. The dictionaries are then updated for the new passage, and new dictionaries are created for new characters when detected as described in Section 3.2.
Thus, the core of our detection system is a high-precision information extraction procedure for obtaining attribute-value pairs for a given character from a story passage. Rather than hard-coding a fixed set of attributes, our system is inspired by Open Information Extraction (Etzioni et al., 2008), in order to capture the wide variety of possible attributes which may be salient in different stories.
Implementation Details. We begin by prompting GPT3-Instruct-175B for a numbered list of facts about the given character, shown as “Inferred Facts” in Figure 5. Each fact is fed with a few-shot prompt to GPT3-Instruct-13B to extract attribute keys. We then prompt GPT3-Instruct-13B with the fact and each attribute key to obtain complete attribute-value pairs. In steps prone to hallucination, we generate three outputs and keep only those which are repeated, or entailed by other outputs according to a BART-Large-based (Lewis et al., 2019) entailment model trained on MNLI (Williams et al., 2018). See Appendix C for complete details on information extraction, with example prompts.

Finally, we add new pairs to our dictionary, and use the entailment model to flag contradictions between new and old values for the same key.
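A sketch of these two entailment-based checks is below, using the public facebook/bart-large-mnli checkpoint as a stand-in for the paper's own BART-Large MNLI model; the label order (contradiction, neutral, entailment) is that checkpoint's convention.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_tok = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

def nli_label(premise: str, hypothesis: str) -> int:
    inputs = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return nli(**inputs).logits[0].argmax().item()  # 0=contra, 1=neutral, 2=entail

def keep_reliable(outputs):
    # Hallucination filter: keep outputs that recur verbatim or are entailed
    # by one of the other sampled outputs.
    return [o for i, o in enumerate(outputs)
            if any(o == p or nli_label(p, o) == 2
                   for j, p in enumerate(outputs) if j != i)]

def update_attribute(attr_dict: dict, key: str, value: str):
    # Flag a contradiction between the stored and new value for the same key
    # (e.g., Gender: "female" vs. "male" in Figure 5); otherwise store it.
    old = attr_dict.get(key)
    if old is not None and nli_label(f"{key}: {old}", f"{key}: {value}") == 0:
        return (key, old, value)  # contradiction, handed to the correction step
    attr_dict[key] = value
    return None
```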
Correcting Factual Inconsistencies. Once an inconsistency is detected, we frame the task of correcting it as controlled text editing. The original natural language fact (i.e., “Inferred Facts” in Figure 5) from which we extracted the contradicted attribute-value pair now becomes the basis for the “Editing Instruction” in Figure 5. This instruction is then fed along with the original continuation to the beta GPT3 Edit API.
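A sketch of the correction call is below, assuming the legacy openai client's since-deprecated Edit endpoint; the instruction template mirrors Figure 5.

```python
import openai

def correct_inconsistency(continuation: str, corrected_fact: str) -> str:
    # e.g., correct_inconsistency(passage, "Peyton Turner is female.")
    # rewrites the offending pronouns as in Figure 5.
    resp = openai.Edit.create(model="text-davinci-edit-001",
                              input=continuation,
                              instruction=f"Edit so that {corrected_fact}")
    return resp["choices"][0]["text"]
```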
4 Evaluation
Task Setup. We frame the task as generating a story given a brief initial premise. As a “story” is difficult to define in a rule-based manner, we do not impose any rule-based constraints on acceptable outputs, but will instead evaluate via several human-annotated metrics as described later.

To generate the initial premises, we prompt GPT3-Instruct-175B with high temperature to acquire 100 diverse premises.[5] All premises and stories are in English.

[5] Combining this simple premise generation scheme with Re3 yields a story generation system which operates fully from scratch, with no input premise required.
Method Instantiation. For fair comparison, it is desirable for the concrete implementation (henceforth RE3) of our Re3 framework to output stories of consistent length. While Re3 is capable of generating shorter or longer stories (see e.g., our 7500-word example in Appendix M), here we aim for roughly 3000 tokens (2000-2500 words).[6] Thus we re-sample the initial outlines (Section 3.1) until they contain exactly three points, and generate exactly four 256-token continuations for each outline point.

[6] See Appendix F for analysis on how story length may impact quality.
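Tying the earlier sketches together, the control flow of this instantiation might look roughly as follows; the candidate count, the summarize() helper, and the reranker objects are assumptions carried over from the previous sketches rather than RE3's exact configuration.

```python
def generate_story(premise: str, coherence_model, relevance_model, summarize,
                   n_candidates: int = 10) -> str:
    p = plan(premise)
    while len(p["outline"]) != 3:          # resample until exactly three points
        p = plan(premise)
    passages = []
    for i, point in enumerate(p["outline"]):
        for _ in range(4):                 # four 256-token continuations per point
            prompt = build_draft_prompt(
                relevant_context=select_relevant_context(
                    [p["premise"], p["setting"], p["characters"]],
                    passages[-1] if passages else p["premise"]),
                previous_outline_points=p["outline"][:i],
                recent_summary=summarize("".join(passages[-2:])),
                current_outline_point=point,
                preceding_passage=passages[-1] if passages else "")
            candidates = [generate(prompt, max_tokens=256)
                          for _ in range(n_candidates)]
            best = rewrite(candidates, "".join(passages), point,
                           coherence_model, relevance_model,
                           filters=(lambda c: passes_filters(c, prompt),))
            passages.append(best)          # the Edit module post-edits here
    return "".join(passages)
```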