
Rather, he or she may (a) create a detailed plan,
then (b) draft each next passage of the document
according to that plan. He or she may then revise
by (c) rewriting passages entirely, and/or (d) post-
editing for finer details.
Motivated by this observation, we propose the Recursive Reprompting and Revision framework (Re³, Figure 1) to generate longer stories. While based on the human writing process, Re³ is a fully automatic system with no human intervention, unlike prior approaches which model the human writing process with a human in the loop (Goldfarb-Tarrant et al., 2019; Coenen et al., 2021; Lee et al., 2022). First, (a) Re³’s Plan module generates a plan by prompting GPT3 (Brown et al., 2020) to augment a given premise with a setting, characters, and outline. (b) Re³’s Draft module then generates each next story continuation by recursively reprompting GPT3 using a strategically crafted prompt, in a procedure which can be viewed as a generalization of chain-of-thought prompting (Kojima et al., 2022). Specifically, our prompt is dynamically reconstructed at each step by selectively manifesting contextually relevant information from the initial plan (itself generated by prompting) and the story thus far. We then divide the revision process into (c) a Rewrite module, which emulates a full rewrite by reranking alternate continuations, and (d) an Edit module, which makes smaller local edits to improve factual consistency with previous passages.
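To illustrate how the four modules fit together, the following is a minimal sketch of the overall loop. It is not the system's actual implementation: the GPT3 call and the selection, reranking, and editing helpers (gpt3_complete, select_relevant, score_coherence, score_relevance, edit_for_consistency) are hypothetical placeholders.

def gpt3_complete(prompt: str) -> str:
    # Hypothetical placeholder for a GPT3 completion call.
    return f"[model continuation for a {len(prompt)}-character prompt]"

def select_relevant(plan_text: str, story: list) -> str:
    # Placeholder: select the plan entries most relevant to the story so far.
    return plan_text

def score_coherence(story: list, candidate: str) -> float:
    # Placeholder coherence reranker.
    return 0.0

def score_relevance(premise: str, candidate: str) -> float:
    # Placeholder premise-relevance reranker.
    return 0.0

def edit_for_consistency(story: list, passage: str) -> str:
    # Placeholder: detect and patch factual inconsistencies with prior passages.
    return passage

def generate_story(premise: str, n_passages: int = 10, n_candidates: int = 4) -> str:
    # (a) Plan: prompt GPT3 to augment the premise with setting, characters, outline.
    plan = {
        "setting": gpt3_complete(f"Premise: {premise}\nDescribe the setting:"),
        "characters": gpt3_complete(f"Premise: {premise}\nList the main characters:"),
        "outline": gpt3_complete(f"Premise: {premise}\nWrite a brief plot outline:"),
    }
    story = []
    for _ in range(n_passages):
        # (b) Draft: rebuild the prompt at every step from contextually relevant
        # parts of the plan plus recent story text (recursive reprompting).
        prompt = "\n".join([
            f"Premise: {premise}",
            f"Setting: {plan['setting']}",
            f"Characters: {select_relevant(plan['characters'], story)}",
            f"Outline point: {select_relevant(plan['outline'], story)}",
            "Story so far: " + " ".join(story[-3:]),
            "Continue the story:",
        ])
        candidates = [gpt3_complete(prompt) for _ in range(n_candidates)]
        # (c) Rewrite: emulate a full rewrite by reranking alternate continuations
        # on coherence and premise relevance.
        best = max(candidates,
                   key=lambda c: score_coherence(story, c) + score_relevance(premise, c))
        # (d) Edit: make small local edits for factual consistency, then append.
        story.append(edit_for_consistency(story, best))
    return " ".join(story)

The key point of the sketch is that the draft prompt is rebuilt from the plan and the recent story at every step, rather than relying on a single fixed prompt or a rolling window over prior text.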
As an additional contribution, our Plan and Draft
modules are fully zero-shot rather than trained on
existing story datasets. Thus not only does Re³ generate stories an order of magnitude longer than
those of prior work, but it is not limited to any
particular training domain.
To evaluate Re³ for longer story generation, we compare its generated stories to similar-length stories from two GPT3-based “rolling-window” baselines (Section 4). In pairwise comparisons, human evaluators rated stories from Re³ as significantly and substantially more coherent in overarching plot (up to 14% absolute increase in the fraction deemed coherent), as well as relevant to the initial premise (up to 20%). In fact, evaluators predicted up to 83% of stories written by Re³ to be written by humans. The results indicate that Re³ can be highly effective at improving long-range coherence and premise relevance in longer story generation.²

² All code and data available at https://github.com/yangkevin2/emnlp22-re3-story-generation.
2 Related Work
Automatic Story Generation. Several previous works have modeled parts of our proposed writing process, usually one part at a time.
Most similar to our Plan module are approaches using an outline or structured schema to maintain plot coherence (Li et al., 2013; Fan et al., 2018; Yao et al., 2019; Goldfarb-Tarrant et al., 2020; Rashkin et al., 2020; Tian and Peng, 2022). Other methods for high-level planning include latent variables (Miao and Blunsom, 2016; Wang and Wan, 2019; Wang et al., 2022), coarse-to-fine slot-filling (Fan et al., 2019), and keywords and/or control codes (Peng et al., 2018; Ippolito et al., 2019; Xu et al., 2020; Lin and Riedl, 2021).
Meanwhile, our Rewrite module uses rerankers
similar to Guan et al. (2020) and Wang et al. (2020),
although we model both coherence and premise
relevance. Yu et al. (2020) iteratively edit and improve the output like our Edit module, but we additionally detect when edits are required.
We emphasize again the length of stories we aim to generate. In prior studies, out-of-the-box language models struggled to generate even very short stories (Holtzman et al., 2019; See et al., 2019). Although there exist datasets of relatively longer stories, such as WritingPrompts (Fan et al., 2018) and STORIUM (Akoury et al., 2020), many works still only focus on stories of about five sentences (Wang and Wan, 2019; Yao et al., 2019; Qin et al., 2019; Wang et al., 2022), even when using language models with hundreds of billions of parameters (Xu et al., 2020). Some challenges of generating longer stories are apparent in Wang et al. (2022): their method generates high-quality few-sentence stories, but their forced long text generations, while judged better than baselines’, remain confusing and repetitive. Moreover, maintaining long-range plot coherence, premise relevance, and factual consistency is substantially harder over multiple-thousand-word horizons.
Human-In-The-Loop Story Generation. In contrast to fully automatic approaches like Re³, several recent works have proposed human-interactive methods to maintain quality in longer stories (Coenen et al., 2021; Lee et al., 2022; Chung et al., 2022). Such works commonly combine both planning and revision systems (Goldfarb-Tarrant et al., 2019; Coenen et al., 2021). In principle, Re³ is also highly controllable via human interaction, as both our planning and revision systems operate nearly