
prepended to the title, and the model is trained to
generate the text plan, as shown in Figure 1.
Surface Realization.
The surface realization task teaches the model to properly reflect the text plan in the final target. We concatenate the task prompt (e.g., “Conduct surface realization”), the title, and the corresponding plan as the input sequence, which is consumed by the model to generate the final target.
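As an illustration, the following is a minimal sketch of how such an input sequence could be assembled; the separator token and the function name are our assumptions and are not specified in the paper.

```python
def build_surface_realization_input(title: str, plan: str) -> str:
    # Concatenate the task prompt, the title, and the text plan into one
    # source sequence for the encoder; "<sep>" is an assumed separator.
    return " <sep> ".join(["Conduct surface realization", title, plan])
```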
2.3 Reviewing Task
We propose two reviewing (Review.) tasks that leverage negative samples to help the model better distinguish coherent outputs from distractors and learn to revise flawed outputs.
Revise Task.
The revise task aims to empower the model to edit flawed outputs (Wang et al., 2018). For each sample, we construct two flawed negatives: (1) randomly shuffling the target sentences, to encourage the model to learn correct sentence ordering, and (2) replacing the keyphrases in the target with random keyphrases, to promote better content organization. The model takes as input the task prompt (“Revising the Output”), the title, and the flawed output, and recovers the original target.
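For concreteness, a rough sketch of how the two flawed negatives could be constructed is shown below; the sentence splitting, the keyphrase pool, and the helper name are our assumptions.

```python
import random

def make_flawed_negatives(target_sents, target_keyphrases, keyphrase_pool):
    # (1) Shuffle the target sentences to corrupt the original ordering.
    shuffled = target_sents[:]
    random.shuffle(shuffled)
    order_negative = " ".join(shuffled)

    # (2) Replace keyphrases in the target with randomly sampled ones
    #     to corrupt the content organization.
    content_negative = " ".join(target_sents)
    for kp in target_keyphrases:
        content_negative = content_negative.replace(kp, random.choice(keyphrase_pool))

    return order_negative, content_negative
```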
Distinguishing Task.
This task requires the model to distinguish the original output from distracted ones given an input. The distracted targets are constructed with the same strategies as in the revise task. Similar to Zhou et al. (2020), the input sequence is the concatenation of the task prompt (e.g., “Which Option is Better”), the title, and an output, which is the original target with 50% probability and a distracted one otherwise. The model is trained to predict whether the output is correct by generating “positive” or “negative”. By doing so, we expect the model to prefer coherent targets and learn to generate better outputs.
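A minimal sketch of how a distinguishing sample could be built is given below, assuming a simple string separator and textual labels; these details are our assumptions.

```python
import random

def make_distinguishing_sample(title, target, distractor):
    # With 50% probability the output slot holds the original target
    # ("positive"); otherwise it holds a distracted one ("negative").
    if random.random() < 0.5:
        output, label = target, "positive"
    else:
        output, label = distractor, "negative"
    source = " <sep> ".join(["Which Option is Better", title, output])
    return source, label  # the label is generated as text by the model
```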
2.4 Joint Training with Multi-tasks
We jointly train the aforementioned objectives with shared parameters to reinforce the writing ability. Specifically, given a source-target pair (x, y), we first construct two decomposed generation samples for the text planning and surface realization tasks, respectively. We then construct two flawed samples for the revise task. Finally, for the distinguishing task, we choose the output to be the positive target with 50% probability or a distracted negative target otherwise. All objectives are converted into text-to-text transfer tasks and jointly trained to maximize the likelihood:
$\mathcal{L} = \mathcal{L}_{\text{Gen.}} + \mathcal{L}_{\text{Decomp.}} + \mathcal{L}_{\text{Review.}}$.
            Reddit/CMV   Wikiplots   NYTimes
 # Train        42,462      95,571   103,579
 # Dev           6,480       5,328     5,000
 # Test          7,562       5,404     5,000
 # Words         116.3       425.4     218.2
 # Sent.           5.5        18.0       9.1

Table 1: Statistics of the datasets. # Words denotes the average number of words in the target, and # Sent. represents the average number of sentences.
During inference, we use the end-to-end generation
task to produce final outputs.
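To make the joint objective concrete, the following is a minimal sketch of one training step with a shared T5-base model using the Hugging Face Transformers API; the batching of task samples and the simple unweighted sum of losses are our assumptions.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def task_loss(source: str, target: str) -> torch.Tensor:
    # Every objective is cast as a text-to-text task, so each sample
    # contributes a standard sequence-to-sequence cross-entropy loss.
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    return model(**enc, labels=labels).loss

def joint_step(samples):
    # samples: (source, target) pairs covering the end-to-end generation,
    # planning, realization, revise, and distinguishing tasks.
    loss = sum(task_loss(src, tgt) for src, tgt in samples)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```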
3 Experimental Setting
3.1 Datasets
We evaluate our model on three datasets from distinct domains: (1) Reddit/ChangeMyView (Reddit/CMV) for argument generation (Hua and Wang, 2020), (2) Wikiplots for story generation, and (3) New York Times for news article writing (Sandhaus, 2008). Following previous work (Rashkin et al., 2020), we further include topical keyphrases as the guidance outline: noun and verb phrases that contain at least one topic signature word (Lin and Hovy, 2000) are extracted from the targets. The title and keyphrases are concatenated as the input x. The statistics are in Table 1, and more details are in Appendix A.1.
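As a rough illustration of the keyphrase extraction step, the sketch below keeps noun chunks and verbs from the target that contain a topic signature word; the use of spaCy and the reduction of verb phrases to single verbs are our simplifying assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_keyphrases(target: str, topic_signatures: set) -> list:
    # Keep noun and verb phrases that contain at least one topic signature
    # word; the signature set itself is computed separately, e.g., with the
    # log-likelihood ratio test of Lin and Hovy (2000).
    doc = nlp(target)
    candidates = [chunk.text for chunk in doc.noun_chunks]
    candidates += [token.text for token in doc if token.pos_ == "VERB"]
    return [c for c in candidates
            if any(w in topic_signatures for w in c.lower().split())]
```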
3.2 Model Details
We use T5-base (Raffel et al., 2019) in all experiments. During training, we optimize our model with AdamW (Loshchilov and Hutter, 2017) and a learning rate of 5e-5. For decoding, we apply nucleus sampling (Holtzman et al., 2019) with k set to 10 and p set to 0.9. The maximum number of generation steps is 200 for argument generation, 512 for story generation, and 350 for NYT article generation.
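For reference, the decoding setup above can be reproduced with the Hugging Face generate API roughly as follows; the helper function and its defaults are our assumptions.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def generate(source: str, max_steps: int = 200) -> str:
    # Nucleus sampling with top-k = 10 and top-p = 0.9; max_steps is
    # 200 / 512 / 350 depending on the dataset.
    input_ids = tokenizer(source, return_tensors="pt").input_ids
    output_ids = model.generate(
        input_ids,
        do_sample=True,
        top_k=10,
        top_p=0.9,
        max_length=max_steps,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```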
Baselines.
We first consider generation models without multitask training, including GPT2 (Brown et al., 2020) and T5 (Raffel et al., 2019). We also include strong planning-based methods: (1) CONTENTPLAN, a two-step generation model (Goldfarb-Tarrant et al., 2020; Hua and Wang, 2020), where a planner first produces ordered keyphrase plans and a generator consumes the plans to generate the final outputs; (2) BOWPLAN (Kang and Hovy, 2020), which predicts keywords as the global plan to guide generation. All models are implemented with T5-base except for GPT2. More details are in Appendix A.