
Approach              | ROUGE-1 | ROUGE-2 | ROUGE-L | BARTScore
PS ONLY               | 63.57   | 40.74   | 62.04   | -2.36
FORMALITY + PS        | 62.00   | 39.14   | 60.38   | -2.37
FORMALITY ONLY        | 51.25   | 22.12   | 49.96   | -2.57
RULES-BASED HEURISTIC | 61.77   | 35.93   | 55.34   | -2.80
HEURISTIC + FORMALITY | 56.98   | 31.91   | 55.72   | -2.59

Table 3: Scores for each of the perspective shift models.
By contrast, the conversation-level model has the clear advantage of referencing the entire conversation at generation time. However, the model is not required to produce the same number of lines as the input and must learn this property during training. We conjecture that this is the reason for its weak performance relative to the left and right context model. Additionally, if the model generates more or fewer lines than the input dialogue, this can be a conflating factor in the extractive summarization example we discuss in Section 5. If the model generates fewer lines than the input, it has performed some part of the summarization process by abstracting the input into a shorter output; if it generates more lines than the input, it has produced a harder problem for the extractive summarization system by creating more lines from which to choose the summary. Because of this model's weaker performance and this conflating factor, we restrict our remaining experiments in this paper to models that perspective shift one utterance at a time.
The model with left context only mimics how a human might read the conversation for the first time, from top to bottom. This choice of model also imposes the constraint that the output has the same number of lines as the input, as desired. However, the dialogues frequently contain cataphora, especially at the start of a conversation, where the first speaker may be addressing a second speaker who has not yet spoken. For instance, the utterance “Hannah: Hey, do you have Betty’s number?” opens its dialogue; a model with only left context cannot resolve the word “you” here any better than the no context model.
The left and right context model addresses this concern by providing the full conversation as input but restricting the output generation to a perspective shift for a single (marked) utterance. This imposes the output length constraint without sacrificing contextual information. This model performs best on all four metrics. As the scores for the left and right context and no context models are relatively close, we conduct a human evaluation comparing these two cases. In our blind comparison of 22 conversations, the left and right context model was preferred over the no context model 86% of the time (2 annotators, Cohen’s kappa 0.62).
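As a point of reference for how such agreement statistics are computed, the sketch below uses scikit-learn's cohen_kappa_score; the per-conversation judgments shown are placeholder values for illustration only, not the annotations collected in this study.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder judgments, one entry per conversation (not the real annotation data):
# 1 = left and right context output preferred, 0 = no context output preferred.
annotator_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
annotator_b = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Preference rate: fraction of all judgments favouring the left and right context model.
preference_rate = sum(annotator_a + annotator_b) / (len(annotator_a) + len(annotator_b))
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"preference rate: {preference_rate:.2f}, Cohen's kappa: {kappa:.2f}")
```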
The conversation-level model may be a good choice for applications where output length is less important to the downstream task. This
model has a higher degree of abstractiveness, which
can lead to increased fluency but also increased
hallucination. For tasks where this is a concern, the
left and right context model achieves reasonable
fluency while adhering more closely to the task, as
measured by the automatic metrics.
4.2 Formality and Perspective Shift Approaches
We observe that the perspective shifting task requires a high degree of formalization. We consider several models ranging from simple rule-based approaches to those relying on an external formalization dataset in order to better understand the role of formalization in perspective shifting. The external dataset we consider is the Grammarly Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018): a dataset of approximately 100,000 lines from Yahoo Answers and formal rephrasings of each line.
Our core method is the BART model trained
under the left and right context formulation (PS
ONLY).
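For concreteness, a minimal sketch of how an input could be serialized under this left and right context formulation is shown below. The marker token, the newline separator, and the second utterance of the toy dialogue are illustrative assumptions rather than the exact format used by the model.

```python
def build_marked_input(utterances, target_index, marker="<utt>"):
    """Serialize a full dialogue, wrapping the single utterance to be
    perspective shifted in marker tokens. The marker string and the
    newline separator are illustrative assumptions."""
    lines = []
    for i, (speaker, text) in enumerate(utterances):
        line = f"{speaker}: {text}"
        if i == target_index:
            line = f"{marker} {line} {marker}"
        lines.append(line)
    return "\n".join(lines)

# Toy dialogue; the second utterance is invented for illustration.
dialogue = [("Hannah", "Hey, do you have Betty's number?"),
            ("Amanda", "Sure, give me a minute.")]
print(build_marked_input(dialogue, target_index=0))
```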
We also consider a heuristic baseline (RULES-
BASED HEURISTIC). For each message, we
prepend the speaker’s name and the word “says” to
the utterance. We replace each instance of the pro-
noun “I” in the message with the speaker’s name.
After observing that most messages are not well-
punctuated, we also append a period to the end of
each utterance. While this heuristic is simple and
ignores many pronoun resolution conflicts, it has
the clear advantage of being highly efficient.
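A minimal sketch of this heuristic in Python follows; how contractions (e.g., “I’m”) and already-punctuated messages are handled is an interpretation, since only the core rules are described above.

```python
import re

def heuristic_shift(speaker, utterance):
    """Rules-based heuristic: replace standalone "I" with the speaker's
    name, prepend "<speaker> says", and terminate with a period.
    Contractions such as "I'm" and pronouns other than "I" are left
    untouched, matching the simplicity of the rule set described above."""
    shifted = re.sub(r"\bI\b(?!')", speaker, utterance)
    shifted = f"{speaker} says {shifted}"
    # Assumption: only append a period when the message lacks terminal punctuation.
    if not shifted.endswith((".", "!", "?")):
        shifted += "."
    return shifted

print(heuristic_shift("Hannah", "Hey, do you have Betty's number?"))
# Hannah says Hey, do you have Betty's number?
print(heuristic_shift("Amanda", "I don't have it"))
# Amanda says Amanda don't have it.
```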
We incorporate the GYAFC corpus as part of