
of auxiliary verbs (such as did, does, would) in ellipsis constructions. We find that although models often favor an auxiliary verb that targets the main clause, they also make frequent errors, and they very rarely favor both of the auxiliary forms that align with the prior context. These results furthermore raise the important possibility that models are highly sensitive to preferences for particular auxiliary verb types, and that this could drive the at-issueness results as well. With this in mind we revisit the at-issueness experiments, and find that, indeed, there are substantial differences in models’ preferences depending on the identity of the particular verb that targets the relevant content.
Overall, our results suggest that PLMs have non-trivial gaps in their understanding of response dynamics in dialogue. The results also reveal certain differences between models: BERT and RoBERTa show a strong bias toward selecting responses that target the most recent and/or main clause content, while other models rely more on properties of individual auxiliary verbs. In all cases, the results indicate that these PLMs have not yet achieved ideal sensitivity to response dynamics involving at-issueness and ellipsis, and that their effectiveness in dialogue will benefit from additional training approaches. We make all datasets and code available for further testing (https://github.com/sangheek16/dialogue-response-dynamics).
2 Related work
Recent years have seen extensive work on analysis of PLMs. Methodologically, some of the most popular analysis paradigms targeting model embeddings have included classification-based probing (e.g., Kim et al., 2019; Zhang et al., 2019) and correlation with similarity judgments (Finkelstein et al., 2001; Gerz et al., 2016; Conneau and Kiela, 2018). Other work has analyzed PLMs by eliciting and analyzing output predictions (Linzen et al., 2016; Goldberg, 2019). Our work focuses primarily on the latter methodology, examining and comparing model output probabilities; however, our analysis in Section 5.4 uses classification-based probing. We also build on approaches that implement specialized sentence generation systems to produce large annotated datasets (Ettinger et al., 2018; McCoy et al., 2019).
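To make the output-probability paradigm concrete, the sketch below scores candidate auxiliary verbs at a masked position in a dialogue response. It is a minimal illustration, not the paper’s released code: the model choice (bert-base-uncased), the example item, and the candidate set are all placeholder assumptions.

```python
# Minimal sketch of the output-probability paradigm: compare a masked
# LM's probabilities for candidate auxiliary verbs at a masked position
# in a dialogue response. The model and example item are illustrative
# assumptions, not the paper's actual stimuli.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical item: the response auxiliary reveals which clause it
# targets -- "did" picks up the relative clause ("quit her job"),
# while "is" picks up the main clause ("is mad").
dialogue = "Sue, who quit her job, is mad. Yes, she [MASK]."
candidates = ["did", "is", "does", "would"]  # single-token auxiliaries

inputs = tokenizer(dialogue, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the vocabulary at the masked position.
probs = torch.softmax(logits[0, mask_index], dim=-1)
for aux in candidates:
    aux_id = tokenizer.convert_tokens_to_ids(aux)
    print(f"P({aux}) = {probs[aux_id].item():.4f}")
```

Comparing, e.g., P(did) against P(is) indicates which clause the model prefers the response to target; restricting candidates to single wordpieces keeps the comparison straightforward under a single-mask setup.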
Analyses of PLMs have targeted a variety of types of linguistic competence. In particular, a large body of work has studied the extent to which PLMs capture syntactic and semantic information (Linzen et al., 2016; Peters et al., 2018; Bacon and Regier, 2019; Hewitt and Manning, 2019; Tenney et al., 2019). Less work has addressed the extent to which PLMs show sensitivity to pragmatic and discourse information, which is the focus of this paper. Kurfalı and Östling (2021) study multilingual models on various discourse tasks via zero-shot learning. Pandia et al. (2021) investigate LMs’ pragmatic competence in predicting discourse connectives. Pitler and Nenkova (2009) report that a supervised classifier can identify discourse relations given syntactic features along with connectives. Patterson and Kehler (2013) implement a similar idea and show that classifiers can predict the presence of a connective from shallow linguistic cues. Koto et al. (2021) explore pre-trained language models’ ability to capture discourse-level relations. We complement this existing work by branching into new areas of pragmatic and discourse knowledge, examining models’ sensitivity to dialogue response dynamics.
Another closely related literature uses PLMs, especially transformer LMs, to build dialogue systems directly. Le et al. (2019) propose Multimodal Transformer Networks (MTN) for visually grounded dialogue tasks. Other work investigates topic-driven language models for emotion detection in dialogues (Zhu et al., 2021). Oluwatobi and Mueller (2020) report state-of-the-art performance on dialogue generation using transformer-based models. There are also language models designed for and trained on dialogue or conversation, such as TransferTransfo (Wolf et al., 2019), PLATO (Bao et al., 2020), ConveRT (Henderson et al., 2020), TOD-BERT (Wu et al., 2020), DialoGPT (Zhang et al., 2020), DialogBERT (Gu et al., 2021), and LaMDA (Thoppilan et al., 2022).
Here we focus on clarifying the extent to which PLMs pre-trained in the standard paradigm can develop knowledge of dialogue dynamics prior to any specialized dialogue training. This line of inquiry serves to broaden our general understanding of the linguistic competence of standard PLMs, and also has implications for the use of these standard PLMs as a foundation for further dialogue-specific training.
3 Background
3.1 At-issueness
Our analyses focus on the dynamics that govern
responses in dialogue, and aspects of prior utter-