ing, SelfSuper designs a self-supervised task that guesses who a randomly masked speaker is according to the dialogue context (e.g., masking “Monica Geller” of #10). To reduce noise, SelfSuper designs another task that predicts whether an utterance contains the answer. Although decent performance can be achieved, several pressing problems still exist.
Firstly, speaker guessing is not aware of the speaker information in questions or of the interlocutor scope. As random masking is independent of the question, it cannot tell which speaker in the dialogue corresponds to the speaker mentioned in the question, e.g., Joey Tribbiani to Joey in Q1 of Fig. 1. As for the interlocutor scope, we define it as the utterances said by the corresponding speaker. We point out that utterances have a speaker-centric nature: First, each utterance has target listeners. For example, in Utter. #10 of Fig. 1, one needs to understand that Joey is a listener, so “you had the night” is making fun of Joey from Monica’s scope. Second, an utterance conveys the experience of its speaker. For example, to answer Q1 in Fig. 1, one needs to understand that “stayed up all night talking” is the experience appearing in Joey’s scope. Because it ignores the question-mentioned interlocutor and its scope, SelfSuper provides a wrong answer.
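To make the notion of interlocutor scope concrete, the sketch below simply groups utterance indices by speaker; the (speaker, text) dialogue format and the example names are illustrative assumptions rather than the datasets' actual format.

```python
from collections import defaultdict

# Hypothetical dialogue format: a list of (speaker, utterance_text) pairs.
dialogue = [
    ("Monica Geller", "..."),
    ("Joey Tribbiani", "..."),
    ("Monica Geller", "..."),
]

def interlocutor_scopes(dialogue):
    """Map each speaker to the indices of the utterances they said,
    i.e., that speaker's interlocutor scope as defined above."""
    scopes = defaultdict(list)
    for idx, (speaker, _) in enumerate(dialogue):
        scopes[speaker].append(idx)
    return dict(scopes)

# e.g., {"Monica Geller": [0, 2], "Joey Tribbiani": [1]}
```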
Secondly, answer-containing utterance (denoted as key utterance by SelfSuper) prediction prefers utterances similar to the question and fails to find key utterances that are not similar to the question. The reason is that answers are likely to appear in utterances similar to the question. For example, about 77% of questions have answers in the top-5 utterances most similar to the question according to SimCSE (Gao, Yao, and Chen 2021) in the dev set of FriendsQA (Yang and Choi 2019). Furthermore, the utterances extracted by the key utterance prediction overlap with these top-5 utterances by over 82%. Therefore, a considerable number of key utterances are ignored, leading to excessive attention to similar utterances, e.g., Q2 in Fig. 1. In fact, answer-containing utterances are likely to appear as the similar utterances or near them, because contiguous utterances in a local context tend to concern a topic relevant to the question. However, single-utterance prediction cannot capture this observation.
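Statistics of this kind can be estimated with off-the-shelf sentence embeddings. The sketch below is a minimal illustration under assumptions: it embeds the question and the utterances with a SimCSE checkpoint (princeton-nlp/sup-simcse-roberta-base is assumed here, not necessarily the variant behind the reported numbers) and checks whether the gold answer string occurs in the top-5 most similar utterances.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed SimCSE checkpoint; chosen only for this illustration.
MODEL_NAME = "princeton-nlp/sup-simcse-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Return L2-normalized [CLS]-pooled sentence embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=-1)

def answer_in_top_k(question, utterances, answer, k=5):
    """Check whether the answer string occurs in one of the k utterances
    most similar to the question (cosine similarity of embeddings)."""
    q = embed([question])                 # (1, d)
    u = embed(utterances)                 # (N, d)
    sims = (u @ q.T).squeeze(-1)          # cosine similarity (embeddings normalized)
    top = sims.topk(min(k, len(utterances))).indices.tolist()
    return any(answer in utterances[i] for i in top)
```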
To address the aforementioned problems, so that more answer-containing utterances can be found and the answering process is aware of the question and the interlocutor scopes, we propose a new pipeline framework for DRC. We first propose a new key utterance extraction method. The method slides a window through the dialogue, and the contiguous utterances in the window are regarded as a unit. Predictions are made on these units: once a unit contains the answer, all utterances in it are selected as key utterances.
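A minimal sketch of this unit-based selection is given below. The window size and the string-containment test are illustrative assumptions: the actual pipeline makes learned predictions on the units, and containment here stands in for that prediction (as when constructing training labels).

```python
def build_units(utterances, window_size=3):
    """Slide a window over the dialogue: each group of contiguous
    utterances inside the window forms one prediction unit
    (window_size is an illustrative assumption)."""
    n = len(utterances)
    return [list(range(i, min(i + window_size, n))) for i in range(n)]

def key_utterances(utterances, answer, window_size=3):
    """If a unit contains the answer span, every utterance in that unit
    is selected as a key utterance."""
    keys = set()
    for unit in build_units(utterances, window_size):
        unit_text = " ".join(utterances[i] for i in unit)
        if answer in unit_text:
            keys.update(unit)
    return sorted(keys)
```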
Based on these extracted utterances, we then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling. Instead of treating utterances as a plain sequence, QuISG constructs a graph structure over the contextualized text embeddings. The question and the speaker names mentioned in the question are explicitly present in QuISG as nodes. Each question-mentioned speaker is then connected with its corresponding speaker node in the dialogue. Furthermore, to remind the model of interlocutor scopes, QuISG connects every speaker node in the dialogue with the words from that speaker's scope.
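To illustrate the node and edge types just described, the following sketch builds QuISG as a plain adjacency structure with a question node, question-mentioned speaker nodes, dialogue speaker nodes, and word nodes. The string-based node naming, the whitespace word segmentation, and the exact edge set are simplifications; the actual graph is built over contextualized token embeddings, and detecting speaker names in the question is assumed to be done beforehand.

```python
from collections import defaultdict

def build_quisg(question_speakers, dialogue):
    """Sketch of QuISG as an undirected adjacency list.

    question_speakers: speaker names detected in the question (assumed given).
    dialogue: list of (speaker, utterance_text) pairs, e.g. the key
              utterances returned by the extraction step above.
    """
    graph = defaultdict(set)

    def connect(u, v):
        graph[u].add(v)
        graph[v].add(u)

    # Question node and nodes for speaker names mentioned in the question.
    for name in question_speakers:
        connect("question", f"q_speaker::{name}")

    for speaker, text in dialogue:
        d_speaker = f"d_speaker::{speaker}"
        # A question-mentioned speaker connects to its dialogue counterpart.
        if speaker in question_speakers:
            connect(f"q_speaker::{speaker}", d_speaker)
        # Every dialogue speaker node connects to words from its scope.
        for word in text.split():
            connect(d_speaker, f"word::{word}")
    return graph
```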
We verify our model on two representative DRC benchmarks, FriendsQA and Molweni (Li et al. 2020). Our model achieves better or competitive performance against baselines on both benchmarks, and further experiments demonstrate the efficacy of our proposed method.
Related Work
Dialogue Reading Comprehension. Unlike traditional
Machine Reading Comprehension (Rajpurkar et al. 2016),
Dialogue Reading Comprehension (DRC) aims to answer
a question according to the given dialogue. There are sev-
eral related but different types of conversational question
answering: CoQA (Reddy, Chen, and Manning 2018) con-
versationally asks questions after reading Wikipedia arti-
cles. QuAC (Choi et al. 2018) forms a dialogue of QA
between a student and a teacher about Wikipedia arti-
cles. DREAM (Sun et al. 2019) answers multiple-choice questions over dialogues from English exams. To understand the characteristics of speakers, Sang et al. (2022) propose TVShowGuess, a multiple-choice task for predicting unknown speakers in dialogues. In contrast, we focus on DRC that extracts answer spans from a dialogue for an independent question (Yang and Choi 2019). For this type of DRC, since dialogues are a special kind of text, Li and Choi (2020) propose several pre-training and downstream tasks based on the utterances in dialogues. To consider coreference and the relationships among speakers, Liu et al. (2020) introduce these two types of knowledge from other dialogue-related tasks. Besides, Li et al. (2021) and Ma, Zhang, and Zhao (2021) model the knowledge of discourse structure for dialogues. To model the complex speaker information and the noisy dialogue context, Li and Zhao (2021) propose two self-supervised tasks, i.e., masked-speaker guessing and key utterance prediction. However, existing work ignores explicit modeling of the question and speaker scopes, and suffers from low key-utterance coverage.
Dialogue Modeling with Graph Representations. In
many QA tasks (Yang et al. 2018; Talmor et al. 2019), graphs
are the main carrier for reasoning (Qiu et al. 2019; Fang et al.
2020; Yasunaga et al. 2021). As for dialogue understanding,
graphs are also widely used for various purposes. In dialogue emotion recognition, graphs are constructed to consider the interactions between different speakers (Ghosal et al. 2019; Ishiwatari et al. 2020; Shen et al. 2021). In dialogue act classification, graphs model cross-utterance and cross-task information (Qin et al. 2021). In dialogue semantic modeling, Bai et al. (2021) extend AMR (Banarescu et al. 2013) to construct graphs for dialogues. As for the DRC we focus on, the works mentioned above (Liu et al. 2020; Li et al. 2021; Ma, Zhang, and Zhao 2021) construct graphs for knowledge propagation between utterances.
Framework
Task Definition
Given a dialogue consisting of $N$ utterances $D = [utter_1, utter_2, \ldots, utter_N]$, the task aims to extract the answer span $a$ for a question $q = [qw_1, qw_2, \ldots, qw_{L_q}]$ from $D$, where $qw_i$ is the $i$-th word in $q$ and $L_q$ is the length of $q$. In $D$,