
zero-shot setting and 6.2% with full supervision. With almost no language-specific effort, our cross-lingual model outperforms state-of-the-art methods on two Chinese datasets, WP (Chen et al., 2019, 2021) and JY (Jia et al., 2020), by up to 4.7%. Compared to the baseline, our distant supervision brings an improvement of more than 40% in realistic few-shot settings. In particular, DISSI applies well across languages even without any annotation, e.g., achieving 90.6% zero-shot accuracy on P&P and 89.5% on the Chinese JY dataset.
2 Related Work
Speaker Identification.
Language-specific, expert-designed rules, patterns, and features (Elson and McKeown, 2010; He et al., 2013; Muzny et al., 2017; Ek et al., 2018) are widely used to identify speakers. Pavllo et al. (2018) aim to find and bootstrap over lexical patterns for SI, whereas we focus on using high-precision heuristics to construct distant instances. Previous cross-lingual SI studies mainly focus on direct speech identification (Kurfali and Wirén, 2020; Byszuk et al., 2020). To the best of our knowledge, this is the first work on cross-lingual SI that requires no redesign of rules, patterns, or features for a new language.
Indirect Supervision.
Studies have shown that distant or indirect supervision is effective in bridging the knowledge gaps in pre-trained language models (LMs) (Zhou et al., 2020, 2021; Khashabi et al., 2020). Yu et al. (2022) improve SI performance with self-training, although a large-scale clean dataset is required for training the teacher models.
3 English Speaker Identification
In this section, we introduce a rule-based SI system named RULESI (Rule-based SI): it receives a long document as input and outputs (context, utterance, speaker) tuples extracted from the document. RULESI can be directly applied to identify speakers in English texts in a given dataset; however, since it is not guaranteed to produce a predicted speaker for every utterance due to limited pattern coverage, we mainly use it to automatically extract incidental signals that approximate the target task from unlabeled corpora, later used as distant supervision to train our cross-lingual SI system DISSI in §5.
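For concreteness, a minimal sketch of this interface; the class and function names below are illustrative assumptions, not part of RULESI itself:

```python
from dataclasses import dataclass

@dataclass
class SIInstance:
    """One (context, utterance, speaker) tuple produced by RULESI."""
    context: str         # surrounding narrative sentences
    utterance: str       # the quoted speech itself
    speaker: str | None  # None when no heuristic fires (limited coverage)

def rule_si(document: str) -> list[SIInstance]:
    """Illustrative signature only: a long document in, SI tuples out."""
    ...
```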
3.1 Main Heuristics
RULESI extracts quoted utterances from segmented sentences by simply matching quotation marks. For each extracted utterance, we form a context from its previous three and next two sentences, and find all person characters with a named entity recognition (NER) tool in AllenNLP (Gardner et al., 2017). Within the same context, any name that is a substring of a longer name is merged into the same character.
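The following is a minimal sketch of these preprocessing steps, assuming sentence segmentation and NER have already been run; all function names are illustrative:

```python
def extract_utterances(sentences: list[str]) -> list[int]:
    """Indices of sentences wrapped in quotation marks."""
    return [i for i, s in enumerate(sentences)
            if s.startswith('"') and s.endswith('"')]

def build_context(sentences: list[str], i: int) -> str:
    """Context = the previous three and next two sentences around utterance i."""
    return " ".join(sentences[max(0, i - 3):i] + sentences[i + 1:i + 3])

def merge_characters(mentions: list[str]) -> dict[str, str]:
    """Map each PERSON mention to the longest co-occurring name containing it,
    e.g., "Elizabeth" -> "Elizabeth Bennet" within the same context."""
    return {m: max((n for n in mentions if m in n), key=len) for m in mentions}
```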
We then employ three heuristics that attempt to identify a speaker among these characters for each utterance. The first two are commonly used rules proposed by He et al. (2013) and Muzny et al. (2017), namely Direct Speaker Identification and Conversation Alternation Patterns. We follow the same implementation as Muzny et al. (2017), except that we use an SRL model from AllenNLP in place of dependency parsers. Due to space limitations, we refer readers to that work for the details of the first two rules. Briefly, the first heuristic collects a list of speech verbs (e.g., “say”) and checks whether a speech verb connects a noun phrase to a target utterance; if so, we regard the noun phrase as the speaker of that utterance.
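A sketch of this rule over predicate-argument frames in AllenNLP's SRL output format is shown below; the abridged speech-verb list and the string-matching logic are simplified assumptions:

```python
SPEECH_VERBS = {"say", "reply", "ask", "answer", "cry"}  # abridged list

def direct_speaker(srl: dict, utterance: str) -> str | None:
    """AllenNLP SRL output: {"words": [...], "verbs": [{"verb": ..., "tags": [...]}]}.
    If a speech verb's ARG1 covers the quote, return its ARG0 as the speaker."""
    words = srl["words"]
    for frame in srl["verbs"]:
        # Lemmatization is omitted for brevity; a real rule matches verb lemmas.
        if frame["verb"].lower() not in SPEECH_VERBS:
            continue
        arg0 = " ".join(w for w, t in zip(words, frame["tags"]) if t.endswith("ARG0"))
        arg1 = " ".join(w for w, t in zip(words, frame["tags"]) if t.endswith("ARG1"))
        if arg0 and utterance.strip('"') in arg1:
            return arg0  # the noun phrase connected to the speech verb
    return None
```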
The second heuristic assumes that conversations in novels follow simple speaker alternation patterns. For example, for the consecutive utterances in Table 1, once we identify that the speaker of P3 is “Elizabeth”, we assume that she is very likely also the speaker of P5. Besides these two rules, we introduce a new heuristic based on coreference to address anaphoric speakers such as “she” in P3.
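A minimal sketch of the alternation rule, under the simplifying assumption of a two-party conversation in which turns i and i + 2 return to the same speaker:

```python
def propagate_alternation(speakers: list[str | None]) -> list[str | None]:
    """speakers[i] is the predicted speaker of the i-th consecutive utterance
    (None if unknown). In a two-party exchange, turn i + 2 alternates back to
    the speaker of turn i, so known speakers are propagated forward."""
    out = list(speakers)
    for i in range(len(out) - 2):
        if out[i] is not None and out[i + 2] is None and out[i + 1] != out[i]:
            out[i + 2] = out[i]
    return out

# e.g., ["Elizabeth", None, None] -> ["Elizabeth", None, "Elizabeth"]
```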
Local Coreference Resolution with Pronouns.
Previous work (Muzny et al., 2017) uses coreference resolution (coref) only for explicit speakers. We extend the application of coref to all pronouns in the utterances, because i) any character mention that corefers with a first-person pronoun (e.g., “I” and “me”) inside the utterance reveals the speaker, and ii) mentions that corefer with second- or third-person pronouns (e.g., “you” and “she”) should be excluded from the candidate speakers. We run the AllenNLP coref model on every three-sentence window, because coref models that perform reasonably well on short literary texts often mistakenly reduce the number of clusters in lengthy texts.
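A sketch of these two coref-based signals, assuming clusters are given as lists of mention strings; the pronoun lists are abridged and the in-quote test by token overlap is a rough approximation:

```python
FIRST_PERSON = {"i", "me", "my", "myself"}
SECOND_THIRD = {"you", "your", "he", "him", "she", "her", "they", "them"}

def coref_votes(clusters: list[list[str]], utterance: str,
                candidates: set[str]) -> dict[str, int]:
    """+1 for a candidate whose cluster contains a first-person pronoun inside
    the quote; -1 if the cluster contains a second/third-person pronoun there."""
    quote_tokens = {w.strip('“”".,!?').lower() for w in utterance.split()}
    votes = {c: 0 for c in candidates}
    for cluster in clusters:
        in_quote = {m.lower() for m in cluster} & quote_tokens
        for name in candidates & set(cluster):
            if in_quote & FIRST_PERSON:
                votes[name] += 1  # the quote's "I" corefers with this name
            elif in_quote & SECOND_THIRD:
                votes[name] -= 1  # the quote addresses or mentions this name
    return votes
```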
Soft Inference.
All three above-mentioned rule-based heuristics assign speakers separately, and their predictions can conflict with one another. As there is no hierarchy among the heuristics, we employ soft assignment by letting each rule “vote for” or “vote against” a candidate. We then assign the speaker with the highest vote count to each utterance.
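A minimal sketch of the vote tally, where the net-positive threshold for accepting a speaker is an illustrative choice:

```python
from collections import Counter

def soft_inference(vote_sets: list[dict[str, int]]) -> str | None:
    """Sum the per-candidate votes cast by all three heuristics and return
    the candidate with the highest total, or None if no rule fired."""
    total = Counter()
    for votes in vote_sets:
        total.update(votes)  # Counter.update adds the mapped vote values
    if not total:
        return None  # RULESI abstains: no predicted speaker for this utterance
    speaker, score = total.most_common(1)[0]
    return speaker if score > 0 else None  # illustrative net-positive threshold
```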