Cross-Lingual Speaker Identification Using Distant Supervision
Ben Zhou1Dian Yu2Dong Yu2Dan Roth1
1University of Pennsylvania, Philadelphia, PA
2Tencent AI Lab, Bellevue, WA
{xyzhou, danroth}@seas.upenn.edu, {yudian, dyu}@tencent.com
Abstract
Speaker identification, determining which
character said each utterance in literary text,
benefits many downstream tasks. Most ex-
isting approaches use expert-defined rules or
rule-based features to directly approach this
task, but these approaches come with signifi-
cant drawbacks, such as lack of contextual rea-
soning and poor cross-lingual generalization.
In this work, we propose a speaker identifi-
cation framework that addresses these issues.
We first extract large-scale distant supervision
signals in English via general-purpose tools
and heuristics, and then apply these weakly-
labeled instances with a focus on encouraging
contextual reasoning to train a cross-lingual
language model. We show that the result-
ing model outperforms previous state-of-the-
art methods on two English speaker identifica-
tion benchmarks by up to 9% in accuracy and
5% with only distant supervision, as well as on
two Chinese speaker identification datasets by
up to 4.7%.
1 Introduction
Speaker identification (SI) aims to decide who said
or implied each quote/utterance in a document (El-
son and McKeown,2010). It is mostly studied in
the domain of literature and novels because, unlike
news, the speakers in stories are often not explicitly
specified by a name. Thus, contextual reasoning is
required for most SI tasks (Jia et al.,2020) (e.g., the
implicit speaker of paragraph 4 (P4) is “Wickham”
in Table 1). Besides, as SI datasets are usually
too small-scale to effectively train large pre-trained
language models, most previous studies boost the
mono-lingual performance by additionally design-
ing language-specific patterns and heuristics, which
require mono-lingual expertise and cannot be easily
transferred to other languages.
In this work, we address these issues with a novel
framework for cross-lingual SI without relying on
Most work done while interning at Tencent AI Lab.
[Figure 1 diagram: a large English corpus is processed by RULESI (NLP tools + rules) to extract large-scale distant supervision, which is transformed and used to train DISSI on a base LM, then evaluated on English and Chinese.]
Figure 1: Overview of our framework. RULESI extracts incidental supervision signals used to train DISSI.
ID Content
P1 The contents of this letter threw Elizabeth into a flutter of spirits ...
P2 ... she was overtaken by Wickham ...
P3 “You certainly do,” she replied with a smile ...
P4 “I should be sorry indeed, if it were. We were always good friends; and now we are better.
P5 “True. Are the others coming out?”
Table 1: An example requiring contextual reasoning from the P&P dataset (He et al., 2013).
any domain, task, or language-specific annotation
for a new language. The framework, as overviewed
in Fig. 1, starts with extracting large-scale distant
and incidental supervision (Roth,2017) from un-
structured English corpora.
We propose a rule-based system called RULESI for extraction (§3). We collect large-scale (55K) weakly-labeled English instances with RULESI to enable the training of large, advanced models, and transform them to encourage more contextual reasoning (§4).
We train the first cross-lingual SI model with the constructed data and call the resulting model DISSI (Distantly-Supervised Speaker Identification). We hypothesize that DISSI can improve cross-lingual performance because the SI task shares many language-invariant features (§5).
Experimental results¹ show that DISSI achieves state-of-the-art English performance on the P&P dataset (He et al., 2013), improving 7.0% in the zero-shot setting and 6.2% with full supervision. With almost no language-specific effort, our cross-lingual model outperforms state-of-the-art methods on two Chinese datasets, WP (Chen et al., 2019, 2021) and JY (Jia et al., 2020), by up to 4.7%. Compared to the baseline, our distant supervision brings an improvement of more than 40% in realistic few-shot settings. In particular, DISSI can be applied well across languages even without any annotation, e.g., achieving 90.6% zero-shot accuracy on P&P and 89.5% on the Chinese JY dataset.

¹ We release code and data at: https://github.com/Slash0BZ/speaker-identification

arXiv:2210.05780v1 [cs.CL] 11 Oct 2022
2 Related Work
Speaker Identification.
Language-specific expert-
designed rules, patterns, and features (Elson and
McKeown,2010;He et al.,2013;Muzny et al.,
2017;Ek et al.,2018) are widely used to identify
speakers. Pavllo et al. (2018) aim to find and boot-
strap over lexical patterns for SI, whereas we focus
on using high-precision heuristics to construct dis-
tant instances. Previous cross-lingual SI studies
mainly focus on direct speech identification (Kur-
fali and Wirén,2020;Byszuk et al.,2020). To the
best of our knowledge, this is the first work on cross-lingual SI without the need for redesigning rules, patterns, and features for a new language.
Indirect Supervision.
Studies have shown that
distant or indirect supervision is effective in bridg-
ing the knowledge gaps in pre-trained language
models (LMs) (Zhou et al.,2020,2021;Khashabi
et al., 2020). Yu et al. (2022) improve SI performance with self-training, though a large-scale clean dataset is required to train teacher models.
3 English Speaker Identification
In this section, we introduce a rule-based SI system
named RULESI (Rule-based SI): it receives a long document as input and outputs (context, utterance, speaker) tuples from the document. RULESI can be directly applied to identify speakers in English texts in a given dataset, but we mainly use it² to automatically extract incidental signals that approximate the target task from unlabeled corpora, later used as distant supervision to train our cross-lingual SI system DISSI in §5.
3.1 Main Heuristics
RULESI extracts quoted utterances from segmented sentences by simply matching quotation marks. For each extracted utterance, we form a context with its previous three and next two sentences, and find all person characters with a named entity recognition (NER) tool in AllenNLP (Gardner et al., 2017). In the same context, any name that is a substring of a longer name is merged into the same character. We then employ three heuristics to identify a speaker among the characters for each utterance. The first two are commonly used rules proposed by He et al. (2013) and Muzny et al. (2017), namely Direct Speaker Identification and Conversation Alternation Patterns. We follow the same implementation as Muzny et al. (2017), except that we use an SRL model from AllenNLP to replace dependency parsers. We refer readers to that work for details of the first two rules due to space limitations. The first heuristic collects a list of speech verbs (e.g., “say”) and uses a dependency parser to find whether a speech verb connects a noun phrase and a target utterance. If so, we regard the noun phrase as the speaker of the target utterance. The second heuristic assumes that conversations in novels follow simple speaker alternation patterns. For example, in the consecutive utterances in Table 1, once we identify that the speaker of P3 is “Elizabeth”, we assume that she is very likely to say the utterance of P5. Besides these two rules, we introduce a new heuristic based on coreference to address anaphoric speakers such as “she” in P3.

² This is because RULESI is not guaranteed to produce a predicted speaker for every utterance due to pattern coverage.
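As a rough illustration, the first two heuristics can be sketched as follows. The regex-based speech-verb matching and the verb list here are our own simplifications (the actual system uses an AllenNLP SRL model and the richer rule set of Muzny et al. (2017)), so this is an approximate sketch rather than the exact implementation:

```python
import re

# Illustrative speech-verb list; the real inventory is larger and matching
# is done with SRL rather than regexes.
SPEECH_VERBS = {"said", "replied", "asked", "cried", "answered"}

def direct_speaker(sentence, characters):
    """Heuristic 1 (sketch): look for a '<Name> <speech verb>' or
    '<speech verb> <Name>' pattern adjacent to the quote."""
    for name in characters:
        for verb in SPEECH_VERBS:
            if re.search(rf"\b{re.escape(name)}\s+{verb}\b", sentence) or \
               re.search(rf"\b{verb}\s+{re.escape(name)}\b", sentence):
                return name
    return None

def alternation_speaker(utterance_idx, known_speakers):
    """Heuristic 2 (sketch): in a two-party conversation speakers
    alternate, so utterance i shares its speaker with utterance i-2."""
    return known_speakers.get(utterance_idx - 2)
```

On the Table 1 example, `direct_speaker` resolves P3 from the “she replied” pattern only if the name is explicit, while `alternation_speaker` then propagates P3's speaker to P5.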
Local Coreference Resolution with Pronouns. Previous work (Muzny et al., 2017) uses coreference resolution (coref) only for explicit speakers. We extend the application of coref to all pronouns in the utterances, because i) any character mention that corefs with a first-person pronoun (e.g., “I” and “me”) inside the utterance reveals the speaker, and ii) those that coref with second- and third-person pronouns (e.g., “you” and “she”) should be excluded from candidate speakers. We run the AllenNLP coref model on every three-sentence window, as coref models, though performing reasonably well on short literary texts, often mistakenly reduce the number of clusters in a lengthy text.
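A minimal sketch of this pronoun heuristic, assuming coref clusters arrive as lists of lowercased mention strings (a simplification of our own; the actual system consumes AllenNLP coref output with token spans):

```python
# Pronoun sets are illustrative, not exhaustive.
FIRST_PERSON = {"i", "me", "my"}
OTHER_PERSON = {"you", "your", "he", "she", "him", "her"}

def coref_speaker(clusters, characters, utterance_tokens):
    """Return (speaker_or_None, excluded_candidates).

    A character coreferent with a first-person pronoun inside the
    utterance reveals the speaker; characters coreferent with second-
    or third-person pronouns inside the utterance are excluded."""
    utt = {t.lower() for t in utterance_tokens}
    speaker, excluded = None, set()
    for cluster in clusters:
        mentions = {m.lower() for m in cluster}
        chars_in_cluster = {c for c in characters if c.lower() in mentions}
        if mentions & FIRST_PERSON & utt and chars_in_cluster:
            # arbitrary pick if several characters share the cluster
            speaker = next(iter(chars_in_cluster))
        elif mentions & OTHER_PERSON & utt:
            excluded |= chars_in_cluster
    return speaker, excluded
```

For P4 in Table 1, a cluster linking “Wickham” to the utterance-internal “I” would reveal him as the speaker, while for P3 a cluster linking a character to “You” would rule that character out.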
Soft Inference. All three above-mentioned rule-based heuristics assign speakers separately, and they can conflict with each other. As there is no hierarchy among the heuristics, we employ soft assignments by letting each rule “vote” for or “vote against” a candidate. We assign the speaker with the highest vote count to each utterance.
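The voting step can be sketched as follows; the +1/−1 vote encoding and the tie-breaking rule (candidate order) are illustrative assumptions rather than the paper's exact scheme:

```python
from collections import Counter

def soft_inference(candidates, heuristic_votes):
    """heuristic_votes: (candidate, vote) pairs, one per heuristic that
    fired; +1 votes for a candidate, -1 votes against. The candidate
    with the highest total vote count is assigned as the speaker."""
    tally = Counter({c: 0 for c in candidates})
    for cand, vote in heuristic_votes:
        if cand in tally:
            tally[cand] += vote
    # ties resolve to the earlier candidate, an arbitrary choice here
    return max(candidates, key=lambda c: tally[c])
```

For instance, if the alternation pattern votes for “Elizabeth” but the coref heuristic votes against her and for “Wickham”, the net tally selects “Wickham”.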