
zero-shot setting and 6.2% with full supervision. With almost no language-specific effort, our cross-lingual model outperforms state-of-the-art methods on two Chinese datasets, WP (Chen et al., 2019, 2021) and JY (Jia et al., 2020), by up to 4.7%. Compared to the baseline, our distant supervision brings an improvement of more than 40% in realistic few-shot settings. In particular, DISSI applies well across languages even without any annotation, e.g., achieving 90.6% zero-shot accuracy on P&P and 89.5% on the Chinese JY dataset.
2 Related Work
Speaker Identification.
Language-specific, expert-designed rules, patterns, and features (Elson and McKeown, 2010; He et al., 2013; Muzny et al., 2017; Ek et al., 2018) are widely used to identify speakers. Pavllo et al. (2018) aim to find and bootstrap over lexical patterns for SI, whereas we focus on using high-precision heuristics to construct distant instances. Previous cross-lingual SI studies mainly focus on direct speech identification (Kurfali and Wirén, 2020; Byszuk et al., 2020). To the best of our knowledge, this is the first work on cross-lingual SI that requires no redesign of rules, patterns, or features for a new language.
Indirect Supervision.
Studies have shown that distant or indirect supervision is effective in bridging the knowledge gaps in pre-trained language models (LMs) (Zhou et al., 2020, 2021; Khashabi et al., 2020). Yu et al. (2022) improve SI performance with self-training, although a large-scale clean dataset is required for training the teacher models.
3 English Speaker Identification
In this section, we introduce a rule-based SI system named RULESI (Rule-based SI): it receives a long document as input and outputs (context, utterance, speaker) tuples extracted from the document. RULESI can be directly applied to identify speakers in English texts in a given dataset; however, since it is not guaranteed to produce a predicted speaker for every utterance due to limited pattern coverage, we mainly use it to automatically extract incidental signals that approximate the target task from unlabeled corpora, later used as distant supervision to train our cross-lingual SI system DISSI in §5.
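For concreteness, a minimal sketch of this interface; the class and function names below are illustrative assumptions, not part of RULESI itself:

```python
from dataclasses import dataclass

@dataclass
class SIInstance:
    """One (context, utterance, speaker) tuple produced by RULESI."""
    context: str         # surrounding narrative sentences
    utterance: str       # the quoted speech itself
    speaker: str | None  # None when no heuristic fires (limited coverage)

def rule_si(document: str) -> list[SIInstance]:
    """Illustrative signature only: a long document in, SI tuples out."""
    ...
```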
3.1 Main Heuristics
RULESI extracts quoted utterances from segmented sentences by simply matching quotation marks. For each extracted utterance, we form a context from its previous three and next two sentences, and find all person characters with a named entity recognition (NER) tool in AllenNLP (Gardner et al., 2017). Within the same context, any name that is a substring of a longer name is merged into the same character.
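The following is a minimal sketch of these preprocessing steps, assuming sentence segmentation and NER have already been run; all function names are illustrative:

```python
def extract_utterances(sentences: list[str]) -> list[int]:
    """Indices of sentences wrapped in quotation marks."""
    return [i for i, s in enumerate(sentences)
            if s.startswith('"') and s.endswith('"')]

def build_context(sentences: list[str], i: int) -> str:
    """Context = the previous three and next two sentences around utterance i."""
    return " ".join(sentences[max(0, i - 3):i] + sentences[i + 1:i + 3])

def merge_characters(mentions: list[str]) -> dict[str, str]:
    """Map each PERSON mention to the longest co-occurring name containing it,
    e.g., "Elizabeth" -> "Elizabeth Bennet" within the same context."""
    return {m: max((n for n in mentions if m in n), key=len) for m in mentions}
```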
We then employ three heuristics that attempt to identify a speaker among these characters for each utterance. The first two are commonly used rules proposed by He et al. (2013) and Muzny et al. (2017), namely Direct Speaker Identification and Conversation Alternation Patterns. We follow the same implementation as Muzny et al. (2017), except that we use an SRL model from AllenNLP in place of dependency parsers. Due to space limitations, we refer readers to that work for the details of the first two rules. Briefly, the first heuristic collects a list of speech verbs (e.g., “say”) and checks whether a speech verb connects a noun phrase to a target utterance; if so, we regard the noun phrase as the speaker of that utterance.
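A sketch of this rule over predicate-argument frames in AllenNLP's SRL output format is shown below; the abridged speech-verb list and the string-matching logic are simplified assumptions:

```python
SPEECH_VERBS = {"say", "reply", "ask", "answer", "cry"}  # abridged list

def direct_speaker(srl: dict, utterance: str) -> str | None:
    """AllenNLP SRL output: {"words": [...], "verbs": [{"verb": ..., "tags": [...]}]}.
    If a speech verb's ARG1 covers the quote, return its ARG0 as the speaker."""
    words = srl["words"]
    for frame in srl["verbs"]:
        # Lemmatization is omitted for brevity; a real rule matches verb lemmas.
        if frame["verb"].lower() not in SPEECH_VERBS:
            continue
        arg0 = " ".join(w for w, t in zip(words, frame["tags"]) if t.endswith("ARG0"))
        arg1 = " ".join(w for w, t in zip(words, frame["tags"]) if t.endswith("ARG1"))
        if arg0 and utterance.strip('"') in arg1:
            return arg0  # the noun phrase connected to the speech verb
    return None
```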
The second heuristic assumes that conversations in novels follow simple speaker alternation patterns. For example, for the consecutive utterances in Table 1, once we identify that the speaker of P3 is “Elizabeth”, we assume that she is very likely also the speaker of P5. Besides these two rules, we introduce a new heuristic based on coreference to address anaphoric speakers such as “she” in P3.
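A minimal sketch of the alternation rule, under the simplifying assumption of a two-party conversation in which turns i and i + 2 return to the same speaker:

```python
def propagate_alternation(speakers: list[str | None]) -> list[str | None]:
    """speakers[i] is the predicted speaker of the i-th consecutive utterance
    (None if unknown). In a two-party exchange, turn i + 2 alternates back to
    the speaker of turn i, so known speakers are propagated forward."""
    out = list(speakers)
    for i in range(len(out) - 2):
        if out[i] is not None and out[i + 2] is None and out[i + 1] != out[i]:
            out[i + 2] = out[i]
    return out

# e.g., ["Elizabeth", None, None] -> ["Elizabeth", None, "Elizabeth"]
```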
Local Coreference Resolution with Pronouns.
Previous work (Muzny et al., 2017) uses coreference resolution (coref) only for explicit speakers. We extend the application of coref to all pronouns in the utterances, because i) any character mention that corefers with a first-person pronoun (e.g., “I” and “me”) inside the utterance reveals the speaker, and ii) mentions that corefer with second- or third-person pronouns (e.g., “you” and “she”) should be excluded from the candidate speakers. We run the AllenNLP coref model on every three-sentence window, because coref models that perform reasonably well on short literary texts often mistakenly reduce the number of clusters in lengthy texts.
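A sketch of these two coref-based signals, assuming clusters are given as lists of mention strings; the pronoun lists are abridged and the in-quote test by token overlap is a rough approximation:

```python
FIRST_PERSON = {"i", "me", "my", "myself"}
SECOND_THIRD = {"you", "your", "he", "him", "she", "her", "they", "them"}

def coref_votes(clusters: list[list[str]], utterance: str,
                candidates: set[str]) -> dict[str, int]:
    """+1 for a candidate whose cluster contains a first-person pronoun inside
    the quote; -1 if the cluster contains a second/third-person pronoun there."""
    quote_tokens = {w.strip('“”".,!?').lower() for w in utterance.split()}
    votes = {c: 0 for c in candidates}
    for cluster in clusters:
        in_quote = {m.lower() for m in cluster} & quote_tokens
        for name in candidates & set(cluster):
            if in_quote & FIRST_PERSON:
                votes[name] += 1  # the quote's "I" corefers with this name
            elif in_quote & SECOND_THIRD:
                votes[name] -= 1  # the quote addresses or mentions this name
    return votes
```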
Soft Inference.
All three above-mentioned rule-based heuristics assign speakers separately, and their predictions can conflict with one another. As there is no hierarchy among the heuristics, we employ soft assignment by letting each rule “vote for” or “vote against” a candidate. We then assign the speaker with the highest vote count to each utterance.
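A minimal sketch of the vote tally, where the net-positive threshold for accepting a speaker is an illustrative choice:

```python
from collections import Counter

def soft_inference(vote_sets: list[dict[str, int]]) -> str | None:
    """Sum the per-candidate votes cast by all three heuristics and return
    the candidate with the highest total, or None if no rule fired."""
    total = Counter()
    for votes in vote_sets:
        total.update(votes)  # Counter.update adds the mapped vote values
    if not total:
        return None  # RULESI abstains: no predicted speaker for this utterance
    speaker, score = total.most_common(1)[0]
    return speaker if score > 0 else None  # illustrative net-positive threshold
```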