
based on local properties. This makes the task more
challenging and hampers the possibility of taking a
completely supervised approach.
We propose an unsupervised alternative for seg-
mentation, based on two assumptions: (1) segment
boundaries correspond to places with low mutual
information between sentences over the boundary;
(2) neural language models can serve as reliable
sentence probability estimators. Based on these
assumptions, we propose a simple approach to seg-
mentation and offer extensions involving dynamic
programming. The proposed models outperform
existing methods by a substantial margin in terms
of segmentation performance. In order to adapt
the model to jointly segment and classify, we in-
corporate into the model a supervised topic clas-
sifier, trained over manually indexed one-minute
testimony segments, provided by the USC Shoah
Foundation (SF, https://sfi.usc.edu/).
Inspired by Misra et al. (2011), we also incorporate
topical coherence, as estimated by the topic
classifier, into the segmentation model.
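As a minimal illustration of assumptions (1) and (2), a boundary score can be approximated as the pointwise mutual information between a sentence and its predecessor, estimated from language-model probabilities. The sketch below uses toy log-probabilities in place of a real language model; the function names and the fixed number of boundaries k are illustrative choices, not the paper's exact procedure:

```python
def pmi_boundary_scores(logp_cond, logp_marg):
    """PMI(s_i; s_{i-1}) ~= log p(s_i | s_{i-1}) - log p(s_i).

    Low PMI means the sentence gains little from conditioning on
    its predecessor, i.e., a likely segment boundary before it.
    """
    return [c - m for c, m in zip(logp_cond, logp_marg)]

def pick_boundaries(scores, k):
    """Place k boundaries at the positions with the lowest PMI."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])

# Toy log-probabilities for sentences s_1..s_4 (s_0 opens the text):
# conditional log p(s_i | s_{i-1}) and marginal log p(s_i).
logp_cond = [-20.0, -35.0, -18.0, -22.0]
logp_marg = [-30.0, -36.0, -29.0, -31.0]

scores = pmi_boundary_scores(logp_cond, logp_marg)
print(pick_boundaries(scores, 1))  # prints [1]
```

In this toy example, sentence s_2 gains almost nothing from conditioning on s_1 (PMI of 1.0 versus 9.0 or more elsewhere), so the single boundary is placed before it.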
Our contributions are the following: (1) we
present the task of topical segmentation for run-
ning, unedited text; (2) we propose novel algorith-
mic methods for tackling the task without any man-
ual segmentation supervision, building on recent
advances in language modeling; (3) compared with
previous work, we find substantial improvements
over existing methods; (4) we compile a test set for
evaluation in the case of Holocaust testimonies; (5)
we develop domain-specific topical classifiers to
extract lists of topics for long texts.
Typically, narrative research faces a tradeoff be-
tween the number of narrative texts, which is im-
portant for computational methods, and the speci-
ficity of the narrative context, which is essential for
qualitative narrative research (Sultana et al., 2022).
Holocaust testimonies provide a unique case of a
large corpus with a specific context. Our work also
communicates with Holocaust research, seeking
methods to better access testimonies as the survivor
generation is slowly passing away (Artstein et al.,
2016). We expect our methods to promote schema-
based analysis and browsing of testimonies, en-
abling better access and understanding.
2 Previous work
Text Segmentation.
Considerable previous work
addressed the task of text segmentation, using both
supervised and unsupervised approaches. Proposed
methods for unsupervised text segmentation can
be divided into linear segmentation algorithms and
dynamic graph-based segmentation algorithms.
Linear segmentation, i.e., segmentation that is
performed on the fly, dates back to the TextTiling
algorithm (Hearst, 1997), which detects boundaries
using window-based vocabulary changes. Recently,
He et al. (2020) proposed an improvement to the
algorithm, which, unlike TextTiling, uses the vo-
cabulary of the entire dataset and not only of the
currently considered segment. TopicTiling (Riedl
and Biemann, 2012) takes a similar approach, using
LDA-based topical coherence instead of vocabu-
lary alone. This method produces topics as well as
segments. Another linear model, BATS (Wu et al.,
2020), uses combined spectral and agglomerative
clustering for topics and segments.
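The window-based idea behind TextTiling can be sketched as follows: score each gap between sentences by the cosine similarity of term-count vectors on either side, and place boundaries where similarity is lowest. This is a simplified sketch of the lexical-cohesion principle, not Hearst's full algorithm (which adds smoothing and depth scoring); the function names are our own:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_similarities(sentences, w=2):
    """Similarity between the w sentences before and after each gap.

    A low value at gap i suggests a boundary after sentence i.
    """
    toks = [s.lower().split() for s in sentences]
    sims = []
    for i in range(1, len(toks)):
        left = Counter(sum(toks[max(0, i - w):i], []))
        right = Counter(sum(toks[i:i + w], []))
        sims.append(cosine(left, right))
    return sims

sents = ["the cat sat", "the cat slept",
         "stocks fell today", "markets fell again"]
sims = gap_similarities(sents)
print(min(range(len(sims)), key=lambda i: sims[i]))  # prints 1
```

Here the gap between the cat sentences and the finance sentences has zero lexical overlap within the window, so it is selected as the boundary.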
In contrast to the linear approach, several mod-
els follow a Bayesian sequence modeling approach,
using dynamic programming for inference. This
approach allows making a global prediction of the
segmentation, at the expense of higher complex-
ity. Implementation details vary, and include using
pretrained LDA models (Misra et al., 2011), online
topic estimation (Eisenstein and Barzilay, 2008;
Mota et al., 2019), shared topics (Jeong and Titov,
2010), ordering-based topics (Du et al., 2015), and
context-aware LDA (Li et al., 2020b).
Following recent advances, neural models have
also been used for the task of supervised text seg-
mentation. Pethe et al. (2020) introduced Chapter-
Captor, which relies on two methods.
The first method performs chapter break prediction
based on Next Sentence Prediction (NSP) scores.
The second method uses dynamic programming to
regularize the segment lengths towards the aver-
age. The method uses supervision to finetune the
boundary-scoring model, but can also be applied
in a completely unsupervised fashion. They experi-
ment with segmenting books into chapters, which
offers natural incidental supervision.
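The length-regularization step can be illustrated with a small dynamic program: given a cost for placing a break at each gap, choose boundaries that trade off break cost against deviation from the average segment length. This is a hypothetical simplification of the dynamic programming described above; the quadratic penalty and the `alpha` weight are illustrative choices:

```python
def segment_dp(break_cost, n_sents, k, alpha=1.0):
    """Split n_sents sentences into k segments, minimizing the sum
    of break costs plus a quadratic penalty on deviation from the
    average segment length. break_cost[i] is the cost of a boundary
    after sentence i+1; the boundary at the text end is free.
    """
    avg = n_sents / k
    INF = float("inf")
    # dp[j][i]: best cost of covering the first i sentences with j segments
    dp = [[INF] * (n_sents + 1) for _ in range(k + 1)]
    back = [[None] * (n_sents + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n_sents + 1):
            for p in range(i):  # previous boundary after sentence p
                if dp[j - 1][p] == INF:
                    continue
                cost = dp[j - 1][p] + alpha * (i - p - avg) ** 2
                if i < n_sents:  # internal boundary after sentence i
                    cost += break_cost[i - 1]
                if cost < dp[j][i]:
                    dp[j][i] = cost
                    back[j][i] = p
    # Recover internal boundary positions by backtracking.
    bounds, i = [], n_sents
    for j in range(k, 0, -1):
        i = back[j][i]
        if i:
            bounds.append(i)
    return sorted(bounds)

# Six sentences, two segments; the cheap break sits after sentence 3,
# which also matches the average length exactly.
print(segment_dp([5.0, 5.0, 0.0, 5.0, 5.0], 6, 2))  # prints [3]
```

With k = 2 and an average length of 3, the boundary after sentence 3 incurs zero length penalty and zero break cost, so the dynamic program selects it over the expensive alternatives.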
Another approach performs the segmentation
task in a completely supervised manner, similar to
supervised labeled span extraction tasks. At first,
the models were LSTM-based (Koshorek et al.,
2018;Arnold et al.,2019), and later on, Trans-
former based (Somasundaran et al.,2020;Lukasik
et al.,2020). Unlike finetuning, this approach re-
quires a large amount of segmented data.
All of these works were designed and evaluated
on structured written text, such as book chapters,