arXiv:2210.13783v2 [cs.CL] 3 Dec 2022
Topical Segmentation of Spoken Narratives:
A Test Case on Holocaust Survivor Testimonies
Eitan Wagner, Renana Keydar, Amit Pinchevski, Omri Abend
Department of Computer Science; Faculty of Law and Digital Humanities;
Department of Communication and Journalism
Hebrew University of Jerusalem
{first_name}.{last_name}@mail.huji.ac.il
Abstract
The task of topical segmentation is well stud-
ied, but previous work has mostly addressed it
in the context of structured, well-defined seg-
ments, such as segmentation into paragraphs,
chapters, or segmenting text that originated
from multiple sources. We tackle the task
of segmenting running (spoken) narratives,
which poses hitherto unaddressed challenges.
As a test case, we address Holocaust survivor
testimonies, given in English. Other than the
importance of studying these testimonies for
Holocaust research, we argue that they pro-
vide an interesting test case for topical seg-
mentation, due to their unstructured surface
level, relative abundance (tens of thousands of
such testimonies were collected), and the rel-
atively confined domain that they cover. We
hypothesize that boundary points between segments correspond to low mutual information between the sentences preceding and following the boundary. Based on this hypothesis, we
explore a range of algorithmic approaches to
the task, building on previous work on segmen-
tation that uses generative Bayesian modeling
and state-of-the-art neural machinery. Com-
pared to manually annotated references, we
find that the developed approaches show con-
siderable improvements over previous work.1
1 Introduction
Proper representation of narratives in long texts re-
mains an open problem in NLP (Piper et al.,2021;
Castricato et al.,2021;Mikhalkova et al.,2020).
High-quality representations for long texts seem
crucial to the development of document-level text
understanding technology, which is currently unsat-
isfactory (Shaham et al.,2022). A common modern
approach for modeling narratives is as a sequence
of neural states (Wilmot and Keller,2020,2021;
Rashkin et al.,2020). However, a drawback of this
1 Code is provided at https://github.com/eitanwagner/holocaust-segmentation
Figure 1: Topical segmentation in Holocaust testimonies.
approach is the lack of interpretability, which is
crucial in some contexts.
A different approach represents and visualizes
a narrative as a sequence of interpretable topics
(Antoniak et al.,2019). Inspired by this approach,
we seek to model the narrative of a text using topic
segmentation, dividing long texts into topically co-
herent segments and labeling them, thus creating
a global topical structure in the form of a chain of
topics. Topic segmentation can be useful for the
indexing of a large number of testimonies (tens of
thousands of testimonies have been collected thus
far) and as an intermediate or auxiliary step in tasks
such as summarization (Wu et al.,2021) and event
detection (Wang et al.,2021).
Unlike recent supervised segmentation mod-
els that focus on structured written text, such as
Wikipedia sections (Arnold et al.,2019;Lukasik
et al.,2020) or book chapters (Pethe et al.,2020),
we address the hitherto mostly unaddressed task of
segmenting and labeling unstructured (transcribed)
spoken language. For these texts, we do not have large datasets of segmented text. Moreover, there may
not be any obvious boundaries that can be derived based on local properties. This makes the task more
challenging and hampers the possibility of taking a
completely supervised approach.
We propose an unsupervised alternative for seg-
mentation, based on two assumptions: (1) segment
boundaries correspond to places with low mutual
information between sentences over the boundary;
(2) neural language models can serve as reliable
sentence probability estimators. Based on these
assumptions, we propose a simple approach to seg-
mentation and offer extensions involving dynamic
programming. The proposed models give a sub-
stantial margin over the existing methods in terms
of segmentation performance. In order to adapt
the model to jointly segment and classify, we in-
corporate into the model a supervised topic clas-
sifier, trained over manually indexed one-minute
testimony segments, provided by the USC Shoah Foundation (SF).2 Inspired by Misra et al. (2011), we also incorporate topical coherence, based on the topic classifier, into the segmentation model.
Our contributions are the following: (1) we
present the task of topical segmentation for run-
ning, unedited text; (2) we propose novel algorith-
mic methods for tackling the task without any man-
ual segmentation supervision, building on recent
advances in language modeling; (3) comparing to
previous work, we find substantial improvements
over existing methods; (4) we compile a test set for
evaluation in the case of Holocaust testimonies; (5)
we develop domain-specific topical classifiers to
extract lists of topics for long texts.
Typically, narrative research faces a tradeoff be-
tween the number of narrative texts, which is im-
portant for computational methods, and the speci-
ficity of the narrative context, which is essential for
qualitative narrative research (Sultana et al.,2022).
Holocaust testimonies provide a unique case of a
large corpus with a specific context. Our work also
communicates with Holocaust research, seeking
methods to better access testimonies as the survivor
generation is slowly passing away (Artstein et al.,
2016). We expect our methods to promote schema-
based analysis and browsing of testimonies, en-
abling better access and understanding.
2 Previous work
Text Segmentation.
Considerable previous work
addressed the task of text segmentation, using both
supervised and unsupervised approaches. Proposed methods for unsupervised text segmentation can be divided into linear segmentation algorithms and dynamic graph-based segmentation algorithms.

2 https://sfi.usc.edu/
Linear segmentation, i.e., segmentation that is
performed on the fly, dates back to the TextTiling
algorithm (Hearst,1997), which detects boundaries
using window-based vocabulary changes. Recently,
He et al. (2020) proposed an improvement to the
algorithm, which, unlike TextTiling, uses the vo-
cabulary of the entire dataset and not only of the
currently considered segment. TopicTiling (Riedl
and Biemann,2012) uses a similar approach, using
LDA-based topical coherence instead of vocabu-
lary only. This method produces topics as well as
segments. Another linear model, BATS (Wu et al.,
2020), uses combined spectral and agglomerative
clustering for topics and segments.
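The window-based idea behind this line of work can be sketched as follows. This is a minimal illustration of lexical gap scoring, not Hearst's full TextTiling algorithm (which adds smoothing and depth scores); the function names are ours.

```python
import math
from collections import Counter

def cosine(c1, c2):
    # Cosine similarity between two bags of words.
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def tiling_scores(sentences, w=2):
    # Gap scores in the spirit of TextTiling: lexical similarity between the
    # w sentences before and after each gap; a low score suggests a boundary.
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(t for s in sentences[max(0, gap - w):gap] for t in s)
        right = Counter(t for s in sentences[gap:gap + w] for t in s)
        scores.append(cosine(left, right))
    return scores
```

Here each sentence is a list of tokens; a sharp drop in the score sequence marks a vocabulary shift.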
In contrast to the linear approach, several mod-
els follow a Bayesian sequence modeling approach,
using dynamic programming for inference. This
approach allows making a global prediction of the
segmentation, at the expense of higher complex-
ity. Implementation details vary, and include using
pretrained LDA models (Misra et al.,2011), online
topic estimation (Eisenstein and Barzilay,2008;
Mota et al.,2019), shared topics (Jeong and Titov,
2010), ordering-based topics (Du et al.,2015), and
context-aware LDA (Li et al.,2020b).
Following recent advances in neural models,
these models have been used for the task of super-
vised text segmentation. Pethe et al. (2020) introduced ChapterCaptor, which relies on two methods.
The first method performs chapter break prediction
based on Next Sentence Prediction (NSP) scores.
The second method uses dynamic programming to
regularize the segment lengths toward the average. The method uses supervision to finetune the boundary scores, but can also be used in a completely unsupervised fashion. They experiment with segmenting books into chapters, which offers natural incidental supervision.
Another approach performs the segmentation
task in a completely supervised manner, similar to
supervised labeled span extraction tasks. At first,
the models were LSTM-based (Koshorek et al.,
2018;Arnold et al.,2019), and later on, Transformer-based (Somasundaran et al.,2020;Lukasik
et al.,2020). Unlike finetuning, this approach re-
quires a large amount of segmented data.
All of these works were designed and evaluated
with structured written text, such as book chapters,
Wikipedia pages, or artificially stitched segments,
where supervised data is abundant. In this work, we
address the segmentation of texts for which we have little supervised data regarding segment boundaries. We therefore adopt elements from the unsupervised approaches, combined with supervised components, and design a model for a novel segmentation task of unstructured spoken narratives.
Narrative analysis.
Much work has been done
in the direction of probabilistic schema inference,
focusing on either event schemas (Chambers and
Jurafsky,2009;Chambers,2013;Li et al.,2020a)
or persona schemas (Bamman et al.,2013,2014).
Recently, neural models were utilized for story
modeling. Wilmot and Keller (2020) presented a
neural GPT2-based model for suspense in short
stories. This work follows an information-based
framework, modeling the reader’s suspense by dif-
ferent types of predictability. Due to their strong
performance in text generation, neural models are
commonly used for story generation, with numer-
ous structural variations (Zhai et al.,2019;Rashkin
et al.,2020;Alhussain and Azmi,2021).
Narrative analysis can help in conveying the
essence of stories, without all the details. This
can aid the meta-analysis of stories. Min and Park
(2019) visualized plot progressions in stories in
various ways, including the progression of char-
acter relations. Antoniak et al. (2019) analyzed
birth stories, using simplistic, uniform segmenta-
tion with topic modeling to visualize the frequent
topic paths.
3 Methods
We have a document X consisting of n sentences x1, …, xn, which we consider as atomic units. Our task is to find k−1 boundary points, defining k segments, and k topics, where every consecutive pair of topics is different.
3.1 Design Principles of Used Methods
Designing a model for topical segmentation in-
volves multiple, possibly independent, consider-
ations which we present here.
Local Potential-Boundary Scores.
A simple ap-
proach to text segmentation involves giving inde-
pendent local scores to each possible boundary.
Given these scores and the desired number of seg-
ments, we can then select the best boundaries.
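As a minimal sketch of this selection step (the function name is ours): given a local affinity score for each candidate boundary, where a higher score means the adjacent sentences are more likely to belong together, we take the k−1 lowest-scoring positions.

```python
def select_boundaries(scores, k):
    # scores[i] is the affinity between sentence i and sentence i+1.
    # Pick the k-1 positions with the lowest affinity as segment boundaries.
    lowest = sorted(range(len(scores)), key=lambda i: scores[i])[:k - 1]
    return sorted(lowest)
```

This greedy selection treats each boundary independently; the non-local scores discussed below relax exactly that assumption.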
Recent work in this direction uses the Next Sen-
tence Prediction (NSP) scores (Pethe et al.,2020).
Given two sentences x1, x2, their NSP score is defined as the predicted probability that the second
sentence actually came after the first and not from
somewhere else. The prediction is usually carried
out using a pretrained model with a self-supervised
training protocol and is typically further finetuned
for a specific task.
We argue that the pretrained NSP scores do not
capture the probability of two given sequential sen-
tences being in the same segment, since even if the
second sentence is in a new segment, it still is the
next sentence. Therefore, we expect this approach
to perform poorly in settings for which there are
not enough segmented texts for finetuning.
Instead, we propose to use Point-wise Mutual
Information (PMI) for the local boundary scores.
Given a language model (LM), we hypothesize that
the mutual information between two adjacent sen-
tences can predict how likely the two sentences
are to be in the same segment. These scores need no additional supervision beyond the LM pretraining.
Given these scores, extracting a segmentation for a given text is equivalent to maximizing the LM likelihood of the text, under the assumptions that each sentence depends only on the previous sentence and that each segment depends on no previous sentences (for proof see Appendix A).
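The PMI criterion can be illustrated with any LM that exposes (conditional) log-probabilities. The toy add-one-smoothed bigram model below is our stand-in for a neural LM, kept only to make the sketch self-contained; the class and function names are illustrative.

```python
import math
from collections import Counter

class BigramLM:
    # Toy add-one-smoothed bigram LM, a stand-in for a neural LM.
    def __init__(self, tokens):
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.total = len(tokens)
        self.vocab = len(self.unigrams) + 1  # +1 for unseen tokens

    def log_prob(self, tokens, context=None):
        # log p(tokens | context): each token conditioned on its predecessor.
        lp = 0.0
        prev = context[-1] if context else None
        for tok in tokens:
            if prev is None:
                lp += math.log((self.unigrams[tok] + 1) / (self.total + self.vocab))
            else:
                lp += math.log((self.bigrams[(prev, tok)] + 1) /
                               (self.unigrams[prev] + self.vocab))
            prev = tok
        return lp

def pmi(lm, sent1, sent2):
    # PMI(x1; x2) = log p(x2 | x1) - log p(x2).
    # A low value suggests a topical boundary between the two sentences.
    return lm.log_prob(sent2, context=sent1) - lm.log_prob(sent2)
```

Adjacent sentences within a topically coherent segment raise each other's conditional probability and hence score a high PMI; a candidate boundary is a gap where conditioning on the preceding sentence helps little.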
Non-local Scores.
Full segmentation of text in-
volves the selection of multiple boundaries, and
these selections might not be independent. Even
a single segment directly involves two boundaries.
Therefore, we might want to use scores that take
into account properties that involve more than one
boundary. Given scores for all possible segments,
we can optimize for the maximal total score over
all possible segmentations.
A simple property that was used in previous
work is the segment length (Pethe et al.,2020),
with a higher score given to segments whose length
is closer to the expected length. These scores can
be helpful if we assume that segment lengths tend to be close to uniform. These scores can also be
used in a conditional manner, in case we have es-
timates for the segment lengths of different topics
or in different locations of the whole text. Segment
length scores require the consideration of at least
two corresponding boundaries for each score.
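A dynamic program over such segment scores can be sketched as follows; `seg_score` is a placeholder for any scoring function over a candidate segment (e.g., length regularization or topical coherence), and the function name is ours, not the paper's implementation.

```python
def best_segmentation(n, k, seg_score):
    # best[j][t]: max total score of splitting the first j sentences into t
    # segments; seg_score(i, j) scores the candidate segment [i, j).
    NEG = float("-inf")
    best = [[NEG] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(1, n + 1):
        for t in range(1, min(j, k) + 1):
            for i in range(t - 1, j):  # previous split point
                if best[i][t - 1] == NEG:
                    continue
                cand = best[i][t - 1] + seg_score(i, j)
                if cand > best[j][t]:
                    best[j][t], back[j][t] = cand, i
    # Walk back to recover the k-1 internal boundary indices.
    bounds, j = [], n
    for t in range(k, 0, -1):
        j = back[j][t]
        bounds.append(j)
    return sorted(b for b in bounds if b > 0)
```

The O(n^2 k) triple loop is the price of globally optimal boundaries, in contrast to the independent per-boundary selection above.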
Another property that was used in previous work
is topic scores (Misra et al.,2011). Given some
Topic Model (TM), we can use the generation log-
likelihood of a segment as its score. Alternatively,