Controlled Text Reduction
Aviv Slobodkin, Paul Roit, Eran Hirsch, Ori Ernst, Ido Dagan
Bar-Ilan University
{lovodkin93,plroit,hirsch.eran,oriern}@gmail.com
dagan@cs.biu.ac.il
Abstract
Producing a reduced version of a source text,
as in generic or focused summarization, in-
herently involves two distinct subtasks: decid-
ing on targeted content and generating a co-
herent text conveying it. While some popu-
lar approaches address summarization as a sin-
gle end-to-end task, prominent works support
decomposed modeling for individual subtasks.
Further, semi-automated text reduction is also
very appealing, where users may identify tar-
geted content while models would generate a
corresponding coherent summary.
In this paper, we focus on the second subtask,
of generating coherent text given pre-selected
content. Concretely, we formalize Controlled
Text Reduction as a standalone task, whose
input is a source text with marked spans of
targeted content ("highlighting"). A model
then needs to generate a coherent text that in-
cludes all and only the target information. We
advocate the potential of such models, both
for modular fully-automatic summarization, as
well as for semi-automated human-in-the-loop
use cases. Facilitating proper research, we
crowdsource high-quality dev and test datasets
for the task. Further, we automatically gener-
ate a larger "silver" training dataset from avail-
able summarization benchmarks, leveraging a
pretrained summary-source alignment model.
Finally, employing these datasets, we present
a supervised baseline model, showing promis-
ing results and insightful analyses.1
1 Introduction
Abstractive text summarization takes one or more
documents as input and aims at generating an accu-
rate and coherent summary of them. It requires both
locating salient information in the input and then
1 Our data and code are released for open access:
https://huggingface.co/datasets/biu-nlp/Controlled-Text-Reduction-dataset
https://github.com/lovodkin93/Controlled_Text_Reduction
generating a concise text covering it. While some
modern state-of-the-art abstractive summarization
models treat the task as a single end-to-end task, it
has been common practice for summarization mod-
els to separate the salience detection phase from
the text generation phase (Barzilay and McKeown, 2005; Oya et al., 2014; Banerjee et al., 2016; Vilca and Cabezudo, 2017), with renewed popularity in recent years (Lebanoff et al., 2019, 2020a,b; Xiao et al., 2022; Ernst et al., 2021a; Gehrmann et al., 2018a; Chen and Bansal, 2018; Cho et al., 2019).
However, though these proposed techniques comprised distinguishable subtasks, evaluation was performed on the whole summarization pipeline, rather than on each step separately.
In this paper, we focus on the text generation
step, while addressing it as a standalone task at the
sub-sentence level. To that end, we introduce a new
task which we denote Controlled Text Reduction.
The task takes as input a document with pre-chosen
salient spans in it, which we will henceforth call
highlights. A model is then expected to reduce the
document to a smaller coherent text which covers
all and only the highlighted content, i.e., consoli-
dating the highlighted spans into a fluent and coher-
ent passage, as exemplified in Figure 1. This task
poses a challenge, as it requires generating fluent
and grammatical text from non-consecutive spans
while keeping it faithful to the source document.
Hence, to balance the coherency and faithfulness
constraints, models will be expected to use the
context document to fill in implied details and to
properly connect the different spans.
Focusing on this task can facilitate greater con-
trol over the generated text. It could lead to a mod-
ular summarization pipeline, where text-generation
models can be trained once, and then used with dif-
ferent content selections to accommodate different
needs. For example, we may envision a user (e.g., a
student) pre-selecting the desirable textual content
(either manually or via a designated model) while
arXiv:2210.13449v1 [cs.CL] 24 Oct 2022
Figure 1: An example of an input, consisting of a source document and highlights (left), and the generated passage
covering the highlighted content while preserving coherence (right). Such highlights in realistic use cases may be
produced either by a human user or by a salience detection model.
focusing on personal needs, possibly interactively
(Hirsch et al., 2021; Shapira et al., 2021). Then, an
available controlled text reduction module would
transform the pre-selected fragments into a concise
summary. Also, separating the content selection and generation stages can lead to developing data-efficient systems: one to model salient content and another to generate the text. It could also lead to a more efficient characterization and study of each step separately, without the need for probing, which is the prevailing approach in end-to-end models (Conneau et al., 2018; Tenney et al., 2019a,b; Slobodkin et al., 2021; Pandit and Hou, 2021).
To promote research on the advocated text re-
duction task, we first develop a suitable controlled
crowdsourcing methodology, following Roit et al.
(2020), and apply it to produce high-quality dev
and test datasets (§4). Next, we automatically generate a larger training dataset by aligning propositional units of information (Ernst et al., 2021b), extracted with OpenIE (Stanovsky et al., 2018), between source documents and their summaries (§5).
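The silver-data generation relies on a pretrained summary-source alignment model over OpenIE propositions. Purely as an illustration of the alignment idea, not the paper's actual model, a naive lexical-overlap stand-in could look like this (the function, threshold, and scoring are our assumptions):

```python
def align_greedy(summary_units, source_sents, threshold=0.4):
    """Toy stand-in for a summary-source alignment model: map each summary
    unit (e.g., an OpenIE proposition) to the source sentence with the
    highest content-word Jaccard overlap, keeping only confident matches."""
    def content_words(text):
        # crude content-word extraction: lowercase, drop short tokens
        return {w.lower().strip(".,") for w in text.split() if len(w) > 3}

    alignments = []
    for unit in summary_units:
        unit_words = content_words(unit)
        best_idx, best_score = None, 0.0
        for idx, sent in enumerate(source_sents):
            sent_words = content_words(sent)
            union = unit_words | sent_words
            score = len(unit_words & sent_words) / len(union) if union else 0.0
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            # the aligned source sentence (or sub-span) would become a highlight
            alignments.append((unit, best_idx))
    return alignments
```

In the actual pipeline, the matched source spans (rather than whole sentences) serve as the silver highlights paired with the reference summary.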
We use this data to train an abstractive supervised model, and evaluate its performance on our test set while comparing it to an extractive reference baseline, which simply concatenates the highlights. We also perform analyses in which we manipulate the highlights, showing that the addition of highlights to a supervised model is helpful in steering the model toward the pre-selected content, in addition to improving overall faithfulness and fluency (§8).
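The extractive reference baseline, which simply concatenates the highlighted spans, can be sketched as follows; the exact span ordering and whitespace handling are our assumptions:

```python
def concat_baseline(document: str, highlights) -> str:
    """Extractive reference baseline: return the highlighted spans verbatim,
    in document order, joined by single spaces (no rewriting for coherence)."""
    spans = [document[start:end].strip() for start, end in sorted(highlights)]
    return " ".join(span for span in spans if span)

doc = "The storm hit Florida on Monday. Damages were severe."
print(concat_baseline(doc, [(0, 13), (33, 53)]))
```

By construction this baseline has perfect content coverage but no mechanism for fluency, which is exactly the gap the abstractive model is meant to close.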
Hence, the contribution of this paper is manifold:
1. Proposing the "Controlled Text Reduction" task as a standalone module in automated or semi-automated use cases.
2. Defining an intuitive and easy-to-reproduce crowdsourcing method for the task.
3. Constructing the first data suite for the task, including crowd-sourced dev and test sets and an automatically-generated train set.
4. Developing a supervised baseline model for future work.
2 Background
In this section, we briefly review related work and
discuss the limitations of their framing.
As mentioned above, much of the related previous work focused primarily on end-to-end summarization (Carbonell and Goldstein, 1998; Haghighi and Vanderwende, 2009; Nallapati et al., 2016b,a; Paulus et al., 2017; Gehrmann et al., 2018b), with the vast majority of related datasets aimed at end-to-end summarization (Fabbri et al., 2019; Kim et al., 2019; Ghalandari et al., 2020), with only a source document as input. On the other hand, research on leveraging control through the injection of pre-chosen (rather than learned) signals in the seq-to-seq scenario focused mostly on semantic and syntactic signals, and also almost exclusively targeted Machine Translation models (Bugliarello and Okazaki, 2020; Akoury et al., 2019; Sundararaman et al., 2019; Choshen and Abend, 2021; Slobodkin et al., 2022).
Attempts to leverage some control over the generation step in summarization have received attention in recent years in the form of query-focused summarization (Baumel et al., 2018; Xu and Lapata, 2020, 2021; Wei and Zhizhuo, 2017) and keyword-focused summarization (Keskar et al., 2019; He et al., 2020), with a few recently published corresponding datasets (Pasunuru et al., 2021; Kulkarni et al., 2020; Baumel et al., 2016). A related line of work leveraged control through the addition of a planning step (Zhao et al., 2020; Narayan et al., 2021). Although these lines of research allowed for some control over salience, this control was limited and mostly focused on biasing the summary’s topic, style, or structure.
The prevailing way to treat summarization in earlier works was to separate the salience detection phase from the text generation phase (Barzilay and McKeown, 2005; Oya et al., 2014; Banerjee et al., 2016; Vilca and Cabezudo, 2017), yet the evaluation was performed on the whole pipeline.
Figure 2: The Highlighting Annotation UI, presenting a document and its corresponding summary. Saved align-
ments have a faded yellow background, whereas currently selected alignments (which haven’t been saved yet)
have a normal yellow background. The current summary sentence is marked in a red box. Also, the bold feature
is activated, meaning the document words which are related to those in the summary sentence are boldfaced (see
§4.1).
Some recent work focused on salience detection (Ernst et al., 2021a,b; Gehrmann et al., 2018a; Chen and Bansal, 2018; Cho et al., 2019), whereas the generation step has mostly been explored in a full-sentence-fusion setting (Geva et al., 2019; Lebanoff et al., 2019, 2020b; Xiao et al., 2022), rather than at the sub-sentence level. Lebanoff et al. (2020a) took it one step further, leveraging sentence fusion through a fine-grained content selection algorithm. But, though they did perform some analysis of this additional step by comparing different salience detection strategies, their evaluation focused on the full pipeline, similarly to their predecessors.
There has also been some work on extracting salient information from source documents in the form of highlights (Cho et al., 2020; Arumae et al., 2019). Yet, though acknowledging the full potential of using highlights to mark salient information in the source document, this work mainly focused on the process of obtaining the highlights, overlooking their actual usage in subsequent generation tasks, and in summarization in particular. Moreover, these lines of work focused solely on automatic highlight detection, lacking any crowdsourced annotation scheme. There has also been work that pre-identified salient parts as input to the generation phase (Chen and Bansal, 2018; Xu et al., 2020; Liu et al., 2021; Deutsch and Roth, 2021). But, contrary to our work, the salience detection and generation tasks were addressed and evaluated jointly, without assessing the quality of each individual task.
All those research directions recognized the potential of separating the summarization task into subtasks and performing each subtask explicitly. However, they all evaluated the subtasks jointly, and in doing so overlooked the potential lying in the optimization and characterization of each task individually, and specifically the generation task given content selection. In this work, we propose to isolate the generation task given pre-selected content, treating it as a standalone task, thus promoting focused evaluation and model design.
3 Task Definition
We define the Controlled Text Reduction task as
follows. Given a document and a set of marked
spans within that document, denoted as highlights,
produce a coherent output text encompassing only
the information provided within those highlights
(see Figure 1). The desirable output should ad-
here to two requirements beyond coherency: (1)
Its content has to be derived from the highlights
alone, keeping any additional document premises
to the minimum required for coherency; (2) The
output has to retain all of the details covered by the
highlighted spans.
Such requirements give rise to many interest-
ing challenges, such as recognizing the connecting
thread between disparate spans and faithfully repre-
senting the information contained within them. We
forgo a strict definition for a highlighted span and
allow possibly marking sub-sentence elements: an
entity or a clause, even discontinuous descriptions
of these (e.g., the last two highlights in Figure 1).
Figure 3: Illustration of Highlighting Annotation process for a summary sentence: [1] A summary fact is located
and highlighted; [2] The matching document spans are highlighted, and the alignment is saved; [3] Another sum-
mary fact is identified and highlighted; [4] The matching document spans are highlighted, and the alignment is
saved; [5] When the summary sentence is fully highlighted, we proceed to the next sentence, and so on. In this
example, the summary consists of two facts, but steps 1 and 2 can be repeated as needed per sentence, until all its
propositions (facts) are covered.
Hence, the input highlights may be disconnected both in their surface realization (i.e., they may not form a grammatical sequence) and in their semantic flow.
Figure 1 features an input-output example. The
output covers exclusively and completely the high-
lighted information while using the source docu-
ment’s context to connect the disparate spans.
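To make the task input concrete, a minimal sketch of a task instance follows. The dataclass, the character-offset span representation, and the `<h>`/`</h>` highlight markers are illustrative assumptions on our part, not the paper's prescribed encoding:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReductionInstance:
    """One Controlled Text Reduction input: a document plus highlight spans."""
    document: str
    highlights: List[Tuple[int, int]]  # character offsets (start, end); spans may be discontinuous

    def marked_input(self, open_tok: str = "<h>", close_tok: str = "</h>") -> str:
        """Render the document with markers around each highlighted span,
        as one plausible way to feed highlights to a seq-to-seq model."""
        out, prev = [], 0
        for start, end in sorted(self.highlights):
            out.append(self.document[prev:start])
            out.append(open_tok + self.document[start:end] + close_tok)
            prev = end
        out.append(self.document[prev:])
        return "".join(out)

inst = ReductionInstance(
    document="The storm hit Florida on Monday. Damages were severe.",
    highlights=[(0, 13), (33, 53)],
)
print(inst.marked_input())
```

Note that the unhighlighted context stays in the input, matching the task's requirement that models may consult the surrounding document to connect the spans coherently.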
4 Gold Dataset for Evaluation
We leverage different summarization datasets to
annotate a high-quality dataset for the evaluation
of controlled-reduction systems. In summariza-
tion, every summary arises from a set of salient
document spans. Exploiting this in our annotation
process, we wish to "reverse-engineer" each sum-
mary and locate the spans in the document that led
to its construction. This significantly reduces the annotation complexity and load: instead of compiling a new text given a set of highlighted spans, an annotator has to highlight document spans given the output text (i.e., the summary).
To create our development and test partitions, we sample 121 and 108 unique documents from the DUC 2001 and 2002 Single-Document-Summarization (SDS) datasets2, respectively. Each document is accompanied by up to 4 different reference summaries (with an average of 2.14 summaries per document), resulting in a total of 488 unique document-summary pairs (see Table 1 for full statistics and §A for preprocessing details).
We build an intuitive and convenient annotation tool for extracting highlights from document-summary pairs3, designed to be embedded into crowdsourcing platforms (see §4.1 and Figure 2).
Given the complexity of our task, we follow Roit et al. (2020)’s controlled crowdsourcing setup, including principled steps of annotator recruitment and training, leading to a trusted and qualified group of annotators employed for the annotation process.
2 https://duc.nist.gov/
3 https://github.com/lovodkin93/highlights-extract-app
4.1 Annotation Process
To annotate document spans whose content corresponds to the summary content, we build a web-based user interface that is published on Amazon Mechanical Turk4 and used by crowd-workers (see
Figure 2). An annotator is presented with a docu-
ment and its reference summary side-by-side and
is instructed to highlight all of the phrases in the
document whose content corresponds to the sum-
mary (see yellow background in Figure 2). To fa-
cilitate accurate and systematic processing of each
instance, workers are asked to align spans from the
summary that comprise a single fact to minimal
spans in the document which cover them. Thus,
annotators create a series of alignments that cover
every piece of information in the summary (see
Figure 3 for illustration of the annotation flow).
We observed that processing summary text one
fact at a time substantially focuses the annotators’
attention and expedites the search for relevant spans
in the document. This is exemplified when a single
sentence in the summary is comprised of details
that are mentioned in different locations spread out
across the source document (e.g., the first summary
sentence in Figure 1). Further, to streamline the pro-
cess, we segment the document into paragraphs and
bolden content words in the document that share
the same lemma with words in the current sum-
mary sentence (see document side in Figure 2 and
also §A for details). This method helps the human
annotator to skim quickly through the document
and is relatively bias-free. It is our assumption that
4 www.mturk.com