
Figure 1: An example of an input, consisting of a source document and highlights (left), and the generated passage
covering the highlighted content while preserving coherence (right). Such highlights in realistic use cases may be
produced either by a human user or by a salience detection model.
focusing on personal needs, possibly interactively
(Hirsch et al., 2021; Shapira et al., 2021). Then, an
available controlled text reduction module would
transform the pre-selected fragments into a concise
summary. Separating the content selection and
generation stages can also lead to more data-efficient
systems, one modeling salient content and the other
generating the text. It would further allow
characterizing and studying each step separately,
without the need for probing, which is the prevailing
approach in end-to-end models (Conneau et al., 2018;
Tenney et al., 2019a,b; Slobodkin et al., 2021;
Pandit and Hou, 2021).
To promote research on the advocated text reduction
task, we first develop a suitable controlled
crowdsourcing methodology, following Roit et al.
(2020), and apply it to produce high-quality dev
and test datasets (§4). Next, we automatically
generate a larger training dataset by aligning
propositional units of information (Ernst et al., 2021b),
extracted with OpenIE (Stanovsky et al., 2018),
between source documents and their summaries (§5).
We use this data to train an abstractive supervised
model and evaluate its performance on our test set,
comparing it to an extractive reference baseline,
which simply concatenates the highlights.
We also perform analyses in which we manipulate the
highlights, showing that adding highlights to a
supervised model helps steer the model toward the
pre-selected content, in addition to improving
overall faithfulness and fluency (§8).
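The extractive reference baseline above can be sketched in a few lines. The following is a minimal illustration, not the authors' actual implementation; the function name and the representation of highlights as character-offset spans are assumptions made for clarity.

```python
# Illustrative sketch of the extractive reference baseline: concatenate the
# highlighted spans of the source document, in document order. The input
# format (character-offset spans) is an assumption, not the paper's actual code.

def concatenate_highlights(document: str, highlights: list[tuple[int, int]]) -> str:
    """Return the highlighted spans, sorted by position and joined with spaces."""
    ordered = sorted(highlights)
    return " ".join(document[start:end].strip() for start, end in ordered)

doc = "The committee met on Monday. It approved the budget. Members then adjourned."
spans = [(0, 28), (29, 53)]
print(concatenate_highlights(doc, spans))
# -> The committee met on Monday. It approved the budget.
```

Such a baseline preserves the selected content exactly but, unlike the abstractive model, makes no attempt to restore coherence across the concatenated fragments.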
Hence, the contributions of this paper are manifold:
1. Proposing the "Controlled Text Reduction" task as a standalone module in automated or semi-automated use cases.
2. Defining an intuitive and easy-to-reproduce crowdsourcing method for the task.
3. Constructing the first data suite for the task, including crowdsourced dev and test sets and an automatically generated train set.
4. Developing a supervised baseline model for future work.
2 Background
In this section, we briefly review related work and
discuss the limitations of its framing.
As mentioned above, much of the related previous
work focused primarily on end-to-end summarization
(Carbonell and Goldstein, 1998; Haghighi and
Vanderwende, 2009; Nallapati et al., 2016b,a;
Paulus et al., 2017; Gehrmann et al., 2018b), with
the vast majority of related datasets likewise
designed for the end-to-end setting (Fabbri et al., 2019;
Kim et al., 2019; Ghalandari et al., 2020), taking
only a source document as input. On the other hand,
research on leveraging control through the injection
of pre-chosen (rather than learned) signals in the
sequence-to-sequence scenario has focused mostly on
semantic and syntactic signals, and almost exclusively
targeted Machine Translation models (Bugliarello
and Okazaki, 2020; Akoury et al., 2019; Sundararaman
et al., 2019; Choshen and Abend, 2021; Slobodkin
et al., 2022).
Attempts to exert some control over the generation
step in summarization have received attention in
recent years, in the form of query-focused
summarization (Baumel et al., 2018; Xu and Lapata,
2020, 2021; Wei and Zhizhuo, 2017) and keyword-focused
summarization (Keskar et al., 2019; He et al., 2020),
with a few recently published corresponding datasets
(Pasunuru et al., 2021; Kulkarni et al., 2020;
Baumel et al., 2016). A related line of work
leveraged control through the addition of a planning
step (Zhao et al., 2020; Narayan et al., 2021).
Although these lines of research allowed for some
control over salience, this control was limited and
mostly focused on biasing the summary's topic,
style, or structure.
The prevailing way to treat summarization in earlier
works was to separate the salience detection phase
from the text generation phase (Barzilay and
McKeown, 2005; Oya et al., 2014; Banerjee et al., 2016;
Vilca and Cabezudo, 2017), yet the evaluation was
performed on the whole pipeline.