Revision for Concision: A Constrained Paraphrase Generation Task
Wenchuan Mu Kwan Hui Lim
Singapore University of Technology and Design
{wenchuan_mu,kwanhui_lim}@sutd.edu.sg
Abstract

Academic writing should be concise, as concise sentences better keep the readers' attention and convey meaning clearly. Writing concisely is challenging, for writers often struggle to revise their drafts. We introduce and formulate revising for concision as a natural language processing task at the sentence level. Revising for concision requires algorithms to use only necessary words to rewrite a sentence while preserving its meaning. The revised sentence should be evaluated according to its word choice, sentence structure, and organization. The revised sentence also needs to fulfil semantic retention and syntactic soundness. To aid these efforts, we curate and make available a benchmark parallel dataset that depicts revising for concision. The dataset contains 536 pairs of sentences before and after revising, and all pairs are collected from college writing centres. We also present and evaluate approaches to this problem, which may assist researchers in this area.
1 Introduction
Concision and clarity¹ are important in academic writing, as wordy sentences obscure good ideas (Figure 1). Concise writing encourages writers to choose words deliberately and precisely, construct sentences carefully to eliminate deadwood, and use grammar properly (Stanford University), which often requires experience and time. A first draft often contains far more words than necessary, and achieving concise writing requires revisions (MON, 2020). As far as we know, this revision process can currently only be done manually, or semi-manually with the help of some rule-based wordiness detectors (Adam and Long, 2013). We therefore introduce and formulate revising for concision as a natural language processing (NLP) task and address it.

¹We treat concision and conciseness as equivalent, and clarity as part of concision.
Wordy: As you carefully read what you have written to improve your wording and catch small errors of spelling, punctuation, and so on, the thing to do before you do anything else is to try to see where a series of words expressing action could replace the ideas found in nouns rather than verbs.
Concise: As you edit, first find nominalizations that you can replace with verb phrases.

Wordy: For example, in the field of image recognition, experimental results on some standard test sets indicate that the recognition capabilities of deep learning models can already reach the level of human intelligence.
Concise: For example, in the field of image recognition, test results show that deep learning models can already reach human intelligence.

Figure 1: Wordy sentences are more boring to read than concise sentences. But how do we turn lengthy sentences into concise ones? We show two examples. The first sentence pair is taken from the Purdue Writing Lab, which suggests how college students should succinctly revise their writing (PU). In the other example, the wordy sentence comes from a scientific paper (Chen et al., 2020), and its concise counterpart is predicted by the concise revisioner we developed (Section 5). In each pair, text with the same colour delivers the same information.
In this study, we make the following contributions:

1. We formulate the revising-for-concision NLP task at the sentence level, which reflects the revising task in academic writing. We also survey the differences between this task and sentence compression, paraphrasing, etc.

2. We release a corpus of 536 sentence pairs, curated from 72 writing centres and additionally coded with the various linguistic rules for concise sentence revision.

3. We propose a gloss-based Seq2Seq approach to this problem, and conduct automatic and human evaluations. We observed promising preliminary results, and we believe that our findings will be useful for researchers working in this area.
2 Problem Statement

2.1 Revision as an English Writing Task

Concise writing itself is a lesson that is often emphasized in colleges, and revision is crucial in writing. The following definitions are helpful when we set out to formulate the task.
Definition 2.1 (Concise). Marked by brevity of expression or statement: free from all elaboration and superfluous detail (Merriam-Webster).

Definition 2.2 (Concise writing, English). Writing that is clear and does not include unnecessary or vague/unclear words or language (UOA).
Revising for concision at the paragraph level, or even the article level, may be the best practice. However, sentence-level revising usually suffices, so we focus on revising for concision at the sentence level. Indeed, in many college academic writing tutorials, revisions for concision are for individual sentences, and this process is defined as follows.
Definition 2.3 (Revise for concision at the sentence level, English²). Study a sentence in draft, and use specific strategies³ to edit the sentence concisely without losing meaning.

²Adapted from notes of the PU Writing Lab and Rambo (2019).
³Presented in the Appendix (Table 4) as a periphery of this study.
If someone, such as a college student, wants to concisely revise a sentence, specific strategies (e.g., delete weak modifiers, replace phrasal verbs with single verbs, or rewrite in the active voice) tell us how to locate wordiness and how to edit it (PU; WU; UALR; UNZ; MON, 2020). The rule is to repeatedly detect wordiness and revise it until no wordiness is detected or it cannot be removed without adding new wordiness. The final product serves as a concise version of the original sentence, provided it does not lose its meaning. This iterative procedure is sketched below.
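As an illustration, the manual procedure can be read as a fixed-point loop. The following is a minimal sketch in Python; `detect_wordiness` and `apply_strategy` are hypothetical placeholders for a wordiness detector and a revision strategy, not components we provide.

```python
def revise_for_concision(sentence: str, max_rounds: int = 10) -> str:
    """Repeatedly detect and revise wordiness until none remains,
    or until no strategy can remove it without adding new wordiness.

    `detect_wordiness` and `apply_strategy` are hypothetical: the
    former returns spans judged wordy, the latter attempts a
    meaning-preserving rewrite of one span (or returns None).
    """
    for _ in range(max_rounds):
        spans = detect_wordiness(sentence)      # e.g. ["the field of"]
        if not spans:
            break                               # fixed point: nothing wordy left
        revised = apply_strategy(sentence, spans[0])
        if revised is None or revised == sentence:
            break                               # wordiness cannot be removed safely
        sentence = revised
    return sentence
```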
2.2 Task Definition in NLP

Now that we know how humans can revise a sentence, what about programs? Each strategy is clear to a trained college student, but not clear enough to program in code. On the one hand, existing verbosity detectors may suggest which part of a sentence is too "dense" (Adam and Long, 2013), but they fail to expose fine-grained wordiness details. On the other hand, how programs can edit sentences without losing their meaning remains challenging. In short, no existing program can generate well-revised sentences in terms of concision.
Eager for a program that revises sentences nicely and concisely, we set out to formulate this revision process as a sequence-to-sequence (Seq2Seq) NLP task. In this task, the input is any English sentence, and the output should be its concise version. We define it as follows.

Definition 2.4 (Revise for concision at the sentence level, NLP). Produce a sentence where minimum wordiness can be identified. (And,) the produced sentence delivers the same information as the input does. (And,) the produced sentence is syntactically correct.
As with many other NLP tasks, e.g., machine translation and named-entity recognition, Definition 2.4 describes the product (text) of a process, not the process itself, i.e., how the text is produced. This perspective differs from that of Definition 2.3. Among the three components in Definition 2.4, both the first and the third are clear and self-contained. They are related to syntax; hence, at least human experts would find it straightforward to judge the soundness of a sentence on both. For example, the syntactic correctness of an English sentence will not be judged differently by different experts, unless the syntax itself changes. Unfortunately, the second component is neither clear nor self-contained. This component asks for information retention, a rule inherited from Definition 2.3. Determining the semantic similarity between texts has long been challenging, even for human experts (Rus et al., 2014).

We then clarify the definition by assuming that the combination of the second and third components in Definition 2.4 meets the definition of the paraphrase generation task (Rus et al., 2014). Henceforth, Definition 2.4 can be simplified to Definition 2.5.
Definition 2.5 (Revise for concision at the sentence level, NLP, simplified). Produce a paraphrase where minimum wordiness can be identified.

The revising⁴ task is well-defined, as long as "paraphrase generation" is well-defined. It is a paraphrase generation task with a syntactic constraint.

⁴"Revising" stands for (machine) revising for concision if not otherwise specified; so does "revision".
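To make the Seq2Seq framing concrete, the following is a minimal sketch of revision as conditional generation with a generic pretrained BART model from Hugging Face Transformers. It illustrates only the input/output contract of Definition 2.5; it is not the gloss-based model of Section 5, and the off-the-shelf `facebook/bart-base` checkpoint used as a stand-in would first need fine-tuning on wordy-to-concise sentence pairs.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Assumption: a BART checkpoint fine-tuned on wordy -> concise pairs.
# "facebook/bart-base" is only a stand-in here.
model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

wordy = ("in this report I will conduct a study of ants "
         "and the setup of their colonies")

inputs = tokenizer(wordy, return_tensors="pt")
# Beam search decodes the (ideally concise) paraphrase.
output_ids = model.generate(**inputs, num_beams=4, max_length=64)
concise = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(concise)  # goal: e.g. "this report studies ants"
```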
2.3 Task Performance Indicator

How does one approximately measure revision performance? In principle, Definition 2.4 should be used as a checklist. A good sample requires correct grammar ($\gamma$), complete information ($\rho$), and reduced wordiness ($1-\omega$), assuming each component is a float number between 0 and 1. The overall assessment ($\chi$) of the three components is as follows,

$$\chi = \alpha^2 \cdot (\gamma - 1) + \alpha \cdot (\rho - 1) + (1 - \omega), \tag{1}$$

where $\alpha \in \mathbb{R}_{>1}$ is a large enough number, as we believe that $\gamma$ and $\rho$ outweigh $1-\omega$. Intuitively, if a revised sentence does not paraphrase the original one, assessing the reduction of wordiness makes little sense. Concision $\chi$ would always be negative if $\gamma < 1$ or $\rho < 1$.
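As a worked illustration of Equation (1), the snippet below computes $\chi$ for a hypothetical revision with $\alpha = 100$; the component scores are made-up numbers, not values from our evaluation.

```python
def concision_score(gamma: float, rho: float, omega: float,
                    alpha: float = 100.0) -> float:
    """Overall assessment chi from Equation (1).

    gamma: grammatical correctness in [0, 1]
    rho:   information retention in [0, 1]
    omega: wordiness in [0, 1] (so 1 - omega is the reduction term)
    alpha: weight > 1 that makes gamma and rho dominate
    """
    return alpha ** 2 * (gamma - 1) + alpha * (rho - 1) + (1 - omega)

# A perfect paraphrase with little wordiness left scores near 1.
print(concision_score(gamma=1.0, rho=1.0, omega=0.1))  # 0.9
# Any loss of meaning drives chi far below zero, as intended.
print(concision_score(gamma=1.0, rho=0.9, omega=0.1))  # -9.1
```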
Corresponding to the three components is a mix of three tasks: grammatical error correction for $\gamma$, textual semantic similarity for $\rho$, and wordiness detection for $\omega$. Unfortunately, both a reference-free metric good enough to characterize the paraphrase and a robust wordiness detector are rare. Therefore, such assessment of concision is currently only feasible through human evaluation.
To enable automatic evaluation for faster feedback, we currently follow Papineni's viewpoint (Papineni et al., 2002): the closer a machine revision is to a professional human revision, the better it is. To judge the quality of a machine revision, one measures its closeness to one or more reference human revisions according to a numerical metric. Thus, our revising evaluation system requires two main components:

1. A numerical "revision closeness" metric.

2. A corpus of good-quality human reference revisions.
Unlike in the days when Papineni needed to propose a closeness metric, we can now adopt various metrics from the machine translation and summarization communities (Lin, 2004; Banerjee and Lavie, 2005). Since it is uncertain which criterion correlates best, we take multiple relevant and reasonable metrics into account to estimate the quality of revision. These metrics include those measuring higher-order n-gram precision (BLEU, Papineni et al., 2002), explicit word-matching, stem-matching, or synonym-matching (METEOR, Banerjee and Lavie, 2005), surface bigram unit overlap (ROUGE-2-F1, Lin, 2004), cosine similarity between matched contextual word embeddings (BERTScore-F1, Zhang et al., 2020b), edit distance with single-word insertion, deletion, or replacement (word error rate, Su et al., 1992), edit distance with block insertion, deletion, or replacement (translation edit rate, Snover et al., 2006), and explicit goodness of word editing against reference and source (SARI, Xu et al., 2016). In short, BLEU, METEOR, ROUGE-2-F1, SARI, word error rate, and translation edit rate estimate sentence well-formedness lexically; METEOR and BERTScore-F1 consider semantic equivalence. Comparing grammatical relations found in a prediction with those found in references can also measure semantic similarity (Clarke and Lapata, 2006b; Riezler et al., 2003; Toutanova et al., 2016). Grammatical relations are extracted from dependency parsing, and F1 scores can then be used to measure overlap, as sketched below.
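The following is a minimal sketch of grammatical-relation overlap, assuming spaCy with its `en_core_web_sm` model: each sentence is reduced to a set of (head, relation, dependent) triples from its dependency parse, and F1 is computed between prediction and reference. It is an illustrative simplification, not the exact protocol of the cited works.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def relations(sentence: str) -> set:
    """Set of (head lemma, dependency label, dependent lemma) triples."""
    doc = nlp(sentence)
    return {(tok.head.lemma_, tok.dep_, tok.lemma_) for tok in doc}

def relation_f1(prediction: str, reference: str) -> float:
    """F1 overlap between the grammatical relations of two sentences."""
    pred, ref = relations(prediction), relations(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(relation_f1("this report studies ants",
                  "in this report I will study ants and their colonies"))
```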
In contrast, the lack of a good parallel corpus impedes (machine) revising for concision. To address this limitation, we curate and make available such a corpus as a benchmark. Each sample in the corpus contains a wordy sentence and at least one sentence revised for concision. Samples are from the English writing centres of 57 universities, ten colleges, four community colleges, and a postgraduate school.
3 Related Work

Manual revision operations include deleting, replacing, and rewriting. Intuitively, a revising program should do similar jobs, too. In fact, these actions are implemented individually in various NLP tasks. For example, sentence compression requires programs to delete unnecessary words, and paraphrasing itself is a matter of replacement. Machine revision for concision could also share traits with them. Practically, when a neural model learns in a Seq2Seq manner, the difference among these tasks is the parallel dataset. We are also interested in whether programs developed for these tasks can work for machine revision.
3.1 Deleting as in Sentence Compression

When revising, deleting redundant words is common. For example, we can revise "research is increasing in the field of nutrition and food science" to "research is increasing in nutrition and food science" (URI, 2019), simply by deleting "the field of". Deleting is canonical in sentence compression, a task aiming to reduce sentence length while retaining the basic meaning of the source sentence (Jing, 2000; Knight and Marcu, 2000; McDonald, 2006). For example, the compression task has been formulated as integer linear programming optimization using syntactic trees (Clarke and Lapata, 2006a), or as a sequence labelling optimization problem using recurrent neural networks (RNNs) (Filippova et al., 2015; Klerke et al., 2016; Kamigaito et al., 2018). These methods explicitly or implicitly use dependency grammar. Pre-trained language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) can encode features beyond dependency parsing (Kamigaito and Okumura, 2020), bringing predictions closer to reference sentences. A sketch of the sequence-labelling formulation follows.
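As a minimal sketch of compression as sequence labelling, the snippet below applies per-token keep/delete tags to the example above; the binary labels are hard-coded for illustration, whereas the cited systems predict them with an RNN or a pre-trained encoder.

```python
tokens = ["research", "is", "increasing", "in",
          "the", "field", "of", "nutrition", "and", "food", "science"]

# KEEP = 1, DELETE = 0. In the cited work these labels come from a
# trained tagger; here they are hard-coded for illustration.
labels = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]

compressed = " ".join(t for t, keep in zip(tokens, labels) if keep)
print(compressed)  # research is increasing in nutrition and food science
```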
All these methods rely on parallel datasets labelling the parts to be deleted. However, the deleted part in sentence compression differs from that in revision. Filippova and Altun (2013) created the Google dataset from the titles and first sentences of news articles. The information retained in the first sentence depends on the title. While this construction is useful for reducing excessive information, the deleted part is probably not wordiness.
Deleting does not solve everything in revision. We can revise "in this report I will conduct a study of ants and the setup of their colonies" to "in this report I will study ants and their colonies", taking advantage of the noun-and-verb homograph. However, a more concise version, "this report studies ants" (Commnet), requires changing "study" to the third-person singular.
3.2 Replacing as in Paraphrase Generation

Word choice matters as well, so we also revise by paraphrasing with stronger words. Paraphrase generation changes a sentence grammatically and re-selects words, while retaining meaning. Paraphrasing matters in academic writing, for it helps avoid plagiarism. Rule-based or statistical machine paraphrasing substitutes words by finding synonyms in lexical databases and decodes syntax according to template sentences. This rigid method may undermine creativity (Bui et al., 2021). Pre-trained neural language models like GPT (Radford et al., 2019) or BART (Lewis et al., 2020) paraphrase more accurately (Hegde and Patil, 2020). Through paraphrasing, we can replace the verb phrase "conduct a study" with the verb "study" in the example above, rather than delete words and rely on noun-and-verb homographs to keep the sentence syntactically correct.

Machine revision is a kind of paraphrase generation, but the converse is not true: current paraphrase generation does not require concision in generated sentences. Automatically annotated datasets for paraphrasing include ParaNMT (Wieting and Gimpel, 2018), Twitter (Lan et al., 2017), and repurposed noisy datasets such as MSCOCO (Lin et al., 2014) and WikiAnswers (Fader et al., 2013). We may adapt paraphrase parallel datasets to train revising models, as investigated in Section 5.
3.3 Other Related Tasks

Summarization produces a shorter text from one or several documents, while retaining most of their meaning (Paulus et al., 2018). This is similar to sentence compression. In practice, summarization welcomes novel words, allows specifying output length (Kikuchi et al., 2016), and removes much more information than sentence compression does. Datasets include XSum (Narayan et al., 2018), CNN/DM (Hermann et al., 2015), WikiHow (Koupaee and Wang, 2018), NYT (Sandhaus, 2008), DUC-2004 (Over et al., 2007), and Gigaword (Rush et al., 2015), where summaries are generally shorter than one-tenth of the documents. On the other hand, sentence summarization (Chopra et al., 2016) applies summarization methods to sentence compression datasets, retaining more information and possibly generating new words.
Text simplification modifies vocabulary and syntax for easier reading, while retaining approximate meaning (Omelianchuk et al., 2021). Hand-crafted syntactic rules (Siddharthan, 2006; Carroll et al., 1999; Chandrasekar et al., 1996) and simplification driven by aligned sentences (Yatskar et al., 2010) have been explored. Corpora such as Turk (Xu et al., 2016) and PWKP (Zhu et al., 2010) are compiled from Wikipedia and Simple English Wikipedia (Coster and Kauchak, 2011). Rules for simplification may deviate from those for revision; e.g., text simplification sometimes encourages prepositional phrases (Xu et al., 2016). Still, adapting these approaches may benefit academic revising for concision.
Fluency editing (Napoles et al., 2017) not only corrects grammatical errors but also paraphrases text to sound more native. Its paraphrasing component is constrained such that outputs represent a higher level of English proficiency than inputs. As a constrained paraphrase task, fluency editing may