ARXIVEDITS: Understanding the Human Revision Process in Scientific Writing

Chao Jiang1, Wei Xu1, Samuel Stevens2*
1School of Interactive Computing, Georgia Institute of Technology
2Department of Computer Science and Engineering, Ohio State University
chaojiang@gatech.edu  wei.xu@cc.gatech.edu  stevens.994@osu.edu
* Work done as an undergraduate student.
Abstract

Scientific publications are the primary means to communicate research discoveries, where the writing quality is of crucial importance. However, prior work studying the human editing process in this domain mainly focused on the abstract or introduction sections, resulting in an incomplete picture. In this work, we provide a complete computational framework for studying text revision in scientific writing. We first introduce ARXIVEDITS, a new annotated corpus of 751 full papers from arXiv with gold sentence alignment across their multiple versions of revision, as well as fine-grained span-level edits and their underlying intentions for 1,000 sentence pairs. It supports our data-driven analysis to unveil the common strategies practiced by researchers for revising their papers. To scale up the analysis, we also develop automatic methods to extract revisions at the document, sentence, and word levels. A neural CRF sentence alignment model trained on our corpus achieves 93.8 F1, enabling the reliable matching of sentences between different versions. We formulate the edit extraction task as a span alignment problem, and our proposed method extracts more fine-grained and explainable edits, compared to the commonly used diff algorithm. An intention classifier trained on our dataset achieves 78.9 F1 on the fine-grained intent classification task. Our data and system are released at tiny.one/arxivedits.
1 Introduction
Writing is essential for sharing scientific findings. Researchers devote a huge amount of effort to revising their papers, improving the writing quality or updating new discoveries, and valuable knowledge is encoded in this revision process. As of January 1st, 2022, arXiv (https://arxiv.org/), an open-access e-print service, has archived over 1.9 million papers, among which more than 600k papers have multiple versions available. This provides a rich data source for studying text revision in scientific writing. Specifically, revisions between different versions of papers contain valuable information about logical and structural improvements at the document level, as well as stylistic and grammatical refinements at the sentence and word levels. This data can also support various natural language processing (NLP) applications, including writing quality assessment and error correction (Louis and Nenkova, 2013; Xue and Hwa, 2014; Daudaravicius et al., 2016; Bryant et al., 2019), text simplification and compression (Xu et al., 2015; Filippova et al., 2015), style transfer (Xu et al., 2012; Krishna et al., 2020), hedge detection (Medlock and Briscoe, 2007), and paraphrase generation (Dou et al., 2022).
In this paper, we present a complete solution for studying the human revision process in the scientific writing domain, including annotated data, analysis, and system. We first construct ARXIVEDITS, which consists of 751 full arXiv papers with gold sentence alignment across their multiple versions of revisions, as shown in Figure 1. Our corpus spans 6 research areas, namely physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics, published over 23 years (from 1996 to 2019). To the best of our knowledge, this is the first text revision corpus that covers full multi-page research papers. To study sentence-level revision, we manually annotated fine-grained edits and their underlying intentions, which reflect why the edits are being made, for 1,000 sentence pairs, based on a taxonomy that we developed consisting of 7 categories.

Our dataset addresses two major limitations in prior work. First, previous researchers mainly focus on the abstract (Gábor et al., 2018; Kang et al., 2018; Du et al., 2022) and introduction (Tan and Lee, 2014; Mita et al., 2022) sections, limiting the generalizability of their conclusions. Second, a sentence-level revision may consist of multiple fine-grained edits made for different purposes (see an example in Figure 1).
[Figure 1 shows an aligned paragraph pair from an early draft and the final version, with document-level operations (revise, split & revise, deletion, insertion) marked, and a sentence pair annotated with span-level edits and their intentions (Improve Style, More Accurate, Simplify, Fix Typo, Adjust Format, Update Content).]

Figure 1: Our ARXIVEDITS corpus consists of both document-level revision (top) and sentence-level revision with intention (bottom). The top part shows an aligned paragraph pair from the original and revised papers, where s1 and t1 denote the corresponding sentences. For sentence-level revision, the fine-grained edits and each of their intentions are manually annotated.
In contrast, previous work either concentrates on the change of a single word or phrase (Faruqui et al., 2018; Pryzant et al., 2020) or extracts edits using the diff algorithm (Myers, 1986), which minimizes edit distance without regard to semantic meaning. As a result, the extracted edits are coarse-grained, and the intentions annotated on top of them can be ambiguous.
Enabled by our high-quality annotated corpus, we perform a series of data-driven studies to answer the question: what common strategies do authors use to improve the writing of their papers? We also provide a pipeline system with 3 modules to automatically extract and analyze revisions at all levels. (1) A neural sentence alignment model trained on our data achieves 93.8 F1. It can be reliably used to extract parallel corpora for text-to-text generation tasks. (2) Within a revised sentence pair, edit extraction is formulated as a span alignment task, and our method extracts more fine-grained and explainable edits than the diff algorithm. (3) An intention classifier trained on our corpus achieves 78.9 F1 on the fine-grained classification task, enabling us to scale up the analysis by automatically extracting and classifying span-level edits from unlabeled revision data. We hope our work will inspire other researchers to further study the task of text revision in academic writing.
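To make the pipeline concrete, here is a minimal sketch of how the three modules might compose end to end. The function names and signatures are ours for illustration, not the released system's API, and the bodies are placeholders.

```python
# A minimal sketch of the three-module pipeline; names are hypothetical.

def align_sentences(old_doc: list[str], new_doc: list[str]) -> list[tuple[int, int]]:
    """Module 1: neural CRF sentence alignment (returns aligned index pairs)."""
    raise NotImplementedError

def extract_edits(src: str, tgt: str) -> list[dict]:
    """Module 2: edit extraction, formulated as span alignment."""
    raise NotImplementedError

def classify_intention(edit: dict) -> str:
    """Module 3: fine-grained intention classification."""
    raise NotImplementedError

def analyze_revision(old_doc: list[str], new_doc: list[str]) -> list[dict]:
    """Run all three modules over one revision (a pair of paper versions)."""
    results = []
    for i, j in align_sentences(old_doc, new_doc):
        for edit in extract_edits(old_doc[i], new_doc[j]):
            edit["intention"] = classify_intention(edit)
            results.append(edit)
    return results
```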
2 Constructing ARXIVEDITS Corpus
In this section, we present the detailed procedure for constructing the ARXIVEDITS corpus. After posting preprints on arXiv, researchers can continually update their submissions, and these updates constitute the revisions. More specifically, a revision denotes two adjacent versions of the same paper.[1] An article group refers to all versions of a paper on arXiv (e.g., v1, v2, v3, v4). In this work, we refer to changes applied to tokens or phrases within one sentence as sentence-level revision. Document-level revision refers to the change of one or several entire sentences, and changes to paragraphs can be derived from the sentences. Table 1 presents the statistics of document-level revision in our corpus. After constructing this manually annotated corpus, we use it to train the 3 modules of our automatic system, as detailed in §4.
2.1 Data Collection and Preprocessing
We first collect metadata for all 1.6 million papers posted on arXiv between March 1996 and December 2019. We then randomly select 1,000 article groups from the 600k papers that have more than one version available. To extract plain text from the LaTeX source code of these papers, we improved the open-source OpenDetex[2] package to better handle macros, user-defined commands, and additional LaTeX files imported by the input commands in the main file.[3] We find this method is less error-prone for extracting plain text than other libraries such as Pandoc,[4] which is used in (Cohan et al., 2018; Roush and Balaji, 2020). Among the randomly selected 1,000 article groups, we obtained plain text for 751 complete groups, with a total of 1,790 versions of papers, that came with the original LaTeX source code and contained text content that was understandable without an overwhelming number of math equations. A breakdown of the filtered groups is provided in Appendix A.

[1] For example, the paper titled "Attention Is All You Need" (https://arxiv.org/abs/1706.03762) has five versions on arXiv submitted by the authors, constituting four revisions (v1-v2, v2-v3, v3-v4, v4-v5).
[2] https://github.com/pkubowicz/opendetex
[3] Our code is released at https://tiny.one/arxivedits
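For illustration, a minimal stand-in for this collection step might fetch version metadata from arXiv's public Atom API and strip LaTeX by shelling out to a detex executable. This is a sketch under those assumptions, not the released collection code (see footnote 3); the paper does not specify which metadata interface was used.

```python
import subprocess
import urllib.request
import xml.etree.ElementTree as ET

def fetch_metadata(arxiv_id: str) -> ET.Element:
    # arXiv's public Atom API; whether this interface was used for the
    # actual crawl is an assumption made for illustration.
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

def latex_to_text(tex_path: str) -> str:
    # Shell out to a detex-style executable (assumed to be on PATH).
    # The improved OpenDetex additionally handles macros, user-defined
    # commands, and files pulled in via \input.
    result = subprocess.run(["detex", tex_path],
                            capture_output=True, text=True, check=True)
    return result.stdout
```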
2.2 Paragraph and Sentence Alignment
Sentence alignment can capture all document-level revision operations, including the insertion, deletion, rephrasing, splitting, merging, and reordering of sentences and paragraphs (see Figure 1 for an example). Therefore, we propose the following 2-step annotation method to manually align sentences across the 1,039 adjacent version pairs (e.g., v0-v1, v1-v2) from the 751 selected article groups; the alignments between non-adjacent version pairs (e.g., v0-v2) can then be derived automatically.
1. Align paragraphs using a lightweight alignment algorithm that we designed based on Jaccard similarity (Jaccard, 1912) (more details in Appendix B; see also the sketch after this list). Based on a pilot study on 18 article pairs, it covers 92.1% of non-identical aligned sentence pairs. Aligning paragraphs first significantly reduces the number of sentence pairs that need to be annotated.

2. Collect sentence alignment annotations for every possible pair of sentences in the aligned paragraphs using Figure-Eight,[5] a crowdsourcing platform. We ask 5 annotators to classify each pair into one of the following categories: aligned, partially-aligned, or not-aligned. Annotators are required to spend at least 25 seconds on each question. The annotation instructions and interface can be found in Appendix D. We embed one hidden test question in every five questions, and workers need to maintain an accuracy above 85% on the test questions to continue working on the task.
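As referenced in step 1 above, the core of the paragraph aligner is a Jaccard similarity over word sets. The sketch below shows the matching idea only; the greedy strategy and the 0.5 threshold are illustrative assumptions, and the actual algorithm is described in Appendix B.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa or wb) else 0.0

def align_paragraphs(old_paras: list[str], new_paras: list[str],
                     threshold: float = 0.5) -> list[tuple[int, int]]:
    """Greedily pair each old paragraph with its most similar new one.
    The threshold value here is hypothetical, not the authors' setting."""
    pairs = []
    for i, p in enumerate(old_paras):
        best_j, best_score = -1, 0.0
        for j, q in enumerate(new_paras):
            score = jaccard(p, q)
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= threshold:
            pairs.append((i, best_j))
    return pairs
```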
We skip aligning the 5.8% of sentences that contain too few words or too many special tokens. They are still retained in the dataset for completeness and are marked with a special token. More details about the annotation process are in Appendices A and B. In total, we spent $3,776 to annotate 13,008 sentence pairs from 751 article groups.
[4] https://pandoc.org/
[5] https://www.figure-eight.com/
Operation at Document-level    | Count
# of sent. insertion (0-to-1)  | 25,229
# of sent. deletion (1-to-0)   | 17,315
# of sent. rephrasing (1-to-1) | 17,755
# of sent. splitting (1-to-n)  | 378
# of sent. merging (n-to-1)    | 269
# of sent. fusion (m-to-n)     | 142
# of sent. copying (1-to-1)    | 95,110

Table 1: Statistics of document-level revision in our ARXIVEDITS corpus, based on manually annotated sentence alignment.
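The operation categories in Table 1 follow directly from the cardinality of each alignment group. A sketch of this bookkeeping, assuming each group is given as the sets of aligned source and target sentence indices plus a flag marking unchanged text:

```python
def operation_type(src_ids: set[int], tgt_ids: set[int], identical: bool) -> str:
    """Classify a document-level operation from alignment cardinality,
    mirroring the categories of Table 1."""
    if not src_ids:
        return "insertion"        # 0-to-1
    if not tgt_ids:
        return "deletion"         # 1-to-0
    if len(src_ids) == 1 and len(tgt_ids) == 1:
        return "copying" if identical else "rephrasing"  # 1-to-1
    if len(src_ids) == 1:
        return "splitting"        # 1-to-n
    if len(tgt_ids) == 1:
        return "merging"          # n-to-1
    return "fusion"               # m-to-n
```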
The 751 article groups are split 526/75/150 into train/dev/test sets for the automatic sentence alignment experiments in §4. The inter-annotator agreement is 0.614, measured by Cohen's kappa (Artstein and Poesio, 2008). To verify the crowdsourcing annotation quality, an in-house annotator manually aligned sentences for 10 randomly sampled groups comprising 14 article pairs. Taking the in-house annotation as gold, the majority vote of the crowdsourced annotations achieves an F1 of 94.2 on these 10 paper groups.
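For reference, the majority vote and agreement statistics can be computed as sketched below, assuming every sentence pair carries five labels. Cohen's kappa is defined for two raters; averaging over annotator pairs, as done here, is one common choice, and the paper does not spell out its exact aggregation.

```python
from collections import Counter
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def majority_vote(labels_per_pair: list[list[str]]) -> list[str]:
    """Reduce the five crowd labels per sentence pair to a single label."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_pair]

def mean_pairwise_kappa(annotators: list[list[str]]) -> float:
    """Average Cohen's kappa over all pairs of annotators, where each
    annotator contributes one label per item in the same order."""
    kappas = [cohen_kappa_score(a, b) for a, b in combinations(annotators, 2)]
    return sum(kappas) / len(kappas)
```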
2.3 Fine-grained Edits with Varied Intentions
Sentence-level revision involves the insertion, deletion, substitution, and reordering of words and phrases. Multiple edits may be tangled together in one sentence, with each edit made for a different purpose (see an example in Figure 1). Correctly detecting and classifying these edits is a challenging problem. We first introduce the formal definition of edits and our proposed intention taxonomy, followed by the annotation procedure.
Definition of Span-level Edits. A sentence-level revision R consists of the original sentence s, the target sentence t, and a series of fine-grained edits e_i. Each edit e_i is defined as a tuple (s_{a:b}, t_{c:d}, I), indicating that span [s_a, s_{a+1}, ..., s_b] in the original sentence is transformed into span [t_c, t_{c+1}, ..., t_d] in the target sentence, with an intention label I ∈ 𝓘 (defined in Table 2). The type of edit can be recognized from the spans s_{a:b} and t_{c:d}: s_{a:b} = [NULL] indicates insertion, t_{c:d} = [NULL] deletion, s_{a:b} = t_{c:d} reordering, and s_{a:b} ≠ t_{c:d} substitution.
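This definition maps directly onto a small data structure. Below is a sketch with field and method names of our own choosing; each span is represented by its tokens so that reordering (s_{a:b} = t_{c:d}) can be checked by simple equality.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    src: tuple[str, ...]   # tokens s_a ... s_b; the empty tuple encodes [NULL]
    tgt: tuple[str, ...]   # tokens t_c ... t_d; the empty tuple encodes [NULL]
    intention: str         # label I from the taxonomy in Table 2

    @property
    def kind(self) -> str:
        if not self.src:
            return "insertion"     # s_{a:b} = [NULL]
        if not self.tgt:
            return "deletion"      # t_{c:d} = [NULL]
        if self.src == self.tgt:
            return "reordering"    # same tokens appear at a new position
        return "substitution"      # s_{a:b} != t_{c:d}

# For example (intention label chosen for illustration only):
# Edit(("correspondence", "with"), ("corresponding", "to"),
#      "Improve Language").kind evaluates to "substitution".
```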
Edit Intention Taxonomy. We propose a new taxonomy to comprehensively capture the intentions of text revision in the scientific writing domain, as shown in Table 2. Each edit is classified into one of the following categories: Improve Language, Correct Grammar/Typo, Update Content, and Adjust Format.
Intention Label | Definition | Example | %
Improve Language | | | 28.6%
- More Accurate/Specific | Minor adjustment to improve the accuracy or specificity of the description. | Further, we suggest a relativistic-invariant protocol for quantum information processing communication. | 11.5%
- Improve Style | Make the text sound more professional or coherent without altering the meaning. | ... due to hydrodynamic interactions among cells in addition with besides self-generated force ... | 8.7%
- Simplify | Simplify complex concepts or delete redundant content to improve readability. | These include new transceiver architecture ( TXRU array connected architecture ) ... | 7.6%
- Other | Other language improvements that do not fall into the above categories. | ... due to changes in fuels used , or , in other words , associated to changes of technologies . | 0.8%
Correct Grammar/Typo | Fix grammatical errors, correct typos, or smooth out grammar needed by other changes. | Not Note that the investigator might reconstruct each function ... | 25.4%
Update Content | Update a large amount of scientific content, or add or delete a major fact. | ... characterized by long range hydrodynamic term and self-generated force due to actin remodeling. | 28.8%
Adjust Format | Adjust tables, figures, equations, references, citations, punctuation, etc. | Similarly to what we did in Figure Fig. [REF] , the statistical results obtained by means of ... | 17.2%

Table 2: A taxonomy (𝓘) of edit intentions in scientific writing revisions. In each example, text with red background denotes the edit; spans with strike-through mark deleted content, while otherwise the span is inserted.
Since our goal is to improve writing quality, we further break the Improve Language type into four fine-grained categories. During the design, we extensively consulted prior literature on text revision (Faigley and Witte, 1981; Fitzgerald, 1987; Daxenberger, 2016), edit categorization (Bronner and Monz, 2012; Yang et al., 2017), and analyses in related areas such as Wikipedia (Daxenberger and Gurevych, 2013) and argumentative essays (Zhang et al., 2017). The taxonomy was improved over several rounds based on feedback from four NLP researchers and two in-house annotators with linguistic backgrounds.
Annotating Edits. In a pilot study, we found that directly annotating fine-grained edits is a tedious and complicated task for annotators, as it requires separating and matching edited spans across two sentences. To assist the annotators, we use monolingual word alignment (Lan et al., 2021), which can find correspondences between words and phrases with similar meanings in two sentences, as an intermediate step to reduce the cognitive load during annotation. We find that, compared to strict word-to-word matching, edits usually have larger granularity and may cross linguistic boundaries. For example, in Figure 1, "corresponding to" and "correspondence with" should be treated as a whole to be meaningful and labeled with an intention. Therefore, the edits can be annotated by adjusting the boundaries of the span alignment. We propose the following 2-step method that leverages word alignment to assist the annotation of edits:
1. Collect word alignment annotations by asking in-house annotators to manually correct the automatic word alignments generated by the neural semi-CRF word alignment model (Lan et al., 2021). The aligner is trained on the MTRef dataset and achieves state-of-the-art performance on the monolingual word alignment task with 92.4 F1.

2. Annotate edits by having in-house annotators inspect and correct the fine-grained edits that are extracted from the word alignments using simple heuristics (detailed in §4.1; a simplified sketch follows this list). Two principles are followed during the correction: (1) each edit should have a clear intention and relatively clear phrase boundaries; (2) span pairs in a substitution should be semantically related, and otherwise should be treated as a separate insertion and deletion.
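The sketch below illustrates the general idea behind these heuristics in simplified, token-level form: unaligned source tokens become deletions, unaligned target tokens become insertions, and aligned but non-identical tokens become substitutions. The actual heuristics in §4.1 additionally merge adjacent alignment links into phrase-level spans.

```python
def edits_from_alignment(src: list[str], tgt: list[str],
                         align: set[tuple[int, int]]) -> list[tuple]:
    """src, tgt: token lists; align: word-alignment links (i, j).
    A simplified stand-in for the heuristics of Section 4.1."""
    src_aligned = {i for i, _ in align}
    tgt_aligned = {j for _, j in align}
    edits = []
    for i, tok in enumerate(src):          # unaligned source -> deletion
        if i not in src_aligned:
            edits.append(("deletion", tok, None))
    for j, tok in enumerate(tgt):          # unaligned target -> insertion
        if j not in tgt_aligned:
            edits.append(("insertion", None, tok))
    for i, j in sorted(align):             # aligned but changed -> substitution
        if src[i] != tgt[j]:
            edits.append(("substitution", src[i], tgt[j]))
    return edits
```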
We manually annotate insertions, deletions, and substitutions, and derive reordering automatically, since it can be reliably found by heuristics. Due to slight variance in granularity, it is possible that more than one answer is acceptable. Therefore, our annotation includes all alternative edits for sentence pairs in the dev and test sets, among which 16% have more than one answer.
Overall, we found that our method annotates more accurate and fine-grained edits compared to prior work that uses the diff algorithm. The diff method minimizes edit distance regardless of semantic meaning; as a result, the extracted edits are coarse-grained and may contain many errors (detailed in Table 3).
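For contrast, the coarse-grained behavior can be reproduced with Python's standard difflib, which, like diff, aligns purely on surface matching blocks with no notion of semantic relatedness (a rough stand-in for the Myers algorithm, run here on an invented example):

```python
from difflib import SequenceMatcher

src = "the model achieve good result on this dataset".split()
tgt = "the model achieves strong results on this dataset".split()

for tag, i1, i2, j1, j2 in SequenceMatcher(a=src, b=tgt).get_opcodes():
    if tag != "equal":
        print(tag, src[i1:i2], "->", tgt[j1:j2])
# Prints a single coarse block:
#   replace ['achieve', 'good', 'result'] -> ['achieves', 'strong', 'results']
# conflating a grammar fix with two separate word-choice edits.
```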
Annotating Intention. As intentions can differ subtly, correctly identifying them is a challenging task.