
ARXIVEDITS: Understanding the Human Revision Process in
Scientific Writing
Chao Jiang1, Wei Xu1, Samuel Stevens2∗
1School of Interactive Computing, Georgia Institute of Technology
2Department of Computer Science and Engineering, Ohio State University
chaojiang@gatech.edu wei.xu@cc.gatech.edu stevens.994@osu.edu
Abstract
Scientific publications are the primary means of communicating research discoveries, and their writing quality is of crucial importance. However, prior work studying the human editing process in this domain has mainly focused on the abstract or introduction sections, resulting in an incomplete picture. In this work, we provide a complete computational framework for studying text revision in scientific writing. We first introduce ARXIVEDITS, a new annotated corpus of 751 full papers from arXiv with gold sentence alignment across their multiple revised versions, as well as fine-grained span-level edits and their underlying intentions for 1,000 sentence pairs. This corpus supports data-driven analysis that unveils the common strategies researchers use to revise their papers. To scale up the analysis, we also develop automatic methods to extract revisions at the document, sentence, and word levels. A neural CRF sentence alignment model trained on our corpus achieves 93.8 F1, enabling reliable matching of sentences between different versions. We formulate the edit extraction task as a span alignment problem, and our proposed method extracts more fine-grained and explainable edits than the commonly used diff algorithm. An intention classifier trained on our dataset achieves 78.9 F1 on the fine-grained intent classification task. Our data and system are released at tiny.one/arxivedits.
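For reference, the diff baseline mentioned above can be sketched in a few lines using Python's standard difflib, which implements longest-matching-subsequence comparison. This is only an illustration of the baseline; the function name and example sentences are ours, and this is not the span alignment method proposed in the paper:

```python
import difflib

def diff_edits(old: str, new: str):
    """Extract word-level edit operations between two sentence versions
    using a standard diff algorithm (longest matching subsequences).
    Returns (op, old_span, new_span) tuples for the changed regions."""
    old_tokens, new_tokens = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only the spans that changed
            edits.append((op,
                          " ".join(old_tokens[i1:i2]),
                          " ".join(new_tokens[j1:j2])))
    return edits

# Hypothetical sentence pair for illustration
edits = diff_edits(
    "We propose a new method for alignment .",
    "We propose a novel neural method for sentence alignment .",
)
```

Note how diff yields flat replace/insert operations with no record of why each change was made, which motivates the more fine-grained, intention-labeled edits extracted by the span alignment formulation described above.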
1 Introduction
Writing is essential for sharing scientific findings.
Researchers devote substantial effort to revising their papers, whether to improve the writing quality or to incorporate new discoveries, and valuable knowledge is encoded in this revision process. As of January 1st, 2022, arXiv (https://arxiv.org/), an open-access e-print service, has archived over 1.9 million papers, among which more than 600k have multiple versions available. This makes arXiv a rich data source for studying text revision in scientific writing.
∗Work done as an undergraduate student.
Specifically, revisions between different versions of papers contain valuable information about
logical and structural improvements at the document level, as well as stylistic and grammatical refinements at the sentence and word levels. It can also support various natural language processing (NLP) applications, including writing quality assessment and error correction (Louis and Nenkova, 2013; Xue and Hwa, 2014; Daudaravicius et al., 2016; Bryant et al., 2019), text simplification and compression (Xu et al., 2015; Filippova et al., 2015), style transfer (Xu et al., 2012; Krishna et al., 2020), hedge detection (Medlock and Briscoe, 2007), and paraphrase generation (Dou et al., 2022).
In this paper, we present a complete solution for studying the human revision process in the scientific writing domain, including annotated data, analysis, and systems. We first construct ARXIVEDITS, which consists of 751 full arXiv papers with gold sentence alignment across their multiple revised versions, as shown in Figure 1. Our corpus spans 6 research areas, namely physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics, with papers published over 23 years (from 1996 to 2019). To the best of our knowledge, this is the first text revision corpus that covers full multi-page research papers. To study sentence-level revision, we manually annotated fine-grained edits and their underlying intentions, which reflect why the edits were made, for 1,000 sentence pairs, based on a taxonomy of 7 categories that we developed.
Our dataset addresses two major limitations in prior work. First, previous researchers mainly focused on the abstract (Gábor et al., 2018; Kang et al., 2018; Du et al., 2022) and introduction (Tan and Lee, 2014; Mita et al., 2022) sections, limiting the generalizability of their conclusions. In addition, a sentence-level revision may consist of multiple fine-grained edits made for different purposes (see
arXiv:2210.15067v2 [cs.CL] 31 Oct 2022