ARXIVEDITS: Understanding the Human Revision Process in Scientific Writing

Chao Jiang1, Wei Xu1, Samuel Stevens2*
1School of Interactive Computing, Georgia Institute of Technology
2Department of Computer Science and Engineering, Ohio State University
chaojiang@gatech.edu  wei.xu@cc.gatech.edu  stevens.994@osu.edu
* Work done as an undergraduate student.
Abstract

Scientific publications are the primary means to communicate research discoveries, where the writing quality is of crucial importance. However, prior work studying the human editing process in this domain mainly focused on the abstract or introduction sections, resulting in an incomplete picture. In this work, we provide a complete computational framework for studying text revision in scientific writing. We first introduce ARXIVEDITS, a new annotated corpus of 751 full papers from arXiv with gold sentence alignment across their multiple versions of revision, as well as fine-grained span-level edits and their underlying intentions for 1,000 sentence pairs. It supports our data-driven analysis to unveil the common strategies practiced by researchers for revising their papers. To scale up the analysis, we also develop automatic methods to extract revisions at the document, sentence, and word levels. A neural CRF sentence alignment model trained on our corpus achieves 93.8 F1, enabling the reliable matching of sentences between different versions. We formulate the edit extraction task as a span alignment problem, and our proposed method extracts more fine-grained and explainable edits, compared to the commonly used diff algorithm. An intention classifier trained on our dataset achieves 78.9 F1 on the fine-grained intent classification task. Our data and system are released at tiny.one/arxivedits.
1 Introduction
Writing is essential for sharing scientific findings. Researchers devote a huge amount of effort to revising their papers, improving the writing quality or updating new discoveries, and valuable knowledge is encoded in this revision process. As of January 1st, 2022, arXiv (https://arxiv.org/), an open-access e-print service, has archived over 1.9 million papers, among which more than 600k papers have multiple versions available. This provides a rich data source for studying text revision in scientific writing. Specifically, revisions between different versions of papers contain valuable information about logical and structural improvements at the document level, as well as stylistic and grammatical refinements at the sentence and word levels. This data can also support various natural language processing (NLP) applications, including writing quality assessment and error correction (Louis and Nenkova, 2013; Xue and Hwa, 2014; Daudaravicius et al., 2016; Bryant et al., 2019), text simplification and compression (Xu et al., 2015; Filippova et al., 2015), style transfer (Xu et al., 2012; Krishna et al., 2020), hedge detection (Medlock and Briscoe, 2007), and paraphrase generation (Dou et al., 2022).
In this paper, we present a complete solution for studying the human revision process in the scientific writing domain, including annotated data, analysis, and system. We first construct ARXIVEDITS, which consists of 751 full arXiv papers with gold sentence alignment across their multiple versions of revisions, as shown in Figure 1. Our corpus spans 6 research areas, namely physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics, published over 23 years (from 1996 to 2019). To the best of our knowledge, this is the first text revision corpus that covers full multi-page research papers. To study sentence-level revision, we manually annotated fine-grained edits and their underlying intentions, which reflect why the edits are being made, for 1,000 sentence pairs, based on a taxonomy that we developed consisting of 7 categories.

Our dataset addresses two major limitations in prior work. First, previous researchers mainly focus on the abstract (Gábor et al., 2018; Kang et al., 2018; Du et al., 2022) and introduction (Tan and Lee, 2014; Mita et al., 2022) sections, limiting the generalizability of their conclusions. Second, a sentence-level revision may consist of multiple fine-grained edits made for different purposes (see an example in Figure 1).
[Figure 1 shows an aligned paragraph pair from an early draft and the final version, with document-level operations (revise, split & revise, deletion, insertion) marked, and a sentence pair annotated with span-level edits and their intentions (Improve Style, More Accurate, Simplify, Fix Typo, Adjust Format, Update Content).]

Figure 1: Our ARXIVEDITS corpus consists of both document-level revision (top) and sentence-level revision with intention (bottom). The top part shows an aligned paragraph pair from the original and revised papers, where s1 and t1 denote the corresponding sentences. For sentence-level revision, the fine-grained edits and each of their intentions are manually annotated.
In contrast, previous work either concentrates on the change of a single word or phrase (Faruqui et al., 2018; Pryzant et al., 2020) or extracts edits using the diff algorithm (Myers, 1986), which minimizes edit distance without regard to semantic meaning. As a result, the extracted edits are coarse-grained, and the intentions annotated on top of them can be ambiguous.
Enabled by our high-quality annotated corpus, we perform a series of data-driven studies to answer the question: what common strategies do authors use to improve the writing of their papers? We also provide a pipeline system with 3 modules to automatically extract and analyze revisions at all levels. (1) A neural sentence alignment model trained on our data achieves 93.8 F1. It can be reliably used to extract parallel corpora for text-to-text generation tasks. (2) Within a revised sentence pair, edit extraction is formulated as a span alignment task, and our method extracts more fine-grained and explainable edits than the diff algorithm. (3) An intention classifier trained on our corpus achieves 78.9 F1 on the fine-grained classification task, enabling us to scale up the analysis by automatically extracting and classifying span-level edits from unlabeled revision data. We hope our work will inspire other researchers to further study the task of text revision in academic writing.
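To make the pipeline concrete, here is a minimal sketch of how the three modules might compose end to end. The function names and signatures are ours for illustration, not the released system's API, and the bodies are placeholders.

```python
# A minimal sketch of the three-module pipeline; names are hypothetical.

def align_sentences(old_doc: list[str], new_doc: list[str]) -> list[tuple[int, int]]:
    """Module 1: neural CRF sentence alignment (returns aligned index pairs)."""
    raise NotImplementedError

def extract_edits(src: str, tgt: str) -> list[dict]:
    """Module 2: edit extraction, formulated as span alignment."""
    raise NotImplementedError

def classify_intention(edit: dict) -> str:
    """Module 3: fine-grained intention classification."""
    raise NotImplementedError

def analyze_revision(old_doc: list[str], new_doc: list[str]) -> list[dict]:
    """Run all three modules over one revision (a pair of paper versions)."""
    results = []
    for i, j in align_sentences(old_doc, new_doc):
        for edit in extract_edits(old_doc[i], new_doc[j]):
            edit["intention"] = classify_intention(edit)
            results.append(edit)
    return results
```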
2 Constructing ARXIVEDITS Corpus
In this section, we present the detailed procedure for constructing the ARXIVEDITS corpus. After posting preprints on arXiv, researchers can continually update their submissions, and these updates constitute the revisions. More specifically, a revision denotes two adjacent versions of the same paper.[1] An article group refers to all versions of a paper on arXiv (e.g., v1, v2, v3, v4). In this work, we refer to changes applied to tokens or phrases within one sentence as sentence-level revision. Document-level revision refers to the change of one or several entire sentences, and changes to paragraphs can be derived from the sentences. Table 1 presents the statistics of document-level revision in our corpus. After constructing this manually annotated corpus, we use it to train the 3 modules of our automatic system, as detailed in §4.
2.1 Data Collection and Preprocessing
We first collect metadata for all 1.6 million papers posted on arXiv between March 1996 and December 2019. We then randomly select 1,000 article groups from the 600k papers that have more than one version available. To extract plain text from the LaTeX source code of these papers, we improved the open-source OpenDetex[2] package to better handle macros, user-defined commands, and additional LaTeX files imported by the input commands in the main file.[3] We find this method is less error-prone for extracting plain text than other libraries such as Pandoc,[4] which is used in (Cohan et al., 2018; Roush and Balaji, 2020). Among the randomly selected 1,000 article groups, we obtained plain text for 751 complete groups, with a total of 1,790 versions of papers, that came with the original LaTeX source code and contained text content that was understandable without an overwhelming number of math equations. A breakdown of the filtered groups is provided in Appendix A.

[1] For example, the paper titled "Attention Is All You Need" (https://arxiv.org/abs/1706.03762) has five versions on arXiv submitted by the authors, constituting four revisions (v1-v2, v2-v3, v3-v4, v4-v5).
[2] https://github.com/pkubowicz/opendetex
[3] Our code is released at https://tiny.one/arxivedits
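For illustration, a minimal stand-in for this collection step might fetch version metadata from arXiv's public Atom API and strip LaTeX by shelling out to a detex executable. This is a sketch under those assumptions, not the released collection code (see footnote 3); the paper does not specify which metadata interface was used.

```python
import subprocess
import urllib.request
import xml.etree.ElementTree as ET

def fetch_metadata(arxiv_id: str) -> ET.Element:
    # arXiv's public Atom API; whether this interface was used for the
    # actual crawl is an assumption made for illustration.
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

def latex_to_text(tex_path: str) -> str:
    # Shell out to a detex-style executable (assumed to be on PATH).
    # The improved OpenDetex additionally handles macros, user-defined
    # commands, and files pulled in via \input.
    result = subprocess.run(["detex", tex_path],
                            capture_output=True, text=True, check=True)
    return result.stdout
```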
2.2 Paragraph and Sentence Alignment
Sentence alignment can capture all document-level revision operations, including the insertion, deletion, rephrasing, splitting, merging, and reordering of sentences and paragraphs (see Figure 1 for an example). Therefore, we propose the following 2-step annotation method to manually align sentences across the 1,039 adjacent version pairs (e.g., v0-v1, v1-v2) from the 751 selected article groups; the alignments between non-adjacent version pairs (e.g., v0-v2) can then be derived automatically.
1. Align paragraphs using a lightweight alignment algorithm that we designed based on Jaccard similarity (Jaccard, 1912) (more details in Appendix B; see also the sketch after this list). Based on a pilot study on 18 article pairs, it covers 92.1% of non-identical aligned sentence pairs. Aligning paragraphs first significantly reduces the number of sentence pairs that need to be annotated.

2. Collect sentence alignment annotations for every possible pair of sentences in the aligned paragraphs using Figure-Eight,[5] a crowdsourcing platform. We ask 5 annotators to classify each pair into one of the following categories: aligned, partially-aligned, or not-aligned. Annotators are required to spend at least 25 seconds on each question. The annotation instructions and interface can be found in Appendix D. We embed one hidden test question in every five questions, and workers need to maintain an accuracy above 85% on the test questions to continue working on the task.
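As referenced in step 1 above, the core of the paragraph aligner is a Jaccard similarity over word sets. The sketch below shows the matching idea only; the greedy strategy and the 0.5 threshold are illustrative assumptions, and the actual algorithm is described in Appendix B.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa or wb) else 0.0

def align_paragraphs(old_paras: list[str], new_paras: list[str],
                     threshold: float = 0.5) -> list[tuple[int, int]]:
    """Greedily pair each old paragraph with its most similar new one.
    The threshold value here is hypothetical, not the authors' setting."""
    pairs = []
    for i, p in enumerate(old_paras):
        best_j, best_score = -1, 0.0
        for j, q in enumerate(new_paras):
            score = jaccard(p, q)
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= threshold:
            pairs.append((i, best_j))
    return pairs
```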
We skip aligning the 5.8% of sentences that contain too few words or too many special tokens. They are still retained in the dataset for completeness and are marked with a special token. More details about the annotation process are in Appendices A and B. In total, we spent $3,776 to annotate 13,008 sentence pairs from 751 article groups.
[4] https://pandoc.org/
[5] https://www.figure-eight.com/
Operation at Document-level    | Count
# of sent. insertion (0-to-1)  | 25,229
# of sent. deletion (1-to-0)   | 17,315
# of sent. rephrasing (1-to-1) | 17,755
# of sent. splitting (1-to-n)  | 378
# of sent. merging (n-to-1)    | 269
# of sent. fusion (m-to-n)     | 142
# of sent. copying (1-to-1)    | 95,110

Table 1: Statistics of document-level revision in our ARXIVEDITS corpus, based on manually annotated sentence alignment.
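The operation categories in Table 1 follow directly from the cardinality of each alignment group. A sketch of this bookkeeping, assuming each group is given as the sets of aligned source and target sentence indices plus a flag marking unchanged text:

```python
def operation_type(src_ids: set[int], tgt_ids: set[int], identical: bool) -> str:
    """Classify a document-level operation from alignment cardinality,
    mirroring the categories of Table 1."""
    if not src_ids:
        return "insertion"        # 0-to-1
    if not tgt_ids:
        return "deletion"         # 1-to-0
    if len(src_ids) == 1 and len(tgt_ids) == 1:
        return "copying" if identical else "rephrasing"  # 1-to-1
    if len(src_ids) == 1:
        return "splitting"        # 1-to-n
    if len(tgt_ids) == 1:
        return "merging"          # n-to-1
    return "fusion"               # m-to-n
```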
The 751 article groups are split 526/75/150 into train/dev/test sets for the automatic sentence alignment experiments in §4. The inter-annotator agreement is 0.614, measured by Cohen's kappa (Artstein and Poesio, 2008). To verify the crowdsourcing annotation quality, an in-house annotator manually aligned sentences for 10 randomly sampled groups comprising 14 article pairs. Taking the in-house annotation as gold, the majority vote of the crowdsourced annotations achieves an F1 of 94.2 on these 10 paper groups.
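For reference, the majority vote and agreement statistics can be computed as sketched below, assuming every sentence pair carries five labels. Cohen's kappa is defined for two raters; averaging over annotator pairs, as done here, is one common choice, and the paper does not spell out its exact aggregation.

```python
from collections import Counter
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def majority_vote(labels_per_pair: list[list[str]]) -> list[str]:
    """Reduce the five crowd labels per sentence pair to a single label."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_pair]

def mean_pairwise_kappa(annotators: list[list[str]]) -> float:
    """Average Cohen's kappa over all pairs of annotators, where each
    annotator contributes one label per item in the same order."""
    kappas = [cohen_kappa_score(a, b) for a, b in combinations(annotators, 2)]
    return sum(kappas) / len(kappas)
```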
2.3 Fine-grained Edits with Varied Intentions
Sentence-level revision involves the insertion, deletion, substitution, and reordering of words and phrases. Multiple edits may be tangled together in one sentence, with each edit made for a different purpose (see an example in Figure 1). Correctly detecting and classifying these edits is a challenging problem. We first introduce the formal definition of edits and our proposed intention taxonomy, followed by the annotation procedure.
Definition of Span-level Edits. A sentence-level revision R consists of the original sentence s, the target sentence t, and a series of fine-grained edits e_i. Each edit e_i is defined as a tuple (s_{a:b}, t_{c:d}, I), indicating that span [s_a, s_{a+1}, ..., s_b] in the original sentence is transformed into span [t_c, t_{c+1}, ..., t_d] in the target sentence, with an intention label I ∈ 𝓘 (defined in Table 2). The type of edit can be recognized from the spans s_{a:b} and t_{c:d}: s_{a:b} = [NULL] indicates insertion, t_{c:d} = [NULL] deletion, s_{a:b} = t_{c:d} reordering, and s_{a:b} ≠ t_{c:d} substitution.
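This definition maps directly onto a small data structure. Below is a sketch with field and method names of our own choosing; each span is represented by its tokens so that reordering (s_{a:b} = t_{c:d}) can be checked by simple equality.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    src: tuple[str, ...]   # tokens s_a ... s_b; the empty tuple encodes [NULL]
    tgt: tuple[str, ...]   # tokens t_c ... t_d; the empty tuple encodes [NULL]
    intention: str         # label I from the taxonomy in Table 2

    @property
    def kind(self) -> str:
        if not self.src:
            return "insertion"     # s_{a:b} = [NULL]
        if not self.tgt:
            return "deletion"      # t_{c:d} = [NULL]
        if self.src == self.tgt:
            return "reordering"    # same tokens appear at a new position
        return "substitution"      # s_{a:b} != t_{c:d}

# For example (intention label chosen for illustration only):
# Edit(("correspondence", "with"), ("corresponding", "to"),
#      "Improve Language").kind evaluates to "substitution".
```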
Edit Intention Taxonomy. We propose a new taxonomy to comprehensively capture the intentions of text revision in the scientific writing domain, as shown in Table 2. Each edit is classified into one of the following categories: Improve Language, Correct Grammar/Typo, Update Content, and Adjust Format.
Intention Label | Definition | Example | %
Improve Language | | | 28.6%
- More Accurate/Specific | Minor adjustment to improve the accuracy or specificity of the description. | Further, we suggest a relativistic-invariant protocol for quantum information processing communication. | 11.5%
- Improve Style | Make the text sound more professional or coherent without altering the meaning. | ... due to hydrodynamic interactions among cells in addition with besides self-generated force ... | 8.7%
- Simplify | Simplify complex concepts or delete redundant content to improve readability. | These include new transceiver architecture ( TXRU array connected architecture ) ... | 7.6%
- Other | Other language improvements that do not fall into the above categories. | ... due to changes in fuels used , or , in other words , associated to changes of technologies . | 0.8%
Correct Grammar/Typo | Fix grammatical errors, correct typos, or smooth out grammar needed by other changes. | Not Note that the investigator might reconstruct each function ... | 25.4%
Update Content | Update a large amount of scientific content, or add or delete a major fact. | ... characterized by long range hydrodynamic term and self-generated force due to actin remodeling. | 28.8%
Adjust Format | Adjust tables, figures, equations, references, citations, punctuation, etc. | Similarly to what we did in Figure Fig. [REF] , the statistical results obtained by means of ... | 17.2%

Table 2: A taxonomy (𝓘) of edit intentions in scientific writing revisions. In each example, text with red background denotes the edit; spans with strike-through mark deleted content, while otherwise the span is inserted.
Since our goal is to improve writing quality, we further break the Improve Language type into four fine-grained categories. During the design, we extensively consulted prior literature on text revision (Faigley and Witte, 1981; Fitzgerald, 1987; Daxenberger, 2016), edit categorization (Bronner and Monz, 2012; Yang et al., 2017), and analyses in related areas such as Wikipedia (Daxenberger and Gurevych, 2013) and argumentative essays (Zhang et al., 2017). The taxonomy was improved over several rounds based on feedback from four NLP researchers and two in-house annotators with linguistic backgrounds.
Annotating Edits. In a pilot study, we found that directly annotating fine-grained edits is a tedious and complicated task for annotators, as it requires separating and matching edited spans across two sentences. To assist the annotators, we use monolingual word alignment (Lan et al., 2021), which can find correspondences between words and phrases with similar meanings in two sentences, as an intermediate step to reduce the cognitive load during annotation. We find that, compared to strict word-to-word matching, edits usually have larger granularity and may cross linguistic boundaries. For example, in Figure 1, "corresponding to" and "correspondence with" should be treated as a whole to be meaningful and labeled with an intention. Therefore, the edits can be annotated by adjusting the boundaries of the span alignment. We propose the following 2-step method that leverages word alignment to assist the annotation of edits:
1. Collect word alignment annotations by asking in-house annotators to manually correct the automatic word alignments generated by the neural semi-CRF word alignment model (Lan et al., 2021). The aligner is trained on the MTRef dataset and achieves state-of-the-art performance on the monolingual word alignment task with 92.4 F1.

2. Annotate edits by having in-house annotators inspect and correct the fine-grained edits that are extracted from the word alignments using simple heuristics (detailed in §4.1; a simplified sketch follows this list). Two principles are followed during the correction: (1) each edit should have a clear intention and relatively clear phrase boundaries; (2) span pairs in a substitution should be semantically related, and otherwise should be treated as a separate insertion and deletion.
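The sketch below illustrates the general idea behind these heuristics in simplified, token-level form: unaligned source tokens become deletions, unaligned target tokens become insertions, and aligned but non-identical tokens become substitutions. The actual heuristics in §4.1 additionally merge adjacent alignment links into phrase-level spans.

```python
def edits_from_alignment(src: list[str], tgt: list[str],
                         align: set[tuple[int, int]]) -> list[tuple]:
    """src, tgt: token lists; align: word-alignment links (i, j).
    A simplified stand-in for the heuristics of Section 4.1."""
    src_aligned = {i for i, _ in align}
    tgt_aligned = {j for _, j in align}
    edits = []
    for i, tok in enumerate(src):          # unaligned source -> deletion
        if i not in src_aligned:
            edits.append(("deletion", tok, None))
    for j, tok in enumerate(tgt):          # unaligned target -> insertion
        if j not in tgt_aligned:
            edits.append(("insertion", None, tok))
    for i, j in sorted(align):             # aligned but changed -> substitution
        if src[i] != tgt[j]:
            edits.append(("substitution", src[i], tgt[j]))
    return edits
```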
We manually annotate insertions, deletions, and substitutions, and derive reordering automatically, since it can be reliably found by heuristics. Due to slight variance in granularity, it is possible that more than one answer is acceptable. Therefore, our annotation includes all alternative edits for sentence pairs in the dev and test sets, among which 16% have more than one answer.
Overall, we found that our method annotates more accurate and fine-grained edits compared to prior work that uses the diff algorithm. The diff method minimizes edit distance regardless of semantic meaning; as a result, the extracted edits are coarse-grained and may contain many errors (detailed in Table 3).
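For contrast, the coarse-grained behavior can be reproduced with Python's standard difflib, which, like diff, aligns purely on surface matching blocks with no notion of semantic relatedness (a rough stand-in for the Myers algorithm, run here on an invented example):

```python
from difflib import SequenceMatcher

src = "the model achieve good result on this dataset".split()
tgt = "the model achieves strong results on this dataset".split()

for tag, i1, i2, j1, j2 in SequenceMatcher(a=src, b=tgt).get_opcodes():
    if tag != "equal":
        print(tag, src[i1:i2], "->", tgt[j1:j2])
# Prints a single coarse block:
#   replace ['achieve', 'good', 'result'] -> ['achieves', 'strong', 'results']
# conflating a grammar fix with two separate word-choice edits.
```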
Annotating Intention. As intentions can differ subtly, correctly identifying them is a challenging task.