PreprintMatch: a tool for preprint publication detection
applied to analyze global inequities in scientific publishing
Peter Eckmann1, Anita Bandrowski2
1Department of Computer Science and Engineering, UC San Diego, La Jolla, CA,
United States
2Department of Neuroscience, UC San Diego, La Jolla, CA, United States
* abandrowski@ucsd.edu
Abstract
Preprints, versions of scientific manuscripts that precede peer review, are growing in
popularity. They offer an opportunity to democratize and accelerate research, as they
have no publication costs or a lengthy peer review process. Preprints are often later
published in peer-reviewed venues, but these publications and the original preprints are
frequently not linked in any way. To address this, we developed a tool, PreprintMatch, to
find matches between preprints and their corresponding published papers, if they exist.
This tool outperforms existing techniques for matching preprints and papers, both in
matching performance and speed. PreprintMatch was applied to search for matches
between preprints (from bioRxiv and medRxiv) and PubMed. The preliminary nature
of preprints offers a unique perspective into scientific projects at a relatively early stage,
and with better matching between preprint and paper, we explored questions related to
research inequity. We found that preprints from low income countries are published as
peer-reviewed papers at a lower rate than those from high income countries (39.6% and 61.1%,
respectively), and our data is consistent with previous work citing a lack of resources,
lack of stability, and policy choices to explain this discrepancy. Preprints from low
income countries were also found to be published more quickly (178 vs 203 days) and with
less title, abstract, and author similarity to the published version compared to high
income countries. Low income countries add more authors from the preprint to the
published version than high income countries (0.42 authors vs 0.32, respectively), a
practice that is significantly more frequent in China compared to similar countries.
Finally, we find that some publishers publish work with authors from lower income
countries more frequently than others. PreprintMatch is available at
https://github.com/PeterEckmann1/preprint-match (RRID:SCR_022302), and
data at https://zenodo.org/record/4679875.
Introduction
Preprints are versions of scientific manuscripts that precede formal peer review [1].
They are available as open access from a number of preprint servers, which together
span physics [2, 3], mathematics [4], computer science [5], biology, and medicine [6, 7].
Authors are able to post their preprints for free because the servers are maintained by
institutions or foundations. Servers such as arXiv often perform a very permissive
scientific relevance check, but do not check for methodological soundness or perform any
other sort of peer review [8, 9]. Supporters of preprints claim they make the publication
of important results faster, democratize scientific publishing, and make public criticism
possible, allowing papers to be further vetted by the community instead of a select
group of peer reviewers [10–13]. Skeptics, on the other hand, worry that unvetted
scientific documents released to the public risk spreading falsehoods and pushing
niche groups and topics out of the greater research enterprise altogether [14–16].
ArXiv, one of the first preprint servers, was launched in 1991 to make the sharing of
high-energy physics manuscripts easier among colleagues [3]. It began as an email server
hosted on a single computer in Los Alamos National Laboratory that sent out
manuscripts to a select mailing list. Within a few years, arXiv was turned into a web
resource. Other fields, like condensed-matter physics, and later computer science and
mathematics, began using arXiv and eventually adopted it as their primary form of
communication. Ginsparg (2011) [3], the founder of arXiv, believes its growth has
helped to democratize science in the fields that have adopted it. Indeed, many of the
previously mentioned fields now use arXiv as their primary source of scholarly
communication [2, 17, 18].
Inspired by arXiv, bioRxiv was launched in 2013 as a preprint server focused
specifically on the biological sciences [6]. The sister server to bioRxiv, medRxiv, was
launched in 2019 [19]. Together, these servers contain over 160,000 biomedical
preprints [20, 21], a number which continues to grow rapidly [6]. This initial growth was
greatly accelerated by the COVID-19 pandemic, where fast dissemination of research
was critical [22] and together, these servers now have over 20,000 COVID-19 related
preprints [23].
Despite the widespread adoption of arXiv in many fields, biology and medicine have
been slow to adopt preprints beyond their use in a pandemic [7, 10, 24]. While the utility
of preprints during a pandemic is especially clear, e.g. a quick time to publication,
biomedicine in general still tends to rely on peer-reviewed work [25, 26] despite the early
growth of preprint servers in the field. As an example of this hesitancy, the advisory
board behind the conception of PubMed Central, a free full-text archive for biomedicine,
elected to disallow the posting of non-refereed works, despite knowledge of the
success of the arXiv model, for fear of losing publisher support [3, 27]. More recently,
however, the National Institutes of Health (NIH) allowed researchers to claim preprints
as interim research products in grant applications [28], indicating a level of support for
preprints they have not had in the past. As much of the biomedical research community
relies on the NIH for funding, this is an important step toward greater adoption of
preprints [7]. Another important step is integration of preprints into the primary
database for biomedical research, PubMed. In 2020, the NIH began a
preprint pilot to index preprints in PubMed Central and, by extension, PubMed [29].
The preliminary nature of preprints offers a unique perspective toward the
development of scientific projects. The relatively lower economic and time barrier to
posting means that work is made available earlier in a project’s development, and may
even be the only public output of a project [30, 31]. The low barrier to entry for
preprints could be particularly powerful for developing countries, where lack of financial
resources for publication and lack of institutional library support make research,
especially that published in peer-reviewed channels, more difficult [32–34].
It is well known that developing countries are underrepresented in the research
world [35–37], and increasing research output from developing countries may be
beneficial to their economic development [38]. It has been suggested that low research
output stems from high publication costs, lack of institutional support, lack of external
funding, bias, high teaching burden, and language issues [35, 39–46]. The open access
movement promises to overcome some of these issues by making research widely
available to researchers that do not have institutional support [34]. Projects like
SciELO aim to increase visibility of open access works from developing countries [47],
especially non-English-speaking ones [48], but the works they publicize are often still
published in high-cost journals [46, 47].
Much of the proposed reasoning for why developing countries are underrepresented
in research is based on interviews of researchers from developing countries and analysis
of policy [49–51]. Many studies that seek to explain the lack of research in developing
countries are qualitative [35, 42, 52–57], but we found few quantitative references. Since
there is a lack of quantitative studies, we wanted to explore the lack of research in
developing countries in a quantitative fashion through preprints, a rich source of
research data from these countries.
Such an analysis of conversion from preprint to paper is not trivial, as we must know
which published work corresponds to which preprint, or when no published work exists
at all. Preprint servers like bioRxiv attempt to link preprints and papers, but, wanting
to avoid incorrectly matching works, use strict rules that may miss published works that
significantly differ from the original preprint [9]. Indeed, Abdill and Blekhman [58]
suggest that bioRxiv does not report up to 53% of preprints that are later published as
papers. Therefore, although reasonable at the platform level, using bioRxiv's reported
publications is not entirely useful for an analysis of preprint-to-paper conversion, as
we miss many published works, especially in the interesting case where a work changes
significantly from preprint to publication.
Cabanac et al. [59], Fraser et al. [30], Serghiou and Ioannidis [60], and Fu and
Hughey [61] have analyzed preprint and paper pairs beyond what bioRxiv reports.
Fraser et al. and Cabanac et al. query an API multiple times, an approach that is both
sensitive and specific but takes a long time per preprint, making these tools less suitable
for large-scale analysis. Serghiou and Ioannidis and Fu and Hughey use Crossref's reported
matches, which are generated based on bioRxiv’s own matches, and therefore have the
same specificity limitations [58, 62]. Our contributions in this paper are twofold: we
present a new tool, PreprintMatch, to match preprints and papers with high efficiency,
and compare our tool to previous techniques. Our code is available at
https://github.com/PeterEckmann1/preprint-match. We use this tool to explore
preprints as a window into global biomedical research.
Materials and methods
PreprintMatch
Preprints
The preprint servers bioRxiv and medRxiv were used for preprint data. These two sister
servers were chosen because they are among the largest biomedical preprint servers [63],
their data are easily available, and they have an English-only submission policy [9], which
allowed for a valid comparison with PubMed's large subset of English papers. Additionally,
bioRxiv and medRxiv search for and manually confirm preprints that are published as
papers, although their search is not rigorous.
Preprint metadata was obtained from the Rxivist platform [58]. All preprints from
both bioRxiv and medRxiv published from the inception of bioRxiv (November 11,
2013) to the date of the May 4, 2021 Rxivist database dump
(doi:10.5281/zenodo.4738007) were included in our analyses. Data is from the most
recent version of all preprints as of May 4, 2021. Data was downloaded through a
Docker container (https://hub.docker.com/r/blekhmanlab/rxivist_data), and
accessed via a pgAdmin runtime.
Published papers
PubMed, the primary database for biomedical research, was used to extract metadata
for published papers. While it does not cover the entire biomedical literature, PubMed
indexes the vast majority of biomedical journals and is the only source for the large
volume of high-quality metadata we needed. Metadata for all papers published from the
date of the earliest paper indexed by PubMed (May 1, 1979) to December 12, 2020 were
downloaded and parsed. Data was obtained through XML files available at the NLM FTP server
(https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/190101/index.html,
RRID:SCR_004846). XMLs were parsed with lxml version 4.4.2.0, and the DOI, PMID,
title, abstract, publication date, and authors were extracted for each article. While
PubMed provides multiple article dates, the <PubDate> field was used for this analysis,
as it is the date that the work becomes available to the public. The exact date format
often varied across journals, so a function was written to canonicalize the date
(https://github.com/PeterEckmann1/preprint-match/blob/master/matcher/
data_sources.py#L16). As we needed exact dates for time period calculations, we
applied the following rules: when no day could be found in the date-of-publication
string, it was assumed to be the first of the month, and when no month could be found,
it was assumed to be January. These rules were needed for only a small fraction of papers.
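For illustration, a minimal sketch of these date rules (this is not the exact code at the repository path above, and the field handling is simplified):

```python
from datetime import date

# Month-name lookup for PubMed's abbreviated month strings (e.g. "Mar").
MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

def canonicalize(year, month=None, day=None):
    """Build an exact date from a possibly partial PubMed <PubDate>."""
    if month is None:
        month_num = 1                                       # no month found -> assume January
    elif str(month).isdigit():
        month_num = int(month)                              # numeric month, e.g. "03"
    else:
        month_num = MONTHS.get(str(month)[:3].title(), 1)   # abbreviated month name
    return date(int(year), month_num, int(day) if day else 1)  # no day -> first of month

print(canonicalize("2019"))             # 2019-01-01
print(canonicalize("2019", "Mar"))      # 2019-03-01
print(canonicalize("2019", "Mar", 15))  # 2019-03-15
```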
Abstracts were parsed from the <AbstractText> field, titles from <ArticleTitle>, and
DOIs from <ArticleId IdType="doi">. These fields were stored in a single table
in a local PostgreSQL 13 Docker container, along with a primary key column. The
author names were extracted from the XML's <AuthorList>, and for each <Author> in
that list a single string was constructed, using the format "<ForeName> <LastName>" if
a <ForeName> was present, otherwise just "<LastName>". This representation follows
the specification given in
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html, and for
our purposes, was sufficient for the vast majority of papers.
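As an illustration, a small lxml sketch of this author-string construction (the XML snippet is a hypothetical example of a PubMed <AuthorList>):

```python
from lxml import etree

xml = b"""<AuthorList>
  <Author><ForeName>Peter</ForeName><LastName>Eckmann</LastName></Author>
  <Author><LastName>Bandrowski</LastName></Author>
</AuthorList>"""

authors = []
for author in etree.fromstring(xml).findall("Author"):
    fore = author.findtext("ForeName")
    last = author.findtext("LastName")
    # "<ForeName> <LastName>" if a <ForeName> is present, otherwise just "<LastName>"
    authors.append(f"{fore} {last}" if fore else last)

print(authors)  # ['Peter Eckmann', 'Bandrowski']
```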
Additionally, we excluded all papers with a language other than English (i.e. the
<Language> tag did not contain "eng"), as bioRxiv and medRxiv only accept English
submissions [9]. We assume that non-English publications rarely arise from English
bioRxiv/medRxiv submissions, and any that do would be very difficult to match using
semantic similarity techniques. Finally, we also excluded all papers with any of the
following types (i.e. <PublicationType>), as they are also unlikely to be the published
version of a preprint (or are the preprint itself, which we exclude): {Comment,
Published Erratum, Review, Preprint}.
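A minimal sketch of these exclusion rules, assuming the language and publication-type values have already been parsed from the XML into lists of strings:

```python
EXCLUDED_TYPES = {"Comment", "Published Erratum", "Review", "Preprint"}

def keep_article(languages, publication_types):
    """Return True if the article should be kept as a potential published version."""
    if "eng" not in languages:                       # English articles only
        return False
    return not (set(publication_types) & EXCLUDED_TYPES)

print(keep_article(["eng"], ["Journal Article"]))    # True
print(keep_article(["eng"], ["Review"]))             # False
print(keep_article(["fre"], ["Journal Article"]))    # False
```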
PreprintMatch description
PreprintMatch uses a set of similarity measures and hard-coded rules to find the
published version of a given preprint if it exists, and returns a null result otherwise.
PreprintMatch operates under the assumption that for a given preprint, the highest
similarity paper is its published version. Simple similarity measures, such as
bag-of-words similarity, are adopted by many existing methods but do not capture cases
where the exact choice of words in the preprint and published abstract is different, but
the meaning is the same. This is central when matching across versions, as authors may
use different word choices as they polish their writing but the underlying meaning of the
paper does not change. Therefore, we adopt a more sophisticated semantic measure of
similarity using word vector representations.
For computing similarity, we use titles, abstracts, and author lists. Many previous
works do not use abstracts due to speed, space, and availability limitations, but we are
able to include abstracts in our similarity measures with the development of a custom
database system optimized for large similarity queries. This allows us to incorporate
additional information from abstracts and achieve state-of-the-art matching performance.
First, the title and abstracts, as obtained from the results above, are transformed
into their word vector representations using fastText ([64], RRID:SCR_022301), a library
for learning and generating word vector representations. We trained word vectors, as
opposed to using a pretrained model, because many general English pretrained models
do not contain vectors for domain-specific words, e.g. disease names. While language
models that were trained on scientific text exist, like SciBERT [65], they were too slow
for our purposes. Therefore, we used fastText to generate a set of domain-relevant
vectors and retain high speeds. fastText additionally uses subword information (character n-grams) to guess vectors
for words not present in the training set, which is often a problem when dealing with
highly specific terms that can be found in the biomedical literature. We train our word
vectors by taking a random sample of 10% (using PostgreSQL's TABLESAMPLE clause) of
all abstracts and titles, and train vectors for both titles and abstracts independently.
We train with default hyperparameters and a vector dimension of 300. After training,
we used the model to map all abstracts and titles present in our published paper
database to vectors using fastText's get_sentence_vector() function, which computes a
normalized average across all word vectors present in the abstract or title. For preprints
entered into PreprintMatch, we compute the abstract and title vectors again using the
same process.
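A minimal sketch of this step using the fasttext Python package (the input file name is a placeholder; in practice it would hold one abstract or title per line, exported from the 10% sample):

```python
import fasttext

# Train 300-dimensional word vectors on the sampled text, one document per line.
model = fasttext.train_unsupervised("abstract_sample.txt", dim=300)

# get_sentence_vector() averages the (length-normalized) word vectors of a string.
vec = model.get_sentence_vector("CRISPR-Cas9 screening identifies regulators of cell growth")
print(vec.shape)  # (300,)
```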
After we obtain vectors for the preprint and published papers, cosine similarity is
used to measure similarity. For speed reasons, we save all abstract and title vectors in
separate NumPy (RRID:SCR_008633) matrix files, which are loaded with np.memmap().
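A simplified pure-NumPy sketch of this memory-mapped similarity scan (the file name, matrix size, and random data are placeholders; PreprintMatch performs the same computation with a custom Numba-compiled routine, described next):

```python
import numpy as np

DIM = 300
# Stand-in for the real vector file: one float32 row per published paper.
paper_vecs = np.memmap("abstract_vectors.dat", dtype=np.float32,
                       mode="w+", shape=(1000, DIM))
paper_vecs[:] = np.random.rand(1000, DIM).astype(np.float32)

def top_k_cosine(query_vec, matrix, k=100):
    """Indices of the k rows of `matrix` most cosine-similar to `query_vec`."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = (matrix @ q) / np.linalg.norm(matrix, axis=1)
    return np.argpartition(-sims, k)[:k]

# In practice the query is the preprint's title or abstract vector.
candidates = top_k_cosine(paper_vecs[0], paper_vecs)
```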
Then, we use a custom Numba [66] function to compute cosine similarity between the
preprint's vectors and all vectors in our dataset of published papers. The union of the
top 100 most similar paper titles and abstracts for each preprint is calculated, and
author information is fetched from our local database for these candidate papers. Author
similarity is computed as the Jaccard similarity between the preprint and paper author
sets. Two authors are considered to be matching if their last names match exactly. We
also check the published paper author last names against the preprint author first
names, as we observed author first and last names occasionally being flipped in
preprints. We take whichever of these two scores is higher. Having computed author,
title, and abstract similarities for the 100 most promising paper candidates, a classifier
is used to declare a match or not, taking into account title and abstract cosine similarity
and Jaccard similarity between authors. We trained a support vector machine (SVM)
for this task, using a hand-curated set of 100 matching and 100 non-matching
preprint-paper pairs. The matching pairs were obtained from a random sample of
100 bioRxiv-announced matches. The non-matching pairs were obtained by taking a random
set of bioRxiv preprints with no announced publication, pairing each with the published
paper that had the highest product of abstract and title similarity, and manually
removing pairs that were in fact true matches. This procedure yielded pairs with high
similarity that are nevertheless not matches, so that our SVM is trained to distinguish
between highly similar papers. Our SVM was trained to predict a binary label given
3 inputs (title, abstract, and author similarity) using sklearn.svm.SVC with default
hyperparameters. While this procedure achieves high accuracy and recall, we improve it
slightly with additional hard-coded rules that were chosen after failure analysis. When
the SVM gives a negative result for one of the candidate pairs, a match is still declared if the
first 7 words of the abstracts, or the text before a colon in the title, match exactly
between the preprint and paper. This captures cases when the abstract changes
significantly but the introduction stays exactly the same, and when a specific tool or
method is named in the title (e.g. <Tool name>: a tool for ...), respectively.
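A minimal sketch of the author-similarity computation and the final match decision (the data structures, toy training data, and helper names are assumptions for illustration; the real SVM is fit on the 200 hand-curated pairs described above):

```python
import numpy as np
from sklearn.svm import SVC

def author_similarity(preprint_authors, paper_authors):
    """Jaccard similarity over exactly-matching last names; also compares paper
    last names against preprint first names (names are occasionally flipped)
    and returns whichever score is higher."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0
    pre_last = [n.split()[-1].lower() for n in preprint_authors]
    pre_first = [n.split()[0].lower() for n in preprint_authors]
    pub_last = [n.split()[-1].lower() for n in paper_authors]
    return max(jaccard(pre_last, pub_last), jaccard(pre_first, pub_last))

# Toy training data so the sketch runs; each row is
# [title similarity, abstract similarity, author similarity].
X_toy = np.array([[0.95, 0.90, 0.80], [0.40, 0.30, 0.00]])
y_toy = np.array([1, 0])
clf = SVC().fit(X_toy, y_toy)  # default hyperparameters, as in the text

def is_match(title_sim, abstract_sim, author_sim, preprint, paper):
    """SVM decision plus the two hard-coded fallback rules."""
    if clf.predict([[title_sim, abstract_sim, author_sim]])[0] == 1:
        return True
    # Fallback 1: the first 7 words of the abstracts match exactly.
    if preprint["abstract"].split()[:7] == paper["abstract"].split()[:7]:
        return True
    # Fallback 2: the text before a colon in the title matches exactly
    # (e.g. "<Tool name>: a tool for ...").
    if ":" in preprint["title"] and \
            preprint["title"].split(":")[0] == paper["title"].split(":")[0]:
        return True
    return False
```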