PreprintMatch: a tool for preprint publication detection
applied to analyze global inequities in scientific publishing
Peter Eckmann1, Anita Bandrowski2
1Department of Computer Science and Engineering, UC San Diego, La Jolla, CA,
United States
2Department of Neuroscience, UC San Diego, La Jolla, CA, United States
* abandrowski@ucsd.edu
Abstract
Preprints, versions of scientific manuscripts that precede peer review, are growing in
popularity. They offer an opportunity to democratize and accelerate research, as they
have no publication costs or a lengthy peer review process. Preprints are often later
published in peer-reviewed venues, but these publications and the original preprints are
frequently not linked in any way. To address this, we developed a tool, PreprintMatch, to
find matches between preprints and their corresponding published papers, if they exist.
This tool outperforms existing techniques for matching preprints and papers, both in
matching performance and speed. PreprintMatch was applied to search for matches
between preprints (from bioRxiv and medRxiv) and PubMed. The preliminary nature
of preprints offers a unique perspective into scientific projects at a relatively early stage,
and with better matching between preprint and paper, we explored questions related to
research inequity. We found that preprints from low income countries are published as
peer-reviewed papers at a lower rate than those from high income countries (39.6% and 61.1%,
respectively), and our data is consistent with previous work citing a lack of resources,
lack of stability, and policy choices to explain this discrepancy. Preprints from low
income countries were also found to be published more quickly (178 vs 203 days) and with
less title, abstract, and author similarity to the published version compared to high
income countries. Low income countries add more authors from the preprint to the
published version than high income countries (0.42 authors vs 0.32, respectively), a
practice that is significantly more frequent in China compared to similar countries.
Finally, we find that some publishers publish work with authors from lower income
countries more frequently than others. PreprintMatch is available at
https://github.com/PeterEckmann1/preprint-match (RRID:SCR_022302), and
data at https://zenodo.org/record/4679875.
Introduction
Preprints are versions of scientific manuscripts that precede formal peer review [1].
They are available as open access from a number of preprint servers, which together
span physics [2, 3], mathematics [4], computer science [5], biology, and medicine [6, 7].
Authors are able to post their preprints for free because the servers are maintained by
institutions or foundations. Servers such as arXiv often perform a very permissive
scientific relevance check, but do not check for methodological soundness or perform any
other sort of peer review [8, 9]. Supporters of preprints claim they make the publication
of important results faster, democratize scientific publishing, and make public criticism
possible, allowing papers to be further vetted by the community instead of a select
group of peer reviewers [10–13]. Skeptics, on the other hand, worry that unvetted
scientific documents released to the public risk spreading falsehoods and pushing
niche groups and topics out of the greater research enterprise altogether [14–16].
ArXiv, one of the first preprint servers, was launched in 1991 to make the sharing of
high-energy physics manuscripts easier among colleagues [3]. It began as an email server
hosted on a single computer in Los Alamos National Laboratory that sent out
manuscripts to a select mailing list. Within a few years, arXiv was turned into a web
resource. Other fields, like condensed-matter physics, and later computer science and
mathematics, began using arXiv and eventually adopted it as their primary form of
communication. Ginsparg (2011) [3], the founder of arXiv, believes its growth has
helped to democratize science in the fields that have adopted it. Indeed, many of the
previously mentioned fields now use arXiv as their primary source of scholarly
communication [2, 17, 18].
Inspired by arXiv, bioRxiv was launched in 2013 as a preprint server focused
specifically on the biological sciences [6]. The sister server to bioRxiv, medRxiv, was
launched in 2019 [19]. Together, these servers contain over 160,000 biomedical
preprints [20, 21], a number which continues to grow rapidly [6]. This initial growth was
greatly accelerated by the COVID-19 pandemic, where fast dissemination of research
was critical [22] and together, these servers now have over 20,000 COVID-19 related
preprints [23].
Despite the widespread adoption of arXiv in many fields, biology and medicine have
been slow to adopt preprints beyond their use in a pandemic [7, 10, 24]. While the utility
of preprints during a pandemic is especially clear, e.g. a quick time to publication,
biomedicine in general still tends to rely on peer-reviewed work [25, 26] despite the early
growth of preprint servers in the field. As an example of this hesitancy, the advisory
board behind the conception of PubMed Central, a free full-text archive for biomedicine,
elected to disallow the posting of non-refereed works, despite knowledge of the
success of the arXiv model, for fear of losing publisher support [3, 27]. More recently,
however, the National Institutes of Health (NIH) allowed researchers to claim preprints
as interim research products in grant applications [28], indicating a level of support for
preprints they have not had in the past. As much of the biomedical research community
relies on the NIH for funding, this is an important step toward greater adoption of
preprints [7]. Another important step is integration of preprints into the primary
database for biomedical research, PubMed. In 2020, the NIH began a
preprint pilot to index preprints in PubMed Central and, by extension, PubMed [29].
The preliminary nature of preprints offers a unique perspective toward the
development of scientific projects. The relatively lower economic and time barrier to
posting means that work is made available earlier in a project’s development, and may
even be the only public output of a project [30, 31]. The low barrier to entry for
preprints could be particularly powerful for developing countries, where lack of financial
resources for publication and lack of institutional library support make research,
especially that published in peer-reviewed channels, more difficult [32–34].
It is well known that developing countries are underrepresented in the research
world [35–37], and increasing research output from developing countries may be
beneficial to their economic development [38]. It has been suggested that low research
output stems from high publication costs, lack of institutional support, lack of external
funding, bias, high teaching burden, and language issues [35, 39–46]. The open access
movement promises to overcome some of these issues by making research widely
available to researchers that do not have institutional support [34]. Projects like
SciELO aim to increase visibility of open access works from developing countries [47],
especially non-English-speaking ones [48], but the works they publicize are often still
published in high-cost journals [46, 47].
Much of the proposed reasoning for why developing countries are underrepresented
in research is based on interviews of researchers from developing countries and analysis
of policy [49–51]. Many studies that seek to explain the lack of research in developing
countries are qualitative [35, 42, 52–57], but we found few quantitative references. Since
there is a lack of quantitative studies, we wanted to explore the lack of research in
developing countries in a quantitative fashion through preprints, a rich source of
research data from these countries.
Such an analysis of conversion from preprint to paper is not trivial, as we must know
which published work corresponds to which preprint, or when no published work exists
at all. Preprint servers like bioRxiv attempt to link preprints and papers, but, wanting
to avoid incorrectly matching works, use strict rules that may miss published works that
significantly differ from the original preprint [9]. Indeed, Abdill and Blekhman [58]
suggest that bioRxiv does not report up to 53% of preprints that are later published as
papers. Therefore, although reasonable at the platform level, using bioRxiv's reported
publications is not entirely useful for an analysis of preprint-to-paper conversion, as
we miss many published works, especially in the interesting case where a work changes
significantly from preprint to publication.
Cabanac et al. [59], Fraser et al. [30], Serghiou and Ioannidis [60], and Fu and
Hughey [61] have analyzed preprint and paper pairs beyond what bioRxiv reports.
Fraser et al. and Cabanac et al. query an API multiple times, an approach that is both
sensitive and specific but takes a long time per preprint, making these tools less suitable
for large-scale analysis. Serghiou and Ioannidis and Fu and Hughey use Crossref's reported
matches, which are generated based on bioRxiv’s own matches, and therefore have the
same specificity limitations [58, 62]. Our contributions in this paper are twofold: we
present a new tool, PreprintMatch, to match preprints and papers with high efficiency,
and compare our tool to previous techniques. Our code is available at
https://github.com/PeterEckmann1/preprint-match. We use this tool to explore
preprints as a window into global biomedical research.
Materials and methods
PreprintMatch
Preprints
The preprint servers bioRxiv and medRxiv were used for preprint data. These two sister
servers were chosen because they are among the largest biomedical preprint servers [63],
their data are easily available, and they have an English-only submission policy [9], which
allowed for a valid comparison with PubMed's large subset of English papers. Additionally,
bioRxiv and medRxiv search for and manually confirm preprints that are published as
papers, although their search is not rigorous.
Preprint metadata was obtained from the Rxivist platform [58]. All preprints from
both bioRxiv and medRxiv published from the inception of bioRxiv (November 11,
2013) to the date of the May 4, 2021 Rxivist database dump
(doi:10.5281/zenodo.4738007) were included in our analyses. Data is from the most
recent version of all preprints as of May 4, 2021. Data was downloaded through a
Docker container (https://hub.docker.com/r/blekhmanlab/rxivist_data), and
accessed via a pgAdmin runtime.
Published papers
PubMed, the primary database for biomedical research, was used to extract metadata
for published papers. While it does not cover the entire biomedical literature, PubMed
indexes the vast majority of biomedical journals and is the only source for the large
volume of high-quality metadata we needed. Metadata for all papers published from the
date of the earliest paper indexed by PubMed (May 1, 1979) to December 12, 2020 were
downloaded and parsed. Data was obtained through XML files available at the NLM FTP server
(https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/190101/index.html,
RRID:SCR_004846). XMLs were parsed with lxml version 4.4.2.0, and the DOI, PMID,
title, abstract, publication date, and authors were extracted for each article. While
PubMed provides multiple article dates, the <PubDate> field was used for this analysis,
as it is the date that the work becomes available to the public. The exact date format
often varied across journals, so a function was written to canonicalize the date
(https://github.com/PeterEckmann1/preprint-match/blob/master/matcher/
data_sources.py#L16). As we needed exact dates for time period calculations, we
applied the following rules: when no day could be found in the date-of-publication
string, it was assumed to be the first of the month, and when no month could be found,
it was assumed to be January. These rules were needed for only a small fraction of papers.
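For illustration, a minimal sketch of these date rules (this is not the exact code at the repository path above, and the field handling is simplified):

```python
from datetime import date

# Month-name lookup for PubMed's abbreviated month strings (e.g. "Mar").
MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

def canonicalize(year, month=None, day=None):
    """Build an exact date from a possibly partial PubMed <PubDate>."""
    if month is None:
        month_num = 1                                       # no month found -> assume January
    elif str(month).isdigit():
        month_num = int(month)                              # numeric month, e.g. "03"
    else:
        month_num = MONTHS.get(str(month)[:3].title(), 1)   # abbreviated month name
    return date(int(year), month_num, int(day) if day else 1)  # no day -> first of month

print(canonicalize("2019"))             # 2019-01-01
print(canonicalize("2019", "Mar"))      # 2019-03-01
print(canonicalize("2019", "Mar", 15))  # 2019-03-15
```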
Abstracts were parsed from the <AbstractText> field, titles from <ArticleTitle>, and
DOIs from <ArticleId IdType="doi">. These fields were stored in a single table
in a local PostgreSQL 13 Docker container, along with a primary key column. The
author names were extracted from the XML's <AuthorList>, and for each <Author> in
that list a single string was constructed, using the format "<ForeName> <LastName>" if
a <ForeName> was present, otherwise just "<LastName>". This representation follows
the specification given in
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html, and for
our purposes, was sufficient for the vast majority of papers.
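As an illustration, a small lxml sketch of this author-string construction (the XML snippet is a hypothetical example of a PubMed <AuthorList>):

```python
from lxml import etree

xml = b"""<AuthorList>
  <Author><ForeName>Peter</ForeName><LastName>Eckmann</LastName></Author>
  <Author><LastName>Bandrowski</LastName></Author>
</AuthorList>"""

authors = []
for author in etree.fromstring(xml).findall("Author"):
    fore = author.findtext("ForeName")
    last = author.findtext("LastName")
    # "<ForeName> <LastName>" if a <ForeName> is present, otherwise just "<LastName>"
    authors.append(f"{fore} {last}" if fore else last)

print(authors)  # ['Peter Eckmann', 'Bandrowski']
```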
Additionally, we excluded all papers with a language other than English (i.e. the
<Language> tag did not contain "eng"), as bioRxiv and medRxiv only accept English
submissions [9]. We assume that non-English publications rarely arise from English
bioRxiv/medRxiv submissions, and any that do would be very difficult to match using
semantic similarity techniques. Finally, we also excluded all papers with any of the
following types (i.e. <PublicationType>), as they are also unlikely to be the published
version of a preprint (or are the preprint itself, which we exclude): {Comment,
Published Erratum, Review, Preprint}.
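A minimal sketch of these exclusion rules, assuming the language and publication-type values have already been parsed from the XML into lists of strings:

```python
EXCLUDED_TYPES = {"Comment", "Published Erratum", "Review", "Preprint"}

def keep_article(languages, publication_types):
    """Return True if the article should be kept as a potential published version."""
    if "eng" not in languages:                       # English articles only
        return False
    return not (set(publication_types) & EXCLUDED_TYPES)

print(keep_article(["eng"], ["Journal Article"]))    # True
print(keep_article(["eng"], ["Review"]))             # False
print(keep_article(["fre"], ["Journal Article"]))    # False
```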
PreprintMatch description
PreprintMatch uses a set of similarity measures and hard-coded rules to find the
published version of a given preprint if it exists, and returns a null result otherwise.
PreprintMatch operates under the assumption that for a given preprint, the highest
similarity paper is its published version. Simple similarity measures, such as
bag-of-words similarity, are adopted by many existing methods but do not capture cases
where the exact choice of words in the preprint and published abstract is different, but
the meaning is the same. This is central when matching across versions, as authors may
use different word choices as they polish their writing but the underlying meaning of the
paper does not change. Therefore, we adopt a more sophisticated semantic measure of
similarity using word vector representations.
For computing similarity, we use titles, abstracts, and author lists. Many previous
works do not use abstracts due to speed, space, and availability limitations, but we are
able to include abstracts in our similarity measures with the development of a custom
database system optimized for large similarity queries. This allows us to incorporate
additional information from abstracts and achieve state-of-the-art matching performance.
First, the title and abstracts, as obtained from the results above, are transformed
into their word vector representations using fastText ([64], RRID:SCR_022301), a library
for learning and generating word vector representations. We trained word vectors, as
opposed to using a pretrained model, because many general English pretrained models
do not contain vectors for domain-specific words, e.g. disease names. While language
models that were trained on scientific text exist, like SciBERT [65], they were too slow
for our purposes. Therefore, we used fastText to generate a set of domain-relevant
vectors and retain high speeds. fastText additionally uses subword information (character n-grams) to guess vectors
for words not present in the training set, which is often a problem when dealing with
highly specific terms that can be found in the biomedical literature. We train our word
vectors by taking a random sample of 10% (using PostgreSQL's TABLESAMPLE clause) of
all abstracts and titles, and train vectors for both titles and abstracts independently.
We train with default hyperparameters and a vector dimension of 300. After training,
we used the model to map all abstracts and titles present in our published paper
database to vectors using fastText's get_sentence_vector() function, which computes a
normalized average across all word vectors present in the abstract or title. For preprints
entered into PreprintMatch, we compute the abstract and title vectors again using the
same process.
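A minimal sketch of this step using the fasttext Python package (the input file name is a placeholder; in practice it would hold one abstract or title per line, exported from the 10% sample):

```python
import fasttext

# Train 300-dimensional word vectors on the sampled text, one document per line.
model = fasttext.train_unsupervised("abstract_sample.txt", dim=300)

# get_sentence_vector() averages the (length-normalized) word vectors of a string.
vec = model.get_sentence_vector("CRISPR-Cas9 screening identifies regulators of cell growth")
print(vec.shape)  # (300,)
```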
After we obtain vectors for the preprint and published papers, cosine similarity is
used to measure similarity. For speed reasons, we save all abstract and title vectors in
separate NumPy (RRID:SCR_008633) matrix files, which are loaded with np.memmap().
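A simplified pure-NumPy sketch of this memory-mapped similarity scan (the file name, matrix size, and random data are placeholders; PreprintMatch performs the same computation with a custom Numba-compiled routine, described next):

```python
import numpy as np

DIM = 300
# Stand-in for the real vector file: one float32 row per published paper.
paper_vecs = np.memmap("abstract_vectors.dat", dtype=np.float32,
                       mode="w+", shape=(1000, DIM))
paper_vecs[:] = np.random.rand(1000, DIM).astype(np.float32)

def top_k_cosine(query_vec, matrix, k=100):
    """Indices of the k rows of `matrix` most cosine-similar to `query_vec`."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = (matrix @ q) / np.linalg.norm(matrix, axis=1)
    return np.argpartition(-sims, k)[:k]

# In practice the query is the preprint's title or abstract vector.
candidates = top_k_cosine(paper_vecs[0], paper_vecs)
```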
Then, we use a custom Numba [66] function to compute cosine similarity between the
preprint's vectors and all vectors in our dataset of published papers. The union of the
top 100 most similar paper titles and abstracts for each preprint is calculated, and
author information is fetched from our local database for these candidate papers. Author
similarity is computed as the Jaccard similarity between the preprint and paper author
sets. Two authors are considered to be matching if their last names match exactly. We
also check the published paper author last names against the preprint author first
names, as we observed author first and last names occasionally being flipped in
preprints. We take whichever of these two scores is higher. Having computed author,
title, and abstract similarities for the 100 most promising paper candidates, a classifier
is used to declare a match or not, taking into account title and abstract cosine similarity
and Jaccard similarity between authors. We trained a support vector machine (SVM)
for this task, using a hand-curated set of 100 matching and 100 non-matching
preprint-paper pairs. The matching pairs were obtained from a random sample of
100 bioRxiv-announced matches. The non-matching pairs were obtained by taking a random
set of bioRxiv preprints with no announced publication, pairing each with the published
paper that had the highest product of abstract and title similarity, and manually
removing pairs that were in fact true matches. This procedure yielded pairs with high
similarity that are nevertheless not matches, so that our SVM is trained to distinguish
between highly similar papers. Our SVM was trained to predict a binary label given
3 inputs (title, abstract, and author similarity) using sklearn.svm.SVC with default
hyperparameters. While this procedure achieves high accuracy and recall, we improve it
slightly with additional hard-coded rules that were chosen after failure analysis. When
the SVM gives a negative result for one of the candidate pairs, a match is still declared if the
first 7 words of the abstracts, or the text before a colon in the title, match exactly
between the preprint and paper. This captures cases when the abstract changes
significantly but the introduction stays exactly the same, and when a specific tool or
method is named in the title (e.g. <Tool name>: a tool for ...), respectively.
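A minimal sketch of the author-similarity computation and the final match decision (the data structures, toy training data, and helper names are assumptions for illustration; the real SVM is fit on the 200 hand-curated pairs described above):

```python
import numpy as np
from sklearn.svm import SVC

def author_similarity(preprint_authors, paper_authors):
    """Jaccard similarity over exactly-matching last names; also compares paper
    last names against preprint first names (names are occasionally flipped)
    and returns whichever score is higher."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0
    pre_last = [n.split()[-1].lower() for n in preprint_authors]
    pre_first = [n.split()[0].lower() for n in preprint_authors]
    pub_last = [n.split()[-1].lower() for n in paper_authors]
    return max(jaccard(pre_last, pub_last), jaccard(pre_first, pub_last))

# Toy training data so the sketch runs; each row is
# [title similarity, abstract similarity, author similarity].
X_toy = np.array([[0.95, 0.90, 0.80], [0.40, 0.30, 0.00]])
y_toy = np.array([1, 0])
clf = SVC().fit(X_toy, y_toy)  # default hyperparameters, as in the text

def is_match(title_sim, abstract_sim, author_sim, preprint, paper):
    """SVM decision plus the two hard-coded fallback rules."""
    if clf.predict([[title_sim, abstract_sim, author_sim]])[0] == 1:
        return True
    # Fallback 1: the first 7 words of the abstracts match exactly.
    if preprint["abstract"].split()[:7] == paper["abstract"].split()[:7]:
        return True
    # Fallback 2: the text before a colon in the title matches exactly
    # (e.g. "<Tool name>: a tool for ...").
    if ":" in preprint["title"] and \
            preprint["title"].split(":")[0] == paper["title"].split(":")[0]:
        return True
    return False
```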