Geographic Citation Gaps in NLP Research
Mukund Rungta¹*, Janvijay Singh¹*, Saif M. Mohammad², Diyi Yang³
¹School of Interactive Computing, Georgia Institute of Technology
²National Research Council Canada
³Stanford University
{mrungta8, iamjanvijay}@gatech.edu
saif.mohammad@nrc-cnrc.gc.ca
diyiy@cs.stanford.edu
Abstract
In a fair world, people have equitable oppor-
tunities to education, to conduct scientific re-
search, to publish, and to get credit for their
work, regardless of where they live. However,
it is common knowledge among researchers
that a vast number of papers accepted at top
NLP venues come from a handful of western
countries and (lately) China; whereas, very
few papers from Africa and South America
get published. Similar disparities are also be-
lieved to exist for paper citation counts. In the
spirit of “what we do not measure, we cannot
improve”, this work asks a series of questions
on the relationship between geographical loca-
tion and publication success (acceptance in top
NLP venues and citation impact). We first cre-
ated a dataset of 70,000 papers from the ACL
Anthology, extracted their meta-information,
and generated their citation network. We then
show that not only are there substantial geo-
graphical disparities in paper acceptance and
citation but also that these disparities persist
even when controlling for a number of vari-
ables such as venue of publication and sub-
field of NLP. Further, despite some steps taken
by the NLP community to improve geographical
diversity, we show that the disparity in
publication metrics across locations has continued
to increase since the early 2000s. We release our
code and dataset here:
https://github.com/iamjanvijay/acl-cite-net.
1 Introduction
Progress in science is accelerated by a sharing of
ideas. However, there have been numerous in-
stances in history where the predominance of one
group of people in science, and the silencing of others,
has led to the publication of harmful pseudoscience
(Gould et al., 1996; Saini, 2019). Particularly
egregious examples include the publication of theories
and ideas on racial hierarchy (Plutzer, 2013), male
superiority (Huang et al., 2020), gender binary
(Darwin, 2017), and eugenics (Cottrol, 2015). It has
also been shown that a lack of inclusion in invention
and discovery leads to fewer technologies for the
excluded group. For example, Koning et al. (2021) show
how fewer technologies and health products are designed
for women, and Bender (2011), Bird (2020), and Mohammad
(2019) show how a number of language technologies are
designed for only a small number of languages.
* Equal contribution.
In this paper, we explore geographic inclusion
in Natural Language Processing (NLP) research.
Our premise is that in a fair world, people have
equitable opportunities to education, to conduct
scientific research, and to publish, regardless of
where they live. However, researchers in the field
know that a vast number of papers accepted at top
NLP conferences and journals come from a handful
of western countries and (lately) China. On the
other hand, very few papers with African and South
American authors are published.
Further, the papers that get a majority of cita-
tions tend to be from a small number of institutions.
Highly funded universities and research labs also
tend to garner greater early visibility for their pa-
pers. Some of these papers might be cited more
simply because the affiliated university or lab is
perceived as prestigious (Amara et al., 2015; Hurley
et al., 2013). Price (1965) examined the growth
of citation networks and showed that papers with
more early citations are likely to be cited more in
the future (the “rich get richer” phenomenon).
Citations received by a research article serve as
one of the key quantitative metrics to estimate its
impact. Citation-based metrics, such as the h-index
(Hirsch, 2005; Bornmann and Daniel, 2009), can
have a considerable impact on a researcher’s career,
funding received, and future research collabora-
tions. Citation metrics are also commonly taken
into consideration in determining university rank-
ings and overall scientific outcomes from a country.
Thus, the degree of equity in citations across
geographic regions can act as one of the barometers
of fairness in research. Furthermore, geographic
location directly correlates to the languages spoken
in an area. Therefore, to increase the reach of NLP
beyond high-resource languages, it is important
to elevate the research pursued in languages from
these under-represented regions.
arXiv:2210.14424v1 [cs.CL] 26 Oct 2022
In this work, we investigate the impact of a re-
searcher’s geographic location on their citability
for the field of NLP. We examine tens of thousands
of articles in the ACL Anthology (AA) (a digi-
tal repository of public domain NLP articles), and
generate citation networks for these papers using
information from Semantic Scholar, to quantify
and better understand disparities in citation based
on the geographic location of a researcher. We
consider a set of candidate factors that might im-
pact citations received and perform both qualitative
and quantitative analyses to better understand the
degree to which they correlate with high citations.
However, it should be noted that we do not explore
the causes of citation disparities. The reasons
behind such location-based disparities are often
complex, intersectional, and difficult to disentangle.
Through this work, we aim to draw the community’s
attention to geographic disparities in research.
We hope that work in this direction
will inspire actionable steps to improve geographic
inclusiveness and fairness in research.
2 Dataset
As of January 2022, the ACL Anthology (AA) had
71,568 papers.¹ We extracted the paper title, names of
authors, year of publication, and venue of publication
for each of these papers from the repository.
Further, we used information about the AA papers
in Semantic Scholar² to identify which AA papers
cite which other AA papers (the AA citation network).
Since the meta-information of the papers
in AA and Semantic Scholar does not include the
affiliation or location of the authors, we developed
a simple heuristic-based approach to obtain affiliation
information from the text of the paper.
We refer to our dataset as the AA Citation Corpus.
It includes the AA citation graph, author
names, unique author ids (retrieved from Semantic
Scholar), conference or workshop title, month and
year of publication, and country associated with
the author’s affiliation. We make the AA Citation
Corpus freely available.
¹ https://aclanthology.org/
² https://www.semanticscholar.org/
Detailed steps in the construction of the citation
network and the extraction of affiliated country
information are described in the subsections below.
2.1 Citation Graph Construction
To create the citation graph, we collected the BibTeX
entries of all the papers in the anthology. We
filtered out entries that were not truly research
papers, such as forewords, prefaces, programs,
schedules, indexes, invited talks, appendices,
session information, newsletters, lists of proceedings,
etc. Next, we used the Semantic Scholar APIs³ to
identify the unique Semantic Scholar ID (SSID)
corresponding to each paper in the BibTeX. For this,
we queried the Semantic Scholar APIs in two ways:
(a) using the ACL ID present in the BibTeX, which
ensures that the correct SSID is retrieved for a
paper; and (b) for papers whose SSID could not be
retrieved using the ACL ID, searching by the paper
title given in the BibTeX. In (b), to ensure the
correctness of the retrieved SSID, we computed the
fuzzy string matching score⁴ between the title in
the BibTeX and the title retrieved from Semantic
Scholar. SSIDs with a fuzzy score greater than 85%
were marked as correct. For the remaining retrieved
SSIDs, we manually compared the title in the BibTeX
with the one retrieved from Semantic Scholar.
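The title check in step (b) can be sketched as follows. This is a minimal illustration only, not the authors' code: the paper uses the fuzzywuzzy package, whereas this sketch substitutes Python's standard-library `difflib.SequenceMatcher` as a stand-in, so its scores will not be identical to fuzzywuzzy's. The function names and the lowercasing and whitespace normalization are assumptions made for this example; only the 85% threshold and the fallback to manual comparison come from the text.

```python
from difflib import SequenceMatcher

# Threshold from the text: scores strictly greater than 85% count as correct.
MATCH_THRESHOLD = 85.0


def title_similarity(bibtex_title: str, retrieved_title: str) -> float:
    """Return a 0-100 similarity score between two titles.

    Assumption: titles are lowercased and whitespace-normalized first.
    difflib is a stand-in here for the fuzzywuzzy scorer used in the paper.
    """
    a = " ".join(bibtex_title.lower().split())
    b = " ".join(retrieved_title.lower().split())
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def classify_match(bibtex_title: str, retrieved_title: str,
                   threshold: float = MATCH_THRESHOLD) -> str:
    """Accept the retrieved SSID automatically or route it to manual review."""
    score = title_similarity(bibtex_title, retrieved_title)
    return "accepted" if score > threshold else "manual_review"
```

Titles that differ only in case or spacing score 100 and are accepted, while anything at or below the threshold is set aside for the manual comparison described above.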
We were able to retrieve correct SSIDs for 98.63%
of the papers in the ACL Anthology. Finally, we
queried the Semantic Scholar APIs with the SSIDs
of each of the AA papers to retrieve the SSIDs of
the papers cited in the AA papers. With this
information, we created the AA citation graph.
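Once each AA paper's cited SSIDs are available, assembling the graph is a filtering step. The sketch below is illustrative and assumes the reference lists have already been fetched into a plain dictionary; the names `build_citation_graph`, `citation_counts`, and the input layout are hypothetical and are not taken from the released code.

```python
def build_citation_graph(references_by_ssid, aa_ssids):
    """Build a directed graph: citing AA paper -> set of cited AA papers.

    references_by_ssid: dict mapping an SSID to the list of SSIDs it cites
                        (hypothetical layout, assumed already retrieved).
    aa_ssids: iterable of SSIDs that belong to the ACL Anthology.
    Citations to papers outside the anthology are dropped, since the AA
    citation network only records which AA papers cite which other AA papers.
    """
    aa = set(aa_ssids)
    graph = {ssid: set() for ssid in aa}
    for citing, cited_list in references_by_ssid.items():
        if citing not in aa:
            continue
        for cited in cited_list:
            if cited in aa:
                graph[citing].add(cited)
    return graph


def citation_counts(graph):
    """Count, for every paper in the graph, how many AA papers cite it."""
    counts = {ssid: 0 for ssid in graph}
    for cited_set in graph.values():
        for cited in cited_set:
            counts[cited] += 1
    return counts
```

Storing cited papers in a set means a reference listed twice for the same paper is counted once.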
2.2 Country Information Extraction
We inferred the authors’ affiliated countries from
the textual information extracted from the research
paper PDFs. We used SciPDF Parser,⁵ a Python
parser for scientific PDFs, to extract text from each
PDF. Section-based parsing by this tool helps us
concentrate only on the header, which contains
information about the authors’ affiliations. The
considerable differences in templates of papers
published across different venues and years presented
several challenges. We first compiled an exhaustive
list of countries and their universities from the web.
(Details in Appendix C.) For each paper, we examined
the affiliation section to identify mentions of a
country (using our list of countries).⁶ Using this
approach, we were able to map each paper to its
affiliated country or countries. Table 1 shows the
number of papers having n country tags, where
$n \in \{0, 1, {>}1\}$ represents no country, one
country, and multiple countries, respectively.
Further, as the mapping of papers to countries was
automatically constructed, the authors manually
annotated the ground truth country tags for 1000
papers, selected at random from the dataset, to
analyze the correctness of the automatically
identified country tags. Out of 1000 papers, country
tags for 845 (84.5%) exactly match the ground truth.
For most of the remaining unmatched cases, the
algorithm either missed one country from the list
or was unable to find any country tag for the paper.
³ https://www.semanticscholar.org/product/api
⁴ https://pypi.org/project/fuzzywuzzy
⁵ https://github.com/titipata/scipdf_parser
# Countries                Number of papers
0: no country              14,818
1: one country             48,815
>1: multiple countries     7,062
Table 1: Count of papers by the number of automatically
inferred affiliated countries.
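The country-tagging heuristic and its exact-match evaluation might look roughly like the sketch below. Everything concrete in it is an assumption for illustration: the tiny `COUNTRY_ALIASES` and `UNIVERSITY_TO_COUNTRY` tables are hypothetical stand-ins for the compiled lists the paper describes, and using the university list as a fallback lookup is a guess at how such a list could be applied, not a description of the authors' implementation.

```python
import re

# Hypothetical, deliberately tiny lookup tables (the paper's lists are far larger).
COUNTRY_ALIASES = {
    "United States": ["united states", "usa"],
    "Canada": ["canada"],
    "India": ["india"],
    "Germany": ["germany"],
}
UNIVERSITY_TO_COUNTRY = {
    "georgia institute of technology": "United States",
    "stanford university": "United States",
}


def infer_countries(header_text):
    """Return the set of countries mentioned in a paper's affiliation header."""
    text = header_text.lower()
    found = set()
    for country, aliases in COUNTRY_ALIASES.items():
        for alias in aliases:
            # Word boundaries keep "india" from matching inside "indiana".
            if re.search(r"\b" + re.escape(alias) + r"\b", text):
                found.add(country)
    # Assumed fallback: map a known university name to its country.
    for university, country in UNIVERSITY_TO_COUNTRY.items():
        if university in text:
            found.add(country)
    return found


def exact_match_accuracy(predicted, gold):
    """Fraction of papers whose predicted country set equals the gold set."""
    matches = sum(1 for p, g in zip(predicted, gold) if set(p) == set(g))
    return matches / len(gold)
```

Under exact matching, a paper with one missed country counts as an error in the same way as a paper with no tag at all, which is consistent with the two failure modes reported for the 1000-paper sample.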
3 Disparity in Citation based on Location
We use the AA Citation Corpus to answer a series of
questions on the disparity of publications and citations
across geographic locations. We start with a look at
the number of publications from around the world,
followed by an examination of their citations.
Q1. Is there a disparity in the number of NLP
publications across different countries? How
does the number of publications correlate with
linguistic diversity?
A. We used counts of papers from the AA Citation
Corpus to determine the number of papers from
each country, as visualized in Figure 1. For an even
coarser view, we also examined a partition of the
world into ten regions.⁷ We calculated the total
number of papers from each region by aggregating
papers from all countries present in this region. We
also aggregated citation counts of papers by region.
⁶ Even if an author has multiple affiliations (countries), we
only consider the ones mentioned in the paper.
⁷ One can partition the world map into regions in many
ways. We made use of the partition provided by the United
Nations Geo-scheme:
https://en.wikipedia.org/wiki/List_of_countries_by_United_Nations_geoscheme.
This list includes seventeen subregions, and we combine some
of these subregions into ten coarser regions for simplicity.
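The region-level aggregation can be sketched as below. The region names in `COUNTRY_TO_REGION`, the per-paper record layout, and the choice to count a multi-country paper once in each region it touches are all assumptions made for this example; the text does not spell out the ten regions or how multi-region papers are handled.

```python
from collections import Counter

# Hypothetical country-to-region mapping; the region names are illustrative only.
COUNTRY_TO_REGION = {
    "United States": "Northern America",
    "Canada": "Northern America",
    "Nigeria": "Africa",
    "Brazil": "South America",
}


def aggregate_by_region(papers, country_to_region):
    """Aggregate paper counts and citation counts per region.

    papers: list of dicts with keys "countries" (list of country names)
            and "citations" (int). This layout is assumed for illustration.
    Assumption: a paper is counted once per region it is affiliated with,
    even if several of its countries fall in the same region.
    """
    paper_counts = Counter()
    citation_totals = Counter()
    for paper in papers:
        regions = {country_to_region[c] for c in paper["countries"]
                   if c in country_to_region}
        for region in regions:
            paper_counts[region] += 1
            citation_totals[region] += paper["citations"]
    return paper_counts, citation_totals
```

Papers with no inferred country contribute to no region, mirroring the "no country" row of Table 1.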
Discussion: Figure 1 shows huge disparities in
the number of publications among countries.
The Western world, which includes the United States,
Canada, the United Kingdom, France, Germany, etc.,
dominates the network with high publication counts.
On the other hand, most countries in Africa, South
America, Eastern Europe, South East Asia, and
the Middle East remain in the red zone, with very few
publications to date. When examining language
diversity⁸ (indicated by the size of the yellow dot), we
see that countries in the red zone have the highest
language diversity. Higher linguistic diversity indicates
a larger number of different languages spoken
in that geographic region. For example, the list of
countries with the highest number of languages includes
Indonesia (710), Nigeria (524), India (453),
and Brazil (228).⁹
More work on these languages is needed, by local
researchers in partnership with the language
communities. One recent effort in this regard
is the Masakhane project, a grassroots organisation
whose mission is to strengthen and spur NLP
research in African languages, for Africans, by
Africans.¹⁰ This analysis showcases the huge
disparity in the number of publications from each
country. Through the questions ahead, we further
uncover geographic patterns in citations across these
mid-tier and top-tier publishing countries.
Q2. How has the citation count ("influence" of
NLP research) of papers from different regions
changed over the years?
A. To study this question, we examine the following
metric: the mean citation count per paper for each
country until a certain year. Formally, this metric can
be defined as follows:
$$MC(j,k) = \frac{\sum_{i \in P_k} C_k(i)\, I_{ij}}{\sum_{i \in P_k} I_{ij}}$$
where $MC(j,k)$ indicates the mean citation count of
country $j$ until year $k$, $C_k(i)$ indicates the citation
count of paper $i$ until year $k$, and $I_{ij}$ is 1 if paper $i$
belongs to country $j$ and 0 otherwise. $P_k$ indicates
⁸ https://en.wikipedia.org/wiki/Linguistic_diversity_index
⁹ https://en.wikipedia.org/wiki/Number_of_languages_by_country
¹⁰ https://www.masakhane.io/
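The mean-citation metric MC(j, k) from Q2 can be computed along the following lines. This is a sketch under stated assumptions: the definition of P_k is cut off in this excerpt, so the code assumes it is the set of papers published up to and including year k, and it assumes C_k(i) counts citations from citing papers published up to year k. The record layout (`year`, `countries`, `citation_years`) and the function name are hypothetical.

```python
def mean_citation_count(papers, country, year):
    """Compute MC(j, k): mean citations per paper for a country until a year.

    papers: list of dicts with keys
        "year"           - publication year of the paper,
        "countries"      - set of affiliated countries (the indicator I_ij),
        "citation_years" - publication years of the papers citing it.
    Assumption: P_k is taken to be the papers published in or before `year`.
    Returns None when the country has no such papers (empty denominator).
    """
    eligible = [p for p in papers
                if p["year"] <= year and country in p["countries"]]
    if not eligible:
        return None
    # C_k(i): citations received from papers published in or before `year`.
    total = sum(sum(1 for y in p["citation_years"] if y <= year)
                for p in eligible)
    return total / len(eligible)
```

A paper affiliated with several countries contributes to the numerator and denominator of each of them, which follows directly from the indicator in the formula.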