
Geographic Citation Gaps in NLP Research

Mukund Rungta♢♣, Janvijay Singh♢♣, Saif M. Mohammad♡, Diyi Yang◊

♢School of Interactive Computing, Georgia Institute of Technology

♡National Research Council Canada

◊Stanford University

{mrungta8, iamjanvijay}@gatech.edu

saif.mohammad@nrc-cnrc.gc.ca

diyiy@cs.stanford.edu

Abstract

In a fair world, people have equitable opportunities to education, to conduct scientific research, to publish, and to get credit for their work, regardless of where they live. However, it is common knowledge among researchers that a vast number of papers accepted at top NLP venues come from a handful of western countries and (lately) China; whereas, very few papers from Africa and South America get published. Similar disparities are also believed to exist for paper citation counts. In the spirit of “what we do not measure, we cannot improve”, this work asks a series of questions on the relationship between geographical location and publication success (acceptance in top NLP venues and citation impact). We first created a dataset of 70,000 papers from the ACL Anthology, extracted their meta-information, and generated their citation network. We then show that not only are there substantial geographical disparities in paper acceptance and citation but also that these disparities persist even when controlling for a number of variables such as venue of publication and sub-field of NLP. Further, despite some steps taken by the NLP community to improve geographical diversity, we show that the disparity in publication metrics across locations is still on an increasing trend since the early 2000s. We release our code and dataset here: https://github.com/iamjanvijay/acl-cite-net.

1 Introduction

Progress in science is accelerated by a sharing of ideas. However, there have been numerous instances in history where the predominance of one group of people in science, and the silencing of others, has led to the publication of harmful pseudoscience (Gould et al., 1996; Saini, 2019). Particularly egregious examples include the publication of theories and ideas on racial hierarchy (Plutzer, 2013), male superiority (Huang et al., 2020), gender binary (Darwin, 2017), and eugenics (Cottrol, 2015). It has also been shown that a lack of inclusion in invention and discovery leads to fewer technologies for the excluded group. For example, Koning et al. (2021) show how fewer technologies and health products are designed for women, and Bender (2011), Bird (2020) and Mohammad (2019) show how a number of language technologies are designed for only a small number of languages.

♣ Equal contribution.

In this paper, we explore geographic inclusion in Natural Language Processing (NLP) research. Our premise is that in a fair world, people have equitable opportunities to education, to conduct scientific research, and to publish, regardless of where they live. However, researchers in the field know that a vast number of papers accepted at top NLP conferences and journals come from a handful of western countries and (lately) China. On the other hand, very few papers with African and South American authors are published.

Further, the papers that get a majority of citations tend to be from a small number of institutions. Highly funded universities and research labs also tend to garner greater early visibility for their papers. Some of these papers might be cited more simply because the affiliated university or lab is perceived as prestigious (Amara et al., 2015; Hurley et al., 2013). Price (1965) examined the growth of citation networks and showed that papers with more early citations are likely to be cited more in the future (the “rich get richer” phenomenon).

Citations received by a research article serve as one of the key quantitative metrics to estimate its impact. Citations-based metrics, such as h-index (Hirsch, 2005; Bornmann and Daniel, 2009), can have a considerable impact on a researcher’s career, funding received, and future research collaborations. Citation metrics are also commonly taken into consideration in determining university rankings and overall scientific outcomes from a country. Thus, the degree of equity in citations across geographic regions can act as one of the barometers of fairness in research. Furthermore, geographic location directly correlates to the languages spoken in an area. Therefore, to increase the reach of NLP beyond high-resource languages, it is important to elevate the research pursued in languages from these under-represented regions.

arXiv:2210.14424v1 [cs.CL] 26 Oct 2022

In this work, we investigate the impact of a researcher’s geographic location on their citability for the field of NLP. We examine tens of thousands of articles in the ACL Anthology (AA) (a digital repository of public domain NLP articles), and generate citation networks for these papers using information from Semantic Scholar, to quantify and better understand disparities in citation based on the geographic location of a researcher. We consider a set of candidate factors that might impact citations received and perform both qualitative and quantitative analyses to better understand the degree to which they correlate with high citations.

However, it should be noted that we do not explore the cause of citation disparities. Reasons behind such location-based disparities are often complex, intersectional, and difficult to disentangle. Through this work we aim to bring the community's attention to geographic disparities in research. We hope that work in this direction will inspire actionable steps to improve geographic inclusiveness and fairness in research.

2 Dataset

As of January 2022, the ACL Anthology (AA) had 71,568 papers.¹ We extracted the paper title, names of authors, year of publication, and venue of publication for each of these papers from the repository. Further, we used information about the AA papers in Semantic Scholar² to identify which AA papers cite which other AA papers (the AA citation network). Since the meta-information of the papers in AA and Semantic Scholar does not include the affiliation or location of the authors, we developed a simple heuristic-based approach to obtain affiliation information from the text of the paper.

We refer to our dataset as the AA Citation Corpus. It includes the AA citation graph, author names, unique author ids (retrieved from Semantic Scholar), conference or workshop title, month and year of publication, and country associated with the author’s affiliation. We make the AA Citation Corpus freely available.

Detailed steps in the construction of the citation network and the extraction of affiliated country information are described in the subsections below.

1 https://aclanthology.org/

2 https://www.semanticscholar.org/

2.1 Citation Graph Construction

To create the citation graph, we collected the BibTeX entries of all the papers in the anthology. We filtered out the entries that were not truly research papers, such as forewords, prefaces, programs, schedules, indexes, invited talks, appendices, session information, newsletters, lists of proceedings, etc. Next, we used the Semantic Scholar APIs³ to identify the unique Semantic Scholar ID (SSID) corresponding to each paper in the BibTeX. For this, we queried the Semantic Scholar APIs in two ways: (a) using the ACL ID present in BibTeX, which ensures that the correct SSID was retrieved for a paper in BibTeX; and (b) for papers whose SSID could not be retrieved using the ACL ID, we searched for the paper using the paper title mentioned in BibTeX. In (b), to ensure correctness of the retrieved SSID, we take the fuzzy string matching score⁴ between the title in BibTeX and that retrieved from Semantic Scholar. SSIDs with a fuzzy score greater than 85% are marked as correct. For the remaining retrieved SSIDs, we manually compared the title in BibTeX and the one retrieved from Semantic Scholar.

We were able to retrieve correct SSIDs for 98.63% of the papers in the ACL Anthology. Finally, we queried the Semantic Scholar APIs with the SSIDs of each of the AA papers to retrieve the SSIDs of the papers cited in the AA papers. With this information, we created the AA citation graph.
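
The title-verification step in (b) can be sketched as follows. This is a minimal illustration, not the authors' code: it uses Python's standard-library `difflib.SequenceMatcher` as a stand-in for the fuzzywuzzy score, and the function names `title_similarity` and `classify_match` are hypothetical. Only the 85% threshold comes from the text.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two titles.

    Assumption: titles are compared case-insensitively with whitespace
    collapsed; the paper does not state its normalization.
    """
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def classify_match(bibtex_title: str, retrieved_title: str,
                   threshold: float = 85.0) -> str:
    """Accept a retrieved SSID automatically when the title score exceeds
    the threshold; otherwise flag it for manual comparison."""
    score = title_similarity(bibtex_title, retrieved_title)
    return "accept" if score > threshold else "manual_review"


if __name__ == "__main__":
    print(classify_match("Attention Is All You Need",
                         "Attention is all you need"))
    print(classify_match("Geographic Citation Gaps in NLP Research",
                         "A Survey of Parsing"))
```

Routing low-scoring pairs to manual review, rather than discarding them, mirrors the two-stage check described above.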

2.2 Country Information Extraction

We inferred the authors’ affiliated country from the textual information extracted from the research paper PDFs. We used SciPDF Parser,⁵ a Python parser for scientific PDFs, to extract text from each PDF. Section-based parsing by this tool lets us concentrate only on the header, which contains information about the authors’ affiliations. The considerable differences in templates of papers published across different venues and years presented several challenges. We first compiled an exhaustive list of countries and their universities from the web. (Details in Appendix C.) For each paper, we examine the affiliation section to identify mentions of a country (using our list of countries).⁶ Using this approach, we mapped each paper to its affiliated country or countries. Table 1 shows the number of papers having n country tags, where n ∈ {0, 1, >1} represents no country, one country, and multiple countries, respectively.

# Countries                Number of papers
0: no country              14,818
1: one country             48,815
>1: multiple countries     7,062

Table 1: Count of papers by the number of automatically inferred affiliated countries.

Further, as the mapping of papers to countries was automatically constructed, the authors manually annotated the ground-truth country tag for 1000 papers. This was done to analyze the correctness of the automatically identified country tags. These papers were selected at random from the dataset. Out of 1000 papers, country tags for 845 (84.5%) exactly match the ground truth. For most of the remaining unmatched cases, the algorithm either missed one country from the list or was unable to find any country tag for the paper.

3 https://www.semanticscholar.org/product/api

4 https://pypi.org/project/fuzzywuzzy

5 https://github.com/titipata/scipdf_parser
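
A rough sketch of the country-tagging heuristic is given below, under stated assumptions. The lookup tables are tiny hypothetical stand-ins for the exhaustive lists mentioned above, and mapping university names to countries is an assumption about how the university list might be used; the text only confirms matching against a list of countries. The header text is assumed to have been extracted already.

```python
from __future__ import annotations

import re

# Hypothetical, tiny lookup tables for illustration only.
COUNTRIES = ["United States", "Canada", "India", "Germany"]
UNIVERSITY_TO_COUNTRY = {
    "Georgia Institute of Technology": "United States",
    "Stanford University": "United States",
    "National Research Council Canada": "Canada",
}


def infer_countries(header_text: str) -> set[str]:
    """Return the set of countries mentioned, directly or via a known
    institution name, in a paper's affiliation header."""
    lowered = header_text.lower()
    found: set[str] = set()
    for country in COUNTRIES:
        # Word boundaries avoid matching a country name inside a longer word.
        if re.search(r"\b" + re.escape(country.lower()) + r"\b", lowered):
            found.add(country)
    for university, country in UNIVERSITY_TO_COUNTRY.items():
        if university.lower() in lowered:
            found.add(country)
    return found


def country_bucket(countries: set[str]) -> str:
    """Map a set of country tags to the row labels of Table 1."""
    if not countries:
        return "0"
    return "1" if len(countries) == 1 else ">1"


if __name__ == "__main__":
    header = ("School of Interactive Computing, Georgia Institute of "
              "Technology; National Research Council Canada")
    tags = infer_countries(header)
    print(sorted(tags), country_bucket(tags))
```

Returning a set keeps multi-country papers distinguishable from single-country ones, which is what the three rows of Table 1 count.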

3 Disparity in Citation based on Location

We use the AA Citation Corpus to answer a series of questions on the disparity of publications and citations across geographic locations. We start with a look at the number of publications from around the world, followed by an examination of their citations.

Q1. Is there a disparity in the number of NLP publications across different countries? How does the number of publications correlate with linguistic diversity?

We used counts of papers from the AA Citation Corpus to determine the number of papers from each country, as visualized in Figure 1. For an even coarser view, we also examined a partition of the world into ten regions.⁷ We calculated the total number of papers from each region by aggregating papers from all countries present in this region. We also aggregate citation counts of papers by region.

6 Even if an author has multiple affiliations (countries), we only consider the ones mentioned in the paper.

7 One can partition the world map into regions in many ways. We made use of the partition provided by the United Nations Geo-scheme (https://en.wikipedia.org/wiki/List_of_countries_by_United_Nations_geoscheme). This list includes seventeen subregions, and we combine some of these subregions into ten coarser regions for simplicity.
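
The region-level aggregation can be illustrated with the short sketch below. The country-to-region mapping and region names here are hypothetical examples, not the paper's ten regions, and counting a multi-country paper once per distinct region is an assumption the text does not confirm.

```python
from collections import Counter

# Hypothetical mapping for illustration; the paper derives ten regions
# from the United Nations geoscheme.
COUNTRY_TO_REGION = {
    "United States": "Northern America",
    "Canada": "Northern America",
    "India": "Southern Asia",
    "Nigeria": "Africa",
}


def aggregate_by_region(papers):
    """Return (paper_counts, citation_counts) keyed by region.

    Each paper is a dict with a "countries" set and a "citations" int.
    Assumption: a paper counts once for every distinct region it touches.
    """
    paper_counts = Counter()
    citation_counts = Counter()
    for paper in papers:
        regions = {COUNTRY_TO_REGION[c] for c in paper["countries"]
                   if c in COUNTRY_TO_REGION}
        for region in regions:
            paper_counts[region] += 1
            citation_counts[region] += paper["citations"]
    return paper_counts, citation_counts


if __name__ == "__main__":
    demo = [
        {"countries": {"United States", "Canada"}, "citations": 10},
        {"countries": {"India"}, "citations": 4},
    ]
    print(aggregate_by_region(demo))
```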

Discussion

Figure 1 shows huge disparities in the number of publications among countries. The western world, which includes the United States, Canada, the United Kingdom, France, Germany, etc., dominates the network with high publication counts. On the other hand, most countries in Africa, South America, Eastern Europe, South East Asia, and the Middle East remain in the red zone, with very few publications to date. When examining language diversity⁸ (indicated by the size of the yellow dot), we see that countries in the red zone have the highest language diversity. Higher linguistic diversity indicates a larger number of different languages spoken in that geographic region. For example, the list of countries with the highest number of languages includes Indonesia (710), Nigeria (524), India (453), and Brazil (228).⁹

More work on these languages is needed, by local researchers in partnership with the language communities. One recent effort in this regard is project Masakhane, a grassroots organisation whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans.¹⁰ This analysis showcases the huge disparity in the number of publications from each country. Through the questions ahead, we further uncover geographic patterns in citations across these mid-tier and top-tier publishing countries.

Q2. How has the citation count ("influence" of NLP research) of papers from different regions changed over the years?

To study this question, we examine the following metric: mean citation count per paper for each country until a certain year. Formally, this metric can be defined as follows:

$$MC(j,k) = \frac{\sum_{i \in P_k} C_k(i)\, I_{i \in j}}{\sum_{i \in P_k} I_{i \in j}}$$

where $MC(j,k)$ indicates the mean citation count of country $j$ until year $k$, $C_k(i)$ indicates the citation count of paper $i$ until year $k$, and $I_{i \in j}$ is $1$ if paper $i$ belongs to country $j$, otherwise $0$. $P_k$ indicates

8 https://en.wikipedia.org/wiki/Linguistic_diversity_index

9 https://en.wikipedia.org/wiki/Number_of_languages_by_country

10 https://www.masakhane.io/
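
The mean-citation metric MC(j, k) defined for Q2 can be computed along the following lines. This is a sketch under assumptions: the set P_k is taken to be the papers published up to year k (its definition is cut off in the text above), and each paper record is a hypothetical dict holding its publication year, its country tags, and the publication years of the papers that cite it.

```python
def mean_citation_count(papers, country, year):
    """Compute MC(j, k) for country j = `country` and year k = `year`.

    Numerator: citations received up to `year` by papers tagged with
    `country`. Denominator: the number of such papers. Returns None when
    the country has no papers up to `year` (the ratio is undefined).
    """
    total_citations = 0
    paper_count = 0
    for paper in papers:
        # Assumed P_k: only papers published up to year k are considered.
        if paper["year"] > year:
            continue
        # Indicator I_{i in j}: skip papers not tagged with this country.
        if country not in paper["countries"]:
            continue
        # C_k(i): citations whose citing paper appeared up to year k.
        total_citations += sum(1 for y in paper["citation_years"] if y <= year)
        paper_count += 1
    return total_citations / paper_count if paper_count else None


if __name__ == "__main__":
    demo = [
        {"year": 2010, "countries": {"India"}, "citation_years": [2011, 2012, 2015]},
        {"year": 2012, "countries": {"India", "Canada"}, "citation_years": [2013]},
    ]
    print(mean_citation_count(demo, "India", 2013))
```

Evaluating the function for a fixed country over increasing years gives the kind of time series that Q2 asks about.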
