
Thus, the degree of equity in citations across geographic
regions can act as one barometer of fairness
in research. Furthermore, geographic
location correlates strongly with the languages spoken
in an area. Therefore, to extend the reach of NLP
beyond high-resource languages, it is important
to elevate the research pursued in languages from
these under-represented regions.
In this work, we investigate the impact of a researcher's
geographic location on their citability
in the field of NLP. We examine tens of thousands
of articles in the ACL Anthology (AA), a digital
repository of public-domain NLP articles, and
generate citation networks for these papers using
information from Semantic Scholar, to quantify
and better understand citation disparities based
on the geographic location of a researcher. We
consider a set of candidate factors that might impact
the citations a paper receives and perform both qualitative
and quantitative analyses to better understand the
degree to which these factors correlate with high citation counts.
Note, however, that we do not explore
the causes of citation disparities. The reasons
behind such location-based disparities are often
complex, intersectional, and difficult to disentangle.
Through this work, we aim to bring the
attention of the community to geographic disparities
in research. We hope that work in this direction
will inspire actionable steps toward improving geographic
inclusiveness and fairness in research.
2 Dataset
As of January 2022, the ACL Anthology (AA) had
71,568 papers.¹ We extracted the paper title, names of
authors, year of publication, and venue of publication
for each of these papers from the repository.
Further, we used information about the AA papers
in Semantic Scholar² to identify which AA papers
cite which other AA papers, i.e., the AA citation
network. Since the meta-information of the papers
in AA and Semantic Scholar does not include the
affiliation or location of the authors, we developed
a simple heuristic-based approach to obtain affiliation
information from the text of each paper.
We refer to our dataset as the AA Citation Corpus.
It includes the AA citation graph, author
names, unique author IDs (retrieved from Semantic
Scholar), conference or workshop title, month and
year of publication, and the country associated with
the author's affiliation. We make the AA Citation
Corpus freely available.

¹https://aclanthology.org/
²https://www.semanticscholar.org/
Detailed steps for constructing the citation
network and extracting affiliated-country
information are described in the subsections below.
2.1 Citation Graph Construction
To create the citation graph, we collected the BibTeX
entries of all the papers in the Anthology. We
filtered out entries that were not truly research
papers, such as forewords, prefaces, programs,
schedules, indexes, invited talks, appendices,
session information, newsletters, lists of proceedings,
etc. Next, we used the Semantic Scholar
APIs³ to identify the unique Semantic Scholar ID
(SSID) corresponding to each paper in the BibTeX.
For this, we queried the Semantic Scholar APIs in
two ways: (a) using the ACL ID present in the BibTeX,
which ensures that the correct SSID is retrieved for
a paper; and (b) for papers whose SSID
could not be retrieved using the ACL ID, we searched for the
paper using the title mentioned in the BibTeX.
In (b), to ensure the correctness of the retrieved SSID,
we computed a fuzzy string-matching score⁴ between
the title in the BibTeX and the title retrieved from Semantic
Scholar. SSIDs with a fuzzy score greater than 85%
were marked as correct. For the remaining retrieved
SSIDs, we manually compared the title in the BibTeX
with the one retrieved from Semantic Scholar.
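The title-based validation in step (b) can be sketched as follows. This is a minimal illustration that uses Python's standard-library difflib as a stand-in for the fuzzywuzzy scorer referenced in the footnote; the titles shown are hypothetical examples, not papers from the corpus:

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    """Similarity between two titles as a percentage (0-100)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def validate_ssid(bibtex_title: str, ss_title: str, threshold: float = 85.0) -> bool:
    """Accept a retrieved SSID only if the two titles are close enough."""
    return fuzzy_score(bibtex_title, ss_title) >= threshold

# Hypothetical titles: minor formatting differences still match.
print(validate_ssid("Neural Machine Translation of Rare Words",
                    "Neural machine translation of rare words"))
```

In practice, the 85% threshold trades precision for recall; the remaining borderline cases below the threshold are checked manually, as described above.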
We were able to retrieve correct SSIDs for
98.63% of the papers in the ACL Anthology. Finally,
we queried the Semantic Scholar APIs with
the SSID of each AA paper to retrieve the
SSIDs of the papers it cites. With
this information, we created the AA citation graph.
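The final assembly step can be sketched as below. The SSIDs are hypothetical placeholders; in practice, each paper's reference list would come from the Semantic Scholar API rather than a hard-coded dictionary. Edges are kept only between papers inside the corpus, since the AA citation graph covers citations among AA papers:

```python
def build_citation_graph(references: dict[str, list[str]]) -> dict[str, set[str]]:
    """Build a corpus-internal citation graph: keep only edges whose
    target is itself a paper in the corpus."""
    corpus = set(references)
    return {ssid: {cited for cited in cited_list if cited in corpus}
            for ssid, cited_list in references.items()}

# Hypothetical SSIDs for illustration; "x9" is outside the corpus and is dropped.
refs = {
    "p1": ["p2", "p3", "x9"],
    "p2": ["p3"],
    "p3": [],
}
graph = build_citation_graph(refs)
print(graph["p1"])
```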
2.2 Country Information Extraction
We inferred each author's affiliated country from
the textual information extracted from the research
paper PDFs. We used SciPDF Parser⁵, a Python
parser for scientific PDFs, to extract the text.
The section-based parsing provided by this tool lets us
concentrate on the header alone, which contains
information about the authors' affiliations. The
considerable differences in the templates of papers published
across different venues and years presented
several challenges. We first compiled an exhaustive
list of countries and their universities from the web.
³https://www.semanticscholar.org/product/api
⁴https://pypi.org/project/fuzzywuzzy
⁵https://github.com/titipata/scipdf_parser