Wikinformetrics: Construction and description of an open Wikipedia
knowledge graph dataset for informetric purposes
Wenceslao Arroyo-Machado1*, Daniel Torres-Salinas1, Rodrigo Costas2,3
1 Department of Information and Communication Sciences, University of Granada, Granada, Spain
2 Centre for Science and Technology Studies (CWTS), Leiden University, Leiden, The Netherlands
3 DSI-NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy,
Stellenbosch University, Stellenbosch, South Africa
* Corresponding author. Email: wences@ugr.es
Abstract
Wikipedia is one of the most visited websites in the world and is also a frequent subject
of scientific research. However, the analytical possibilities of Wikipedia information have not
yet been examined for a large volume of pages and a wide range of attributes at the same time.
The main objective of this work is to offer a methodological framework and an open knowledge
graph for the large-scale informetric study of Wikipedia. Features of Wikipedia pages are
compared with those of scientific publications to highlight the (di)similarities between the two
types of documents. Based on this comparison, different analytical possibilities that Wikipedia
and its various data sources offer are explored, ultimately offering a set of metrics meant to
study Wikipedia from different analytical dimensions. In parallel, a complete dedicated dataset
of the English Wikipedia was built (and shared) following a relational model. Finally, a
descriptive case study is carried out on the English Wikipedia dataset to illustrate the analytical
potential of the knowledge graph and its metrics.
Keywords
Wikipedia; Informetrics; Scientometrics; Altmetrics; metrics; indicators; knowledge graph;
data; dataset
1. Introduction
Wikipedia was born on January 15, 2001 under the umbrella of Nupedia, an
encyclopedia project whose editing was based on a peer review system. Because Nupedia was
slow to publish articles, Wikipedia was created as a feeder project whose objective was to
make the creation of new articles easier before they were reviewed (History of Wikipedia,
2021). Wikipedia combined in a single project different elements that were new on the web
and that made a universal encyclopedia possible for the first time (Reagle, 2009). It grew
steadily and was successful enough to make Nupedia disappear within two years.
Since then, Wikipedia has become one of the most visited websites in the world
(https://www.semrush.com/website/top/, consulted on August 4, 2022), with 328 different
editions, 285 of which have more than 1000 articles
(https://meta.wikimedia.org/wiki/List_of_Wikipedias, consulted on August 4, 2022). Although
Wikipedia is the most successful project of the Wikimedia Foundation, there are also other
well-known knowledge projects using wikis as a basis (e.g., the Wiktionary dictionary or the
Wikidata knowledge base).
Wikipedia has been a disruptive innovation, with its open nature and decentralized
knowledge development among its key elements (Olleros, 2008). Not only can everyone access
its contents free of charge, but they can also participate in its construction, in a fully transparent
process. This social construction of knowledge can be seen in the differences found among
language editions of the same Wikipedia pages (Hara & Doney, 2015). Wikipedia contents are
also the result of consensus among editors, or wikipedians. This consensus is built in open
discussions on the so-called Wikipedia talk pages (Maki et al., 2017; Yasseri et al., 2012),
which are open to anyone and capture transnational debates around Wikipedia contents (Kopf, 2020).
Some of these discussions and debates have transcended Wikipedia itself (O’Neil, 2017).
As an online encyclopedia, Wikipedia is not exempt from problems. The reliability of its
content has been much debated since it is based on contributions from anonymous individuals
(Olleros, 2008). The quality of Wikipedia pages’ content has been studied numerous times
from different perspectives, especially with regard to medical content pages, pointing out
limitations such as occasional incomplete or imprecise information (C. E. Adams et al., 2020;
Candelario et al., 2017; Weiner et al., 2019). The importance of integrating Wikipedia into
academia, both in its use and in its development, has been highlighted (Jemielniak, 2019).
Social and cultural inequalities have also been pointed out, for example racial and gender gaps
in its biographies (J. Adams et al., 2019; Tripodi, 2021).
Wikipedia is not free of bots and vandalism, although they do not constitute a serious
threat to its contents and reliability, and Wikipedia's policy does not allow detrimental use of
bots or automated accounts. Most of the bots on Wikipedia are publicly identified
(https://en.wikipedia.org/wiki/Special:ListUsers/bot), and they contribute to improving the
content and structure of Wikipedia articles (Arroyo-Machado et al., 2020; Zheng et al., 2019).
Bots also help to control and reduce vandalism and trolling, as they remove
harmful edits to articles before human editors do. There is also no shortage of proposals
for machine learning methods to prevent this type of harmful activity (Martinez-Rico
et al., 2019).
Despite the issues above, the general view is that Wikipedia is a transparent and
reliable source of encyclopedic information (Lageard & Paternotte, 2021) that merits
scientific research in its own right.
1.1. Wikipedia as source for informetric research
Wikipedia has been researched from different scientific perspectives. One of them is
informetrics, which quantitatively studies the contents and activity generated on Wikipedia.
Thus, Wikipedia has been studied from the points of view of scientometrics, bibliometrics and
webometrics, which are discussed in detail below.
Bibliographic references made in Wikipedia have been studied, particularly since the
emergence of the notion of “altmetrics” (Priem et al., 2010), which considered citations on
Wikipedia to scientific literature as part of its realm¹. Wikipedia citations are one of the most
popular sources covered in altmetric aggregators (Ortega, 2020; Zahedi & Costas, 2018) like
Altmetric.com, PlumX or Crossref Event Data. In addition to altmetric data providers, there
are also several other open data sources providing extensive metadata on Wikipedia citations
(Singh et al., 2020; Zagorova et al., 2022). Moreover, other proposals such as Scholia enable
the exploration of bibliographic data at different levels through Wikidata (F. Å. Nielsen et al., 2017).
Table 1 presents a summary of previous studies on Wikipedia bibliographic references.
¹ Wikipedia references had, however, already been studied for years before the birth of altmetrics, for example in the citation analysis by
F. A. Nielsen (2007) or, in a more qualitative way, in that of Mühlhauser and Oser (2008).
Table 1. Main studies on the bibliographic references included in Wikipedia pages.

| Reference | Application | Data | Methodological approach | Language edition | Topic analyzed |
|---|---|---|---|---|---|
| Mühlhauser and Oser (2008) | Content and quality analysis | --- | Check list | German | Health care |
| Candelario et al. (2017) | Content and quality analysis | 33 pages | Scoring system | English | Medication |
| Kaffee and Elsahar (2021) | Analyze the editors' citation process | --- | Survey and interviews | Multilingual | Multidisciplinary |
| F. A. Nielsen (2007) | Analyze citation patterns | 30,368 citations | Descriptive statistics | English | Multidisciplinary |
| Kousha and Thelwall (2017) | Evaluate the impact of references | 36,191 citations | Descriptive statistics | Multilingual | Multidisciplinary |
| Lewoniewski et al. (2017) | References coverage across languages | 6.8 million pages; 41 million citations | Descriptive statistics | Multilingual | Multidisciplinary |
| Maggio et al. (2017) | Analyze citation patterns | 229,857 pages; 1,049,025 citations | Descriptive statistics | English | Medicine |
| Pooladian and Borrego (2017) | Evaluate the impact of references | 982 citations | Descriptive analysis | Multilingual | Multidisciplinary |
| Jemielniak et al. (2019) | Rank journals by citations | 11,325 pages; 137,889 citations | Citation analysis | English | Medicine |
| Torres-Salinas et al. (2019) | Mapping of knowledge structure | 25,555 pages; 41,655 citations | Co-citation analysis | English | Arts & Humanities |
| Arroyo-Machado et al. (2020) | Mapping of knowledge structure | 193,802 pages; 847,512 citations | Co-citation analysis | English | Multidisciplinary |
| Colavizza (2020) | Publications coverage | 3,083 ref. pub. | Topic modeling and regression analysis | English | COVID-19 |
| Nicholson et al. (2021) | Reviewing citation quality | 1,923,575 pages; 824,298 ref. pub. | Classification modeling | English | Multidisciplinary |
| Singh et al. (2020) | Dataset creation | 4 million citations | Text mining | English | Multidisciplinary |
| Zagorova et al. (2022) | Dataset creation | 6,073,708 pages; 55 million citations | Text mining | English | Multidisciplinary |
Kaffee and Elsahar (2021) explored the workflow that wikipedians follow to include
references in Wikipedia articles. Kousha and Thelwall (2017) and Pooladian and Borrego
(2017) described the problems of using Wikipedia citations in performance evaluation. Nicholson et
al. (2021) studied the quality of cited references in Wikipedia. Lewoniewski et al. (2017)
showed that the different language editions of the same Wikipedia page tended to cite common
sources, with the largest overlap between English and German, and with some differences
depending on the topic. Colavizza (2020) studied the coverage of the scientific literature on
COVID-19 on Wikipedia, showing that although only a small percentage of that
literature was covered in Wikipedia, it was sufficiently representative of its various topics.
Arroyo-Machado et al. (2020) and Torres-Salinas et al. (2019) mapped Wikipedia co-citation
patterns, showing fundamental differences in the use of scientific literature in Wikipedia
compared to the academic realm. Bould et al. (2014), Li et al. (2021), and Tomaszewski and
MacDonald (2016) studied citations from scientific publications to Wikipedia articles,
showing that scientific publications also cite Wikipedia content, as well as other
digital encyclopedias, especially in areas such as Chemistry, Physics or Mathematics.
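As an illustration of the co-citation idea mentioned above, the following minimal Python sketch counts how often two references appear together in the reference list of the same Wikipedia page. It is not the procedure used in the cited studies; the function name `cocitation_counts` and the toy data (`Page A`, `ref1`, and so on) are hypothetical and serve only to make the counting step concrete.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(page_refs):
    """Count, for every unordered pair of references, the number of pages citing both."""
    counts = Counter()
    for refs in page_refs.values():
        # Deduplicate and sort so each unordered pair has a single canonical key.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: page title -> identifiers of the references it cites.
page_refs = {
    "Page A": ["ref1", "ref2", "ref3"],
    "Page B": ["ref1", "ref2"],
    "Page C": ["ref2", "ref3", "ref2"],  # a repeated reference counts once per page
}

counts = cocitation_counts(page_refs)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The resulting pair counts can be read as the weighted edges of a co-citation network, which is the kind of structure that mapping studies typically visualize.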
Wikipedia has also been the subject of webometric studies. For example, Wikiometrics
was proposed as a rating system to rank universities or journals based on the features of their
Wikipedia pages, and its results correlated positively with existing academic rankings (Katz &
Rokach, 2017). The importance of Wikipedia pages has also been estimated with the PageRank
algorithm, and the resulting rankings correlated positively with other rankings based on page views
(Thalhammer & Rettinger, 2016). Miquel-Ribé and Laniado (2018) showed that the different
language editions of Wikipedia pages reflect cultural differences, as the contents cover local
topics corresponding to different linguistic regions. Other studies focused on metrics about the
attention generated around Wikipedia articles (e.g., likes or page view counts), showing how
they reflect current topics of interest at a particular time/region (Dzogang et al., 2016;
Mittermeier et al., 2019, 2021; Roll et al., 2016; Vilain et al., 2017), and even demonstrating
the potential of Wikipedia pages to monitor the spread of diseases (Generous et al., 2014).
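To make the PageRank-based notion of page importance mentioned above more concrete, the sketch below runs a plain power iteration over a toy graph of internal links. It is only an assumption-laden illustration, not the setup of the cited study: the function `pagerank`, the damping factor of 0.85, the fixed number of iterations, and the four-page graph are all hypothetical choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over a dict: page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy graph of internal links between four pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C collects links from all other pages and therefore ends up with the highest score. A ranking of this kind could then be compared with a page-view ranking, for example through a rank correlation.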
There are also numerous studies around Wikipedia's informetric features. Wilkinson and
Huberman (2007) found a correlation between the quality of Wikipedia articles and their
number of edits. The relationship between the length of Wikipedia articles and their quality has
been highlighted by Blumenstock (2008). Beyond quality, relationships between Wikipedia
metrics have also been explored. Previous studies found positive correlations between views