
information (Janowicz et al., 2018) and leads to higher barriers for female personalities to receive a biographic entry (Beytía et al., 2022). In an experiment, Demartini (2019) asked crowd contributors to provide a factual answer to the (politically charged) question of whether or not Catalonia is a part of Spain. The diverging responses indicated that participants' beliefs about what counts as true differed considerably. This is an example of bias that goes beyond the subliminal psychological level: here, structural aspects such as consumed media and social discourse play an important role. To counter this problem, Demartini (2019) suggests actively asking contributors for evidence supporting their statements, as well as keeping track of their demographic backgrounds. This makes underlying motivations and possible sources of bias traceable.
3.2 Ontologies: Manual Creation of Rules
Ontologies determine rules regarding allowed types of entities and relations, or their usage. They are often hand-crafted and a source of bias (Janowicz et al., 2018) due to the influence of opinions, motivations, and personal choices (Keet, 2021): factors such as scientific opinions (e.g., historical ideas about race), socio-culture (e.g., how many people a person can be married to), or political and religious views (e.g., classifying a person of type X as a terrorist or a protestor) can lead directly to an encoding of social bias. Structural constraints, such as an ontology's granularity level, can also induce bias (Keet, 2021). Furthermore, issues can arise from the types of information used to characterize a person entity: whether or not a person is attributed with their skin color could, in theory, determine the emergence of racist bias in a downstream application (Paparidis and Kotis, 2021). Geller and Kollapally (2021) give a practical example of detecting and alleviating ontology bias in a real-world scenario. The authors discovered that ontological gaps in the medical context led to an under-reporting of race-specific incidents, and they were able to suggest countermeasures based on a structured analysis of real incidents and external terminological resources.
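To make such design decisions concrete, the following sketch shows two alternative person schemas built with rdflib. It is a minimal illustration of the kind of modelling choice discussed above, not a fragment of any ontology from the cited works; all class and property names are invented for this example.

```python
# Minimal sketch (not from the cited works): two alternative person schemas
# built with rdflib, illustrating that the choice of attributes used to
# characterize a person is an explicit ontology design decision.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/schema/")  # hypothetical namespace

def person_schema(include_ethnicity: bool) -> Graph:
    g = Graph()
    g.bind("ex", EX)
    g.add((EX.Person, RDF.type, RDFS.Class))
    # Attributes shared by both variants.
    for prop in (EX.name, EX.birthDate, EX.occupation):
        g.add((prop, RDF.type, RDF.Property))
        g.add((prop, RDFS.domain, EX.Person))
    if include_ethnicity:
        # Whether this property exists at all is a modelling choice that can
        # invite biased usage in downstream applications.
        g.add((EX.ethnicity, RDF.type, RDF.Property))
        g.add((EX.ethnicity, RDFS.domain, EX.Person))
    return g

if __name__ == "__main__":
    print(person_schema(include_ethnicity=True).serialize(format="turtle"))
```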
3.3 Extraction: Automated Extraction of
Information
Natural language processing (NLP) methods can be used to recognize and extract entities (named entity recognition; NER) and their relations (relation extraction; RE), which are then represented as [head entity, relation, tail entity] tuples (or as [subject, predicate, object], respectively).
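As a minimal illustration of this representation (the field names and example facts are our own, not the output of any particular NER/RE system), extracted statements can be stored as simple typed triples:

```python
# Minimal sketch of the [head entity, relation, tail entity] representation;
# the example triples are illustrative, not system output.
from typing import NamedTuple

class Triple(NamedTuple):
    head: str      # subject entity
    relation: str  # predicate
    tail: str      # object entity

extracted = [
    Triple("Marie Curie", "birthPlace", "Warsaw"),
    Triple("Marie Curie", "spouse", "Pierre Curie"),
]

for t in extracted:
    print(f"({t.head}, {t.relation}, {t.tail})")
```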
Mehrabi et al. (2020) showed that the NER system CoreNLP (Manning et al., 2014) exhibits binary gender bias. They used a number of template sentences, such as "<Name> is going to school" or "<Name> is a person", filled with male and female names⁵ drawn from 139 years of census data. The model returned more erroneous tags for female names.
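A template-based probe of this kind can be reproduced in a few lines. The sketch below uses spaCy (assuming an installed English model); the templates and name lists are illustrative placeholders, not the census-derived data used by Mehrabi et al., and the miss-rate metric is a simplification of their evaluation.

```python
# Hedged sketch of a template-based NER bias probe in the spirit of
# Mehrabi et al. (2020); templates and name lists are placeholders.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

TEMPLATES = ["{name} is going to school.", "{name} is a person."]
NAMES = {
    "female": ["Emma", "Olivia", "Charlotte"],
    "male": ["Liam", "Noah", "Oliver"],
}

def person_miss_rate(names: list[str]) -> float:
    """Fraction of filled templates in which the name is NOT tagged as PERSON."""
    misses, total = 0, 0
    for name in names:
        for template in TEMPLATES:
            doc = nlp(template.format(name=name))
            tagged = any(ent.label_ == "PERSON" and name in ent.text
                         for ent in doc.ents)
            misses += 0 if tagged else 1
            total += 1
    return misses / total

for group, names in NAMES.items():
    print(group, round(person_miss_rate(names), 3))
```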
Similarly, Mishra et al. (2020) created synthetic sentences from adjusted Winogender (Rudinger et al., 2018) templates with names associated with different ethnicities and genders. A range of NER systems was evaluated: bidirectional LSTMs with a Conditional Random Field layer (BiLSTM-CRF) (Huang et al., 2015) on GloVe (Pennington et al., 2014), ConceptNet (Speer et al., 2017), and ELMo (Peters et al., 2017) embeddings, as well as CoreNLP and spaCy⁶ NER models. Across models, non-white names yielded lower performance scores on average than white names. Overall, ELMo exhibited the least bias. Although ConceptNet is debiased for gender and ethnicity⁷, it was found to produce strongly varying accuracy values.
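Given per-group evaluation results of this kind, the degree of bias is often summarized as the gap between the best- and worst-scoring demographic group for each model. The sketch below illustrates this aggregation step; all numbers are made up for illustration and are not results reported by Mishra et al.

```python
# Sketch of summarizing per-group NER scores as a per-model bias gap;
# the scores are invented for illustration only.
scores = {
    "BiLSTM-CRF (GloVe)":      {"white": 0.91, "non-white": 0.84},
    "BiLSTM-CRF (ConceptNet)": {"white": 0.90, "non-white": 0.86},
    "BiLSTM-CRF (ELMo)":       {"white": 0.92, "non-white": 0.90},
}

def bias_gap(group_scores: dict) -> float:
    """Difference between the best- and worst-performing group."""
    return max(group_scores.values()) - min(group_scores.values())

for model, group_scores in sorted(scores.items(), key=lambda kv: bias_gap(kv[1])):
    print(f"{model}: gap = {bias_gap(group_scores):.2f}")
```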
Gaut et al. (2020) analyzed binary gender bias in a popular open-source neural relation extraction (NRE) model, OpenNRE (Han et al., 2019). For this purpose, the authors created a new dataset named WikiGenderBias (sourced from Wikipedia and DBpedia). All sentences describe a gendered subject with one of four relations: spouse, hypernym, birthDate, or birthPlace (DBpedia mostly uses occupation-related hypernyms). The most notable bias was found for the spouse relation, which was predicted more reliably for male than for female entities. This observation stands in contrast to the predominance of female instances with the spouse relation in WikiGenderBias. The authors experimented with three different mitigation strategies: downsampling the training data to equalize the number of male and female instances, augmenting the data by artificially introducing new female instances, and word embedding debiasing (Bolukbasi et al., 2016). Only downsampling achieved a reduction of bias that did not come at the cost of model performance.
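The downsampling strategy can be sketched as follows. The function and field names are ours and the data layout is assumed, not taken from WikiGenderBias or the OpenNRE codebase; the sketch only shows the core idea of equalizing male and female instances per relation.

```python
# Sketch of gender-balanced downsampling as a bias mitigation step, in the
# spirit of Gaut et al. (2020); field names and data layout are illustrative.
import random
from collections import defaultdict

def downsample_balanced(instances, seed=0):
    """Keep equally many male and female instances for each relation."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for inst in instances:  # each inst: {"relation": ..., "gender": ..., ...}
        by_key[(inst["relation"], inst["gender"])].append(inst)

    balanced = []
    relations = {rel for rel, _ in by_key}
    for rel in relations:
        male = by_key.get((rel, "male"), [])
        female = by_key.get((rel, "female"), [])
        n = min(len(male), len(female))
        balanced += rng.sample(male, n) + rng.sample(female, n)
    rng.shuffle(balanced)
    return balanced
```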
Nowadays, contextualized transformer-based en-
5 While most of the works presented here refer to gender as a binary concept, this does not agree with our understanding. We acknowledge that gender is continuous and technology must do this reality justice.
6 https://spacy.io/
7 https://blog.conceptnet.io/posts/2017/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/