Identifying Counterfactual Queries with
the R Package cfid
by Santtu Tikka
Abstract In the framework of structural causal models, counterfactual queries describe events that
concern multiple alternative states of the system under study. Counterfactual queries often take
the form of “what if” type questions such as “would an applicant have been hired if they had over
10 years of experience, when in reality they only had 5 years of experience?” Such questions and
counterfactual inference in general are crucial, for example when addressing the problem of fairness
in decision-making. Because counterfactual events contain contradictory states of the world, it is
impossible to conduct a randomized experiment to address them without making several restrictive
assumptions. However, it is sometimes possible to identify such queries from observational and
experimental data by representing the system under study as a causal model, and the available data as
symbolic probability distributions. Shpitser and Pearl (2007) constructed two algorithms, called ID*
and IDC*, for identifying counterfactual queries and conditional counterfactual queries, respectively.
These two algorithms are analogous to the ID and IDC algorithms by Shpitser and Pearl (2006a,b) for
identification of interventional distributions, which were implemented in R by Tikka and Karvanen
(2017) in the causaleffect package. We present the R package cfid that implements the ID* and IDC*
algorithms. Identification of counterfactual queries and the features of cfid are demonstrated via
examples.
Introduction
Pearl’s ladder of causation (or causal hierarchy) consists of three levels: association, intervention,
and counterfactual (Pearl, 2009). These levels describe a hierarchy of problems with increasing
conceptual and formal difficulty. On the first and lowest level, inference on associations is based
entirely on observed data in the form of questions such as “what is the probability that an event
occurs?” or “what is the correlation between two variables?”. On the second level, the inference
problems are related to manipulations of the system under study, such as “what is the probability
of an event if we change the value of one variable in the system?”. Questions on the intervention
level cannot be answered using tools of the association level, because simply observing a change in
a system is not the same as intervening on the system. Randomized controlled trials are the gold
standard for studying the effects of interventions, because they enable the researcher to account for
confounding factors between the treatment and the outcome and to carry out the intervention in
practice. However, there are often practical limitations that make it difficult, expensive, or impossible
to conduct a randomized experiment. The third and highest level is the counterfactual level. Typically,
counterfactual statements compare the real world, where an action was taken or some event was
observed, to an alternative hypothetical scenario, where a possibly different action was taken, or a
different event was observed. Counterfactuals are often challenging to understand even conceptually
due to this notion of contradictory events in alternative worlds, and such alternatives need not be limited
to only two. In general, questions on the counterfactual level cannot be answered by relying solely
on the previous levels: no intervention or association is able to capture the notion of alternative
hypothetical worlds.
While counterfactual statements can be challenging, they are a core part of our everyday thinking
and discourse. Importantly, counterfactuals often consider retrospective questions about the state of
the world, such as “would an applicant have been hired if they had more work experience?”. This
kind of retrospection is crucial when fair treatment of individuals is considered in hiring, healthcare,
receiving loans or insurance, etc., with regards to protected attributes, especially when the goal
is automated decision-making. Statistical approaches to fairness are insufficient in most contexts,
for example in scenarios analogous to the well-known Simpson’s paradox, which can nevertheless be
routinely resolved using the framework of causal inference. In some cases, even interventional notions of fairness may be
insufficient, necessitating counterfactual fairness (Kusner et al., 2017; Zhang and Bareinboim, 2018).
The structural causal model (SCM) framework of Pearl provides a formal approach to causal
inference of interventional and counterfactual causal queries (Pearl, 2009). An SCM represents the
system of interest in two ways. First, the causal relationships are depicted by a directed acyclic
graph (DAG) whose vertices correspond to variables under study and whose edges depict the direct
functional causal relationships between the variables. Typically, only some of these variables are
observed and the remaining variables are considered latent, corresponding either to confounders
between multiple variables or individual random errors of single variables. Second, the uncertainty
related to the variables in the system is captured by assuming a joint probability distribution over its
latent variables. The functional relationships of the model induce a joint probability distribution over
the observed variables. The SCM framework also incorporates the notion of external interventions
symbolically via the do-operator, and a graphical representation of counterfactual scenarios via parallel
worlds graphs (Avin et al., 2005; Shpitser and Pearl, 2007, 2008).
One of the fundamental problems of causal inference is the so-called identifiability problem,
especially the identifiability of interventional distributions. Using the SCM framework and do-
calculus, it is sometimes possible to uniquely represent an interventional distribution using only
the observed joint probability distribution of the model before the intervention took place. Such
interventional distributions are called identifiable. More generally, we say that a causal query is
identifiable, if it can be uniquely represented using the available data. In most identifiability problems,
the available data consists of causal quantities on levels below the query in the ladder of causation, but
the levels also sometimes overlap (e.g., Bareinboim and Pearl, 2012; Tikka and Karvanen, 2019; Lee
et al., 2019). The identifiability problem of interventional distributions, as well as many other interventional
identifiability problems, has been solved by providing a sound and complete identification algorithm
(e.g., Shpitser and Pearl, 2006a; Huang and Valtorta, 2006; Lee et al., 2019; Kivva et al., 2022).
Software for causal inference is becoming increasingly prominent. For R, a comprehensive
overview of the state-of-the-art is provided by the recently launched task view on Causal Inference
on the Comprehensive R Archive Network (CRAN). Out of the packages listed in this task view,
the Counterfactual (Chen et al., 2020) and WhatIf (Stoll et al., 2020) packages are directly linked
to counterfactual inference, but the focus of these packages is estimation, and they do not consider
the identifiability of counterfactual queries. The R6causal (Karvanen, 2022) package can be used to
simulate data from counterfactual scenarios in a causal model. The R packages most closely related to
causal identifiability problems are the causaleffect (Tikka and Karvanen, 2017), dosearch (Tikka et al.,
2021), and dagitty (Textor et al., 2017) packages.
We present the first implementation of the counterfactual identifiability algorithms of Shpitser and
Pearl (2007) (see also Shpitser and Pearl, 2008) as the R package cfid (counterfactual identification).
The cfid package also provides a user-friendly interface for defining causal diagrams, and it is
compatible with other major R packages for causal identifiability problems, such as causaleffect,
dosearch, and dagitty, by supporting the graph formats used by these packages as inputs.
The paper is organized as follows. Section 2.2 introduces the notation, core concepts and definitions,
and provides an example on manual identification of a counterfactual query without relying on the
identifiability algorithms. Section 2.3 presents the algorithms implemented in cfid and demonstrates
their functionality via examples by tracing their operation line by line. Section 2.4 demonstrates the
usage of the cfid package in practice. Section 2.5 concludes the paper with a summary.
Notation and definitions
We follow the notation used by Shpitser and Pearl (2008) and we assume the reader to be familiar with
standard graph theoretic concepts such as ancestral relations between vertices and d-separation. We
use capital letters to denote random variables and lower-case letters to denote their value assignments.
Bold letters are used to denote sets of random variables and counterfactual variables. We associate the
vertices of graphs with their respective random variables and value assignments in the underlying
causal models. In figures, observed variables of graphs are denoted by circles, variables fixed by
interventions are denoted by squares, and latent unobserved variables are denoted by dashed circles
when explicitly included and by bidirected edges when the corresponding latent variable has two
observed children. Latent variables with only one child, which are called error terms, are not shown
for clarity.
A structural causal model is a tuple M = (U, V, F, P(u)), where U is a set of unobserved random
variables, V is a set of n observed random variables, F is a set of n functions such that each function
f_i is a mapping from U ∪ V \ {V_i} to V_i and such that it is possible to represent the set V as a
function of U, and P(u) is a joint probability distribution over U. The causal model also defines its
causal diagram G: each V_i ∈ V corresponds to a vertex in G, and there is a directed edge to V_i from
each V_j ∈ U ∪ V \ {V_i} that appears as an argument of f_i. We restrict our attention to recursive
causal models in this paper, meaning models that induce an acyclic causal diagram.
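To make the definition concrete, the following minimal sketch (in Python for illustration only; cfid itself is an R package and does not simulate models) encodes a hypothetical SCM with a single latent confounder U between two binary observed variables X and Y, and estimates the induced observational distribution by sampling P(u):

```python
import random

def p_u():
    # P(u): a joint distribution over the latent variables, here a single
    # binary confounder U and independent binary error terms for X and Y.
    return {
        "U": random.random() < 0.5,
        "eX": random.random() < 0.1,
        "eY": random.random() < 0.1,
    }

# F: each f_i maps the latent variables and the remaining observed
# variables to V_i; here X depends on U only, and Y depends on X and U.
def f_x(u, v):
    return u["U"] ^ u["eX"]

def f_y(u, v):
    return (v["X"] ^ u["U"]) ^ u["eY"]

def sample_observed():
    # The functions in F induce a joint distribution over V = {X, Y}.
    u = p_u()
    v = {}
    v["X"] = f_x(u, v)
    v["Y"] = f_y(u, v)
    return v

random.seed(0)
samples = [sample_observed() for _ in range(10000)]
p_y1 = sum(s["Y"] for s in samples) / len(samples)
```

Here `p_u`, `f_x`, and `f_y` play the roles of P(u) and the set F; the distribution over the observed variables arises purely from pushing draws of U through the structural functions.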
A counterfactual variable Y_x denotes the variable Y in the submodel M_x obtained from M by
forcing the random variables X to take the values x (often denoted by the do-operator as do(X = x) or
simply do(x)). The distribution of Y_x in the submodel M_x is called the interventional distribution of Y
and it is denoted by P_x(y). However, if we wish to consider multiple counterfactual variables that
originate from different interventions, we must extend our notation to counterfactual conjunctions.
Counterfactual conjunctions are constructed from value assignments of counterfactual variables, and
individual assignments are separated by the ∧ symbol. For example, y_x ∧ z_x ∧ x' denotes the event
that Y_x = y, Z_x = z and X = x'. The probability P(y_x ∧ z_x ∧ x') is the probability of the counterfactual
event. Note that primes do not differentiate variables; instead, they are used to differentiate between
values, i.e., x' is a different value from x, and they are both different from x'', but all three are value
assignments of the random variable X. If the subscript of each variable in the conjunction is the same,
the counterfactual probability simply reduces to an interventional distribution.
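The role of the shared latent variables in counterfactual probabilities can be illustrated by simulation. The sketch below (hypothetical Python, not part of cfid) estimates a conjunction probability of the form P(y_x ∧ x') in a toy binary model: the conjunct Y_x is evaluated in the submodel where f_X is replaced by a constant, the conjunct X = x' is evaluated in the original model, and crucially both worlds use the same draw of u:

```python
import random

def draw_u():
    # One draw of all latent variables: a confounder U shared by X and Y,
    # plus independent error terms.
    return {"U": random.random() < 0.5,
            "eX": random.random() < 0.2,
            "eY": random.random() < 0.2}

def f_x(u):
    return u["U"] ^ u["eX"]

def f_y(u, x):
    return (x ^ u["U"]) ^ u["eY"]

def estimate_p_yx_and_not_x(n=20000):
    # Estimates P(Y_{X=1} = 1 AND X = 0): both conjuncts are evaluated on
    # the SAME latent draw u, which is what ties the intervened world
    # M_{X=1} and the observed world M together.
    random.seed(1)
    hits = 0
    for _ in range(n):
        u = draw_u()
        x_observed = f_x(u)        # X in the original model M
        y_under_do = f_y(u, True)  # Y_x in the submodel M_{X=1}
        hits += int(y_under_do and not x_observed)
    return hits / n

p = estimate_p_yx_and_not_x()
```

This kind of direct simulation requires the full SCM to be known; the identifiability question asked in this paper is precisely when such quantities can instead be computed from interventional and observational distributions alone.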
Each counterfactual conjunction is associated with multiple parallel worlds, each induced by a
unique combination of subscripts that appears in the conjunction. A parallel worlds graph of the
conjunction is obtained by combining the graphs of the submodels induced by interventions such
that the latent variables are shared. The simplest version of a parallel worlds graph is a twin network
graph, contrasting two alternative worlds (Balke and Pearl, 1994a,b; Avin et al., 2005). As a more
complicated example, consider the counterfactual conjunction γ = y_x ∧ x' ∧ z_d ∧ d. In simpler terms,
this conjunction states that Y takes the value y under the intervention do(X = x), Z takes the value z
under the intervention do(D = d), and X and D take the values x' and d, respectively, when no
intervention took place. Importantly, this conjunction induces three distinct parallel worlds: the
non-interventional (or observed) world, a world where X was intervened on, and a world where D
was intervened on. For instance, if the graph in Figure 1(a) depicts the original causal model over
the variables Y, X, Z, W and D, then Figure 1(b) shows the corresponding parallel worlds graph for γ,
where each distinct world is represented by its own set of copies of the original variables. In Figure 1(b),
U corresponds to the bidirected edge between X and Y in Figure 1(a), and the other U-variables are
the individual error terms of each observed variable, which are not drawn when they have only one
child in Figure 1(a).
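The worlds induced by a conjunction can be read off mechanically from its distinct subscripts. The following Python sketch (an illustrative representation only; cfid's internal data structures differ) enumerates them:

```python
# Each conjunct is represented as (variable, value, interventions), where
# interventions maps intervened variables to their forced values and the
# empty dict means "no intervention" (the observed world).
gamma = [
    ("Y", "y",  {"X": "x"}),   # y_x
    ("X", "x'", {}),           # x'
    ("Z", "z",  {"D": "d"}),   # z_d
    ("D", "d",  {}),           # d
]

def parallel_worlds(conjunction):
    # Each unique combination of subscripts induces one parallel world;
    # the empty combination is the non-interventional world.
    return {frozenset(sub.items()) for _, _, sub in conjunction}

worlds = parallel_worlds(gamma)
```

For this γ the set contains three worlds, matching the discussion above: the observed world, the do(X = x) world, and the do(D = d) world.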
Note that instead of random variables, some nodes in the parallel worlds graph now depict fixed
values as assigned by the interventions in the conjunction. This is a crucial aspect when d-separation
statements are considered between counterfactual variables in the parallel worlds graph, as a backdoor
path through a fixed value is not open. Furthermore, not every variable is necessarily unique in a
parallel worlds graph, making it possible to obtain misleading results if d-separation is used to infer
conditional independence relations between counterfactual variables. For instance, if we consider
the counterfactual variables Y_x, D_x and Z in a causal model whose diagram is the graph shown in
Figure 1(a), then Y_x is independent of D_x given Z, even though Y_x is not d-separated from D_x in the
corresponding parallel worlds graph of Figure 1(b). This conditional independence holds because Z
and Z_x are in fact the same counterfactual variable. To overcome this problem, the parallel worlds
graph must be further refined into the counterfactual graph, in which every variable is unique; we
will discuss this refinement in more detail in the following sections. For causal diagrams and counterfactual graphs,
V(G) denotes the set of observable random variables not fixed by interventions, and v(G) denotes the
corresponding set of value assignments.
The following operations are defined for counterfactual conjunctions and sets of counterfactual
variables: sub(·) returns the set of subscripts, var(·) returns the set of (non-counterfactual) variables,
and ev(·) returns the set of values (either fixed by intervention or observed). For example, consider
again the conjunction γ = y_x ∧ x' ∧ z_d ∧ d. Now, sub(γ) = {x, d}, var(γ) = {Y, X, Z, D} and
ev(γ) = {y, x, x', z, d}. Finally, val(·) is the value assigned to a given counterfactual variable, e.g.,
val(y_x) = y. The notation y_x.. denotes a counterfactual variable derived from Y with the value
assignment y in a submodel M_{xz}, where Z ⊆ V \ X is arbitrary.
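With a conjunction represented as a list of (variable, value, interventions) triples, these operations become one-liners. The following is a hypothetical Python sketch (not cfid's API); the values stated in the text, e.g. sub(γ) = {x, d}, can be checked against it directly:

```python
# A conjunct is (variable, value, interventions); gamma encodes
# y_x AND x' AND z_d AND d from the running example.
gamma = [
    ("Y", "y",  {"X": "x"}),
    ("X", "x'", {}),
    ("Z", "z",  {"D": "d"}),
    ("D", "d",  {}),
]

def sub(conj):
    # Set of subscripts, i.e. all values fixed by interventions.
    return {v for _, _, s in conj for v in s.values()}

def var(conj):
    # Set of (non-counterfactual) variables appearing in the conjunction.
    return {name for name, _, _ in conj}

def ev(conj):
    # Set of values, whether observed or fixed by an intervention.
    return {value for _, value, _ in conj} | sub(conj)

def val(conjunct):
    # Value assigned to a single counterfactual variable, e.g. val(y_x) = y.
    return conjunct[1]
```

Note that ev(γ) contains both x (fixed by the intervention do(X = x)) and x' (the observed value of X), which is why it has five elements while var(γ) has four.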
The symbol P* is used to denote the set of all interventional distributions of a causal model M over
a set of observed variables V, i.e.,

P* = {P_x | x is any value assignment of X ⊆ V}.

In the following sections, we consider identifiability of counterfactual queries in terms of P*. In essence,
this means that a counterfactual probability distribution P(γ) is identifiable if it can be expressed
using purely interventional and observational probabilities of the given causal model.
Example on identifiability of a counterfactual query
We consider the identifiability of the conditional counterfactual query P(y_x | z_x ∧ x') from P* in the
graph depicted in Figure 2. This graph could for instance depict the effect of an applicant’s education
(X) on work experience (Z) and a potential hiring decision (Y) by a company. Our counterfactual query
could then consider the statement “what is the probability to be hired if the applicant’s education level
was changed to x, given that their work experience under the same intervention was z and when in
reality their education level was x'”. In this example, we will not rely on any identifiability algorithms.
Instead, we can derive a formula for the counterfactual query as follows: