Identifying Counterfactual Queries with
the R Package cfid
by Santtu Tikka
Abstract In the framework of structural causal models, counterfactual queries describe events that
concern multiple alternative states of the system under study. Counterfactual queries often take
the form of “what if” type questions such as “would an applicant have been hired if they had over
10 years of experience, when in reality they only had 5 years of experience?” Such questions and
counterfactual inference in general are crucial, for example when addressing the problem of fairness
in decision-making. Because counterfactual events contain contradictory states of the world, it is
impossible to conduct a randomized experiment to address them without making several restrictive
assumptions. However, it is sometimes possible to identify such queries from observational and
experimental data by representing the system under study as a causal model, and the available data as
symbolic probability distributions. Shpitser and Pearl (2007) constructed two algorithms, called ID*
and IDC*, for identifying counterfactual queries and conditional counterfactual queries, respectively.
These two algorithms are analogous to the ID and IDC algorithms by Shpitser and Pearl (2006a,b) for
identification of interventional distributions, which were implemented in R by Tikka and Karvanen
(2017) in the causaleffect package. We present the R package cfid that implements the ID* and IDC*
algorithms. Identification of counterfactual queries and the features of cfid are demonstrated via
examples.
Introduction
Pearl’s ladder of causation (or causal hierarchy) consists of three levels: association, intervention,
and counterfactual (Pearl, 2009). These levels describe a hierarchy of problems with increasing
conceptual and formal difficulty. On the first and lowest level, inference on associations is based
entirely on observed data in the form of questions such as “what is the probability that an event
occurs?” or “what is the correlation between two variables?”. On the second level, the inference
problems are related to manipulations of the system under study, such as “what is the probability
of an event if we change the value of one variable in the system?”. Questions on the intervention
level cannot be answered using tools of the association level, because simply observing a change in
a system is not the same as intervening on the system. Randomized controlled trials are the gold
standard for studying the effects of interventions, because they enable the researcher to account for
confounding factors between the treatment and the outcome and to carry out the intervention in
practice. However, there are often practical limitations that make it difficult, expensive, or impossible
to conduct a randomized experiment. The third and highest level is the counterfactual level. Typically,
counterfactual statements compare the real world, where an action was taken or some event was
observed, to an alternative hypothetical scenario, where a possibly different action was taken, or a
different event was observed. Counterfactuals are often challenging to understand even conceptually
due to this notion of contradictory events in alternative worlds, and such alternatives need not be limited
to only two. In general, questions on the counterfactual level cannot be answered by relying solely
on the previous levels: no intervention or association is able to capture the notion of alternative
hypothetical worlds.
While counterfactual statements can be challenging, they are a core part of our everyday thinking
and discourse. Importantly, counterfactuals often consider retrospective questions about the state of
the world, such as “would an applicant have been hired if they had more work experience?”. This
kind of retrospection is crucial when fair treatment of individuals is considered in hiring, healthcare,
receiving loans or insurance, etc., with regards to protected attributes, especially when the goal
is automated decision-making. Statistical approaches to fairness are insufficient in most contexts,
for example in scenarios analogous to the well-known Simpson’s paradox, which can nevertheless be
routinely resolved using the framework of causal inference. In some cases, even interventional notions of fairness may be
insufficient, necessitating counterfactual fairness (Kusner et al., 2017; Zhang and Bareinboim, 2018).
The structural causal model (SCM) framework of Pearl provides a formal approach to causal
inference of interventional and counterfactual causal queries (Pearl, 2009). An SCM represents the
system of interest in two ways. First, the causal relationships are depicted by a directed acyclic
graph (DAG) whose vertices correspond to variables under study and whose edges depict the direct
functional causal relationships between the variables. Typically, only some of these variables are
observed and the remaining variables are considered latent, corresponding either to confounders
between multiple variables or individual random errors of single variables. Second, the uncertainty
related to the variables in the system is captured by assuming a joint probability distribution over its
latent variables. The functional relationships of the model induce a joint probability distribution over
the observed variables. The SCM framework also incorporates the notion of external interventions
symbolically via the do-operator, and a graphical representation of counterfactual scenarios via parallel
worlds graphs (Avin et al., 2005; Shpitser and Pearl, 2007, 2008).
One of the fundamental problems of causal inference is the so-called identifiability problem,
especially the identifiability of interventional distributions. Using the SCM framework and do-
calculus, it is sometimes possible to uniquely represent an interventional distribution using only
the observed joint probability distribution of the model before the intervention took place. Such
interventional distributions are called identifiable. More generally, we say that a causal query is
identifiable, if it can be uniquely represented using the available data. In most identifiability problems,
the available data consists of causal quantities on levels below the query in the ladder of causation, but
the levels also sometimes overlap (e.g., Bareinboim and Pearl, 2012; Tikka and Karvanen, 2019; Lee
et al., 2019). The identifiability problem of interventional distributions, as well as many other interventional
identifiability problems, has been solved by providing a sound and complete identification algorithm
(e.g., Shpitser and Pearl, 2006a; Huang and Valtorta, 2006; Lee et al., 2019; Kivva et al., 2022).
Software for causal inference is becoming increasingly prominent. For R, a comprehensive
overview of the state-of-the-art is provided by the recently launched task view on Causal Inference
on the Comprehensive R Archive Network (CRAN). Out of the packages listed in this task view,
the Counterfactual (Chen et al., 2020) and WhatIf (Stoll et al., 2020) packages are directly linked
to counterfactual inference, but the focus of these packages is estimation, and they do not consider
the identifiability of counterfactual queries. The R6causal (Karvanen, 2022) package can be used to
simulate data from counterfactual scenarios in a causal model. The R packages most closely related to
causal identifiability problems are the causaleffect (Tikka and Karvanen, 2017), dosearch (Tikka et al.,
2021), and dagitty (Textor et al., 2017) packages.
We present the first implementation of the counterfactual identifiability algorithms of Shpitser and
Pearl (2007) (see also Shpitser and Pearl, 2008) as the R package cfid (counterfactual identification).
The cfid package also provides a user-friendly interface for defining causal diagrams, and it is
compatible with other major R packages for causal identifiability problems, such as causaleffect,
dosearch, and dagitty, by supporting the graph formats used by these packages as inputs.
The paper is organized as follows. Section 2.2 introduces the notation, core concepts and definitions,
and provides an example on manual identification of a counterfactual query without relying on the
identifiability algorithms. Section 2.3 presents the algorithms implemented in cfid and demonstrates
their functionality via examples by tracing their operation line by line. Section 2.4 demonstrates the
usage of the cfid package in practice. Section 2.5 concludes the paper with a summary.
Notation and definitions
We follow the notation used by Shpitser and Pearl (2008) and we assume the reader to be familiar with
standard graph theoretic concepts such as ancestral relations between vertices and d-separation. We
use capital letters to denote random variables and lower-case letters to denote their value assignments.
Bold letters are used to denote sets of random variables and counterfactual variables. We associate the
vertices of graphs with their respective random variables and value assignments in the underlying
causal models. In figures, observed variables of graphs are denoted by circles, variables fixed by
interventions are denoted by squares, and latent unobserved variables are denoted by dashed circles
when explicitly included and by bidirected edges when the corresponding latent variable has two
observed children. Latent variables with only one child, which are called error terms, are not shown
for clarity.
A structural causal model is a tuple M = (U, V, F, P(u)), where U is a set of unobserved random
variables, V is a set of n observed random variables, F is a set of n functions such that each function
f_i is a mapping from U ∪ V \ {V_i} to V_i and such that it is possible to represent the set V as a
function of U, and P(u) is a joint probability distribution over U. The causal model also defines its
causal diagram G: each V_i ∈ V corresponds to a vertex in G, and there is a directed edge to V_i from
each V_j ∈ U ∪ V \ {V_i} that appears as an argument of f_i. We restrict our attention to recursive
causal models in this paper, meaning models that induce an acyclic causal diagram.
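To make the definition concrete, the following minimal sketch (in Python for illustration only; cfid itself is an R package and does not simulate models) encodes a hypothetical SCM with a single latent confounder U between two binary observed variables X and Y, and estimates the induced observational distribution by sampling P(u):

```python
import random

def p_u():
    # P(u): a joint distribution over the latent variables, here a single
    # binary confounder U and independent binary error terms for X and Y.
    return {
        "U": random.random() < 0.5,
        "eX": random.random() < 0.1,
        "eY": random.random() < 0.1,
    }

# F: each f_i maps the latent variables and the remaining observed
# variables to V_i; here X depends on U only, and Y depends on X and U.
def f_x(u, v):
    return u["U"] ^ u["eX"]

def f_y(u, v):
    return (v["X"] ^ u["U"]) ^ u["eY"]

def sample_observed():
    # The functions in F induce a joint distribution over V = {X, Y}.
    u = p_u()
    v = {}
    v["X"] = f_x(u, v)
    v["Y"] = f_y(u, v)
    return v

random.seed(0)
samples = [sample_observed() for _ in range(10000)]
p_y1 = sum(s["Y"] for s in samples) / len(samples)
```

Here `p_u`, `f_x`, and `f_y` play the roles of P(u) and the set F; the distribution over the observed variables arises purely from pushing draws of U through the structural functions.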
A counterfactual variable Y_x denotes the variable Y in the submodel M_x obtained from M by
forcing the random variables X to take the values x (often denoted by the do-operator as do(X = x) or
simply do(x)). The distribution of Y_x in the submodel M_x is called the interventional distribution of Y
and it is denoted by P_x(y). However, if we wish to consider multiple counterfactual variables that
originate from different interventions, we must extend our notation to counterfactual conjunctions.
Counterfactual conjunctions are constructed from value assignments of counterfactual variables, and
individual assignments are separated by the ∧ symbol. For example, y_x ∧ z_x ∧ x' denotes the event
that Y_x = y, Z_x = z and X = x'. The probability P(y_x ∧ z_x ∧ x') is the probability of the counterfactual
event. Note that primes do not differentiate variables; instead, they are used to differentiate between
values, i.e., x' is a different value from x, and they are both different from x'', but all three are value
assignments of the random variable X. If the subscript of each variable in the conjunction is the same,
the counterfactual probability simply reduces to an interventional distribution.
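The role of the shared latent variables in counterfactual probabilities can be illustrated by simulation. The sketch below (hypothetical Python, not part of cfid) estimates a conjunction probability of the form P(y_x ∧ x') in a toy binary model: the conjunct Y_x is evaluated in the submodel where f_X is replaced by a constant, the conjunct X = x' is evaluated in the original model, and crucially both worlds use the same draw of u:

```python
import random

def draw_u():
    # One draw of all latent variables: a confounder U shared by X and Y,
    # plus independent error terms.
    return {"U": random.random() < 0.5,
            "eX": random.random() < 0.2,
            "eY": random.random() < 0.2}

def f_x(u):
    return u["U"] ^ u["eX"]

def f_y(u, x):
    return (x ^ u["U"]) ^ u["eY"]

def estimate_p_yx_and_not_x(n=20000):
    # Estimates P(Y_{X=1} = 1 AND X = 0): both conjuncts are evaluated on
    # the SAME latent draw u, which is what ties the intervened world
    # M_{X=1} and the observed world M together.
    random.seed(1)
    hits = 0
    for _ in range(n):
        u = draw_u()
        x_observed = f_x(u)        # X in the original model M
        y_under_do = f_y(u, True)  # Y_x in the submodel M_{X=1}
        hits += int(y_under_do and not x_observed)
    return hits / n

p = estimate_p_yx_and_not_x()
```

This kind of direct simulation requires the full SCM to be known; the identifiability question asked in this paper is precisely when such quantities can instead be computed from interventional and observational distributions alone.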
Each counterfactual conjunction is associated with multiple parallel worlds, each induced by a
unique combination of subscripts that appears in the conjunction. A parallel worlds graph of the
conjunction is obtained by combining the graphs of the submodels induced by interventions such
that the latent variables are shared. The simplest version of a parallel worlds graph is a twin network
graph, contrasting two alternative worlds (Balke and Pearl, 1994a,b; Avin et al., 2005). As a more
complicated example, consider the counterfactual conjunction γ = y_x ∧ x' ∧ z_d ∧ d. In simpler terms,
this conjunction states that Y takes the value y under the intervention do(X = x), Z takes the value z
under the intervention do(D = d), and X and D take the values x' and d, respectively, when no
intervention took place. Importantly, this conjunction induces three distinct parallel worlds: the
non-interventional (or observed) world, a world where X was intervened on, and a world where D
was intervened on. For instance, if the graph in Figure 1(a) depicts the original causal model over
the variables Y, X, Z, W and D, then Figure 1(b) shows the corresponding parallel worlds graph for γ,
where each distinct world is represented by its own set of copies of the original variables. In Figure 1(b),
U corresponds to the bidirected edge between X and Y in Figure 1(a), and the other U-variables are
the individual error terms of each observed variable, which are not drawn when they have only one
child in Figure 1(a).
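The worlds induced by a conjunction can be read off mechanically from its distinct subscripts. The following Python sketch (an illustrative representation only; cfid's internal data structures differ) enumerates them:

```python
# Each conjunct is represented as (variable, value, interventions), where
# interventions maps intervened variables to their forced values and the
# empty dict means "no intervention" (the observed world).
gamma = [
    ("Y", "y",  {"X": "x"}),   # y_x
    ("X", "x'", {}),           # x'
    ("Z", "z",  {"D": "d"}),   # z_d
    ("D", "d",  {}),           # d
]

def parallel_worlds(conjunction):
    # Each unique combination of subscripts induces one parallel world;
    # the empty combination is the non-interventional world.
    return {frozenset(sub.items()) for _, _, sub in conjunction}

worlds = parallel_worlds(gamma)
```

For this γ the set contains three worlds, matching the discussion above: the observed world, the do(X = x) world, and the do(D = d) world.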
Note that instead of random variables, some nodes in the parallel worlds graph now depict fixed
values as assigned by the interventions in the conjunction. This is a crucial aspect when d-separation
statements are considered between counterfactual variables in the parallel worlds graph, as a backdoor
path through a fixed value is not open. Furthermore, not every variable is necessarily unique in a
parallel worlds graph, making it possible to obtain misleading results if d-separation is used to infer
conditional independence relations between counterfactual variables. For instance, if we consider
the counterfactual variables Y_x, D_x and Z in a causal model whose diagram is the graph shown in
Figure 1(a), then Y_x is independent of D_x given Z, even though Y_x is not d-separated from D_x in the
corresponding parallel worlds graph of Figure 1(b). This conditional independence holds because Z
and Z_x are in fact the same counterfactual variable. To overcome this problem, the parallel worlds
graph must be further refined into the counterfactual graph, in which every variable is unique; we
will discuss this refinement in more detail in the following sections. For causal diagrams and counterfactual graphs,
V(G) denotes the set of observable random variables not fixed by interventions, and v(G) denotes the
corresponding set of value assignments.
The following operations are defined for counterfactual conjunctions and sets of counterfactual
variables: sub(·) returns the set of subscripts, var(·) returns the set of (non-counterfactual) variables,
and ev(·) returns the set of values (either fixed by intervention or observed). For example, consider
again the conjunction γ = y_x ∧ x' ∧ z_d ∧ d. Now, sub(γ) = {x, d}, var(γ) = {Y, X, Z, D} and
ev(γ) = {y, x, x', z, d}. Finally, val(·) is the value assigned to a given counterfactual variable, e.g.,
val(y_x) = y. The notation y_x.. denotes a counterfactual variable derived from Y with the value
assignment y in a submodel M_{xz}, where Z ⊆ V \ X is arbitrary.
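With a conjunction represented as a list of (variable, value, interventions) triples, these operations become one-liners. The following is a hypothetical Python sketch (not cfid's API); the values stated in the text, e.g. sub(γ) = {x, d}, can be checked against it directly:

```python
# A conjunct is (variable, value, interventions); gamma encodes
# y_x AND x' AND z_d AND d from the running example.
gamma = [
    ("Y", "y",  {"X": "x"}),
    ("X", "x'", {}),
    ("Z", "z",  {"D": "d"}),
    ("D", "d",  {}),
]

def sub(conj):
    # Set of subscripts, i.e. all values fixed by interventions.
    return {v for _, _, s in conj for v in s.values()}

def var(conj):
    # Set of (non-counterfactual) variables appearing in the conjunction.
    return {name for name, _, _ in conj}

def ev(conj):
    # Set of values, whether observed or fixed by an intervention.
    return {value for _, value, _ in conj} | sub(conj)

def val(conjunct):
    # Value assigned to a single counterfactual variable, e.g. val(y_x) = y.
    return conjunct[1]
```

Note that ev(γ) contains both x (fixed by the intervention do(X = x)) and x' (the observed value of X), which is why it has five elements while var(γ) has four.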
The symbol P* is used to denote the set of all interventional distributions of a causal model M over
a set of observed variables V, i.e.,

P* = {P_x | x is any value assignment of X ⊆ V}.

In the following sections, we consider identifiability of counterfactual queries in terms of P*. In essence,
this means that a counterfactual probability distribution P(γ) is identifiable if it can be expressed
using purely interventional and observational probabilities of the given causal model.
Example on identifiability of a counterfactual query
We consider the identifiability of the conditional counterfactual query P(y_x | z_x ∧ x') from P* in the
graph depicted in Figure 2. This graph could for instance depict the effect of an applicant’s education
(X) on work experience (Z) and a potential hiring decision (Y) by a company. Our counterfactual query
could then consider the statement “what is the probability to be hired if the applicant’s education level
was changed to x, given that their work experience under the same intervention was z and when in
reality their education level was x'”. In this example, we will not rely on any identifiability algorithms.
Instead, we can derive a formula for the counterfactual query as follows: