
related to the variables in the system is captured by assuming a joint probability distribution over its
latent variables. The functional relationships of the model induce a joint probability distribution over
the observed variables. The SCM framework also incorporates the notion of external interventions
symbolically via the do-operator, and a graphical representation of counterfactual scenarios via parallel
worlds graphs (Avin et al., 2005; Shpitser and Pearl, 2007, 2008).
One of the fundamental problems of causal inference is the so-called identifiability problem,
especially the identifiability of interventional distributions. Using the SCM framework and do-calculus,
it is sometimes possible to uniquely represent an interventional distribution using only
the observed joint probability distribution of the model before the intervention took place. Such
interventional distributions are called identifiable. More generally, we say that a causal query is
identifiable if it can be uniquely represented using the available data. In most identifiability problems,
the available data consists of causal quantities on levels below the query in the ladder of causation, but
the levels sometimes also overlap (e.g., Bareinboim and Pearl, 2012; Tikka and Karvanen, 2019; Lee
et al., 2019). The identifiability problem of interventional distributions and many other interventional
identifiability problems have been solved by providing sound and complete identification algorithms
(e.g., Shpitser and Pearl, 2006a; Huang and Valtorta, 2006; Lee et al., 2019; Kivva et al., 2022).
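As a small illustration of what identifiability means, consider a standard textbook case that is not taken from this paper. Assume a causal diagram with the edges $Z \to X$, $Z \to Y$ and $X \to Y$, where all three variables are observed and there are no latent common causes. Under these assumptions, the interventional distribution of $Y$ can be written using only the observed joint distribution $P(x, y, z)$ via the back-door adjustment:

$$
P(y \mid do(x)) = \sum_{z} P(y \mid x, z) \, P(z).
$$

In contrast, if the diagram consists of the edge $X \to Y$ together with a latent common cause of $X$ and $Y$, no such expression in terms of $P(x, y)$ exists, and the interventional distribution is not identifiable from the observed distribution alone.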
Software for causal inference is becoming increasingly prominent. For R, a comprehensive
overview of the state-of-the-art is provided by the recently launched task view on Causal Inference
on the Comprehensive R Archive Network (CRAN). Out of the packages listed in this task view,
the Counterfactual (Chen et al., 2020) and WhatIf (Stoll et al., 2020) packages are directly linked
to counterfactual inference, but the focus of these packages is estimation and they do not consider
the identifiability of counterfactual queries. The R6causal (Karvanen, 2022) package can be used to
simulate data from counterfactual scenarios in a causal model. R packages most closely related to
causal identifiability problems are the causaleffect (Tikka and Karvanen, 2017), dosearch (Tikka et al.,
2021), and dagitty (Textor et al., 2017).
We present the first implementation of the counterfactual identifiability algorithms of Shpitser and
Pearl (2007) (see also Shpitser and Pearl, 2008) as the R package cfid (counterfactual identification).
The cfid package also provides a user-friendly interface for defining causal diagrams and the package
is compatible with other major R packages for causal identifiability problems such as causaleffect,
dosearch and dagitty by supporting graph formats used by these packages as inputs.
The paper is organized as follows. Section 2.2 introduces the notation, core concepts and definitions,
and provides an example of manual identification of a counterfactual query without relying on the
identifiability algorithms. Section 2.3 presents the algorithms implemented in cfid and demonstrates
their functionality via examples by tracing their operation line by line. Section 2.4 demonstrates the
usage of the cfid package in practice. Section 2.5 concludes the paper with a summary.
Notation and definitions
We follow the notation used by Shpitser and Pearl (2008) and we assume the reader to be familiar with
standard graph theoretic concepts such as ancestral relations between vertices and d-separation. We
use capital letters to denote random variables and lower-case letters to denote their value assignments.
Bold letters are used to denote sets of random variables and counterfactual variables. We associate the
vertices of graphs with their respective random variables and value assignments in the underlying
causal models. In figures, observed variables of graphs are denoted by circles, variables fixed by
interventions are denoted by squares, and latent unobserved variables are denoted by dashed circles
when explicitly included and by bidirected edges when the corresponding latent variable has two
observed children. Latent variables with only one child, which are called error terms, are not shown
for clarity.
A structural causal model is a tuple $M = (\mathbf{U}, \mathbf{V}, \mathbf{F}, P(\mathbf{u}))$, where $\mathbf{U}$ is a set of unobserved
random variables, $\mathbf{V}$ is a set of $n$ observed random variables, $\mathbf{F}$ is a set of $n$ functions such that
each function $f_i$ is a mapping from $\mathbf{U} \cup \mathbf{V} \setminus \{V_i\}$ to $V_i$ and such that it is possible to represent
the set $\mathbf{V}$ as a function of $\mathbf{U}$. $P(\mathbf{u})$ is a joint probability distribution over $\mathbf{U}$. The causal model
also defines its causal diagram $G$. Each $V_i \in \mathbf{V}$ corresponds to a vertex in $G$, and there is a directed
edge from each $V_j \in \mathbf{U} \cup \mathbf{V} \setminus \{V_i\}$ to $V_i$. We restrict our attention to recursive causal models in
this paper, meaning models that induce an acyclic causal diagram.
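The definition above can be made concrete with a small simulation. The following is a minimal sketch in base R, not an example of the cfid interface. The model, the function name `simulate_scm`, the argument `do_x` and all probabilities are hypothetical choices made for illustration. The model has latent variables $U_X$, $U_Y$ and $U_{XY}$ and observed variables $X$ and $Y$, and its diagram has the edge $X \to Y$ and a bidirected edge between $X$ and $Y$. Passing a value for `do_x` replaces $f_X$ by a constant, which corresponds to the external interventions mentioned in the introduction.

```r
# Toy SCM (hypothetical, for illustration only; base R, not the cfid API).
# U = {U_X, U_Y, U_XY}, V = {X, Y}, F = {f_X, f_Y}.
simulate_scm <- function(n, do_x = NULL) {
  # P(u): the latent variables are sampled independently
  u_x  <- rbinom(n, 1, 0.1)  # error term of X (single child)
  u_y  <- rbinom(n, 1, 0.1)  # error term of Y (single child)
  u_xy <- rbinom(n, 1, 0.5)  # latent common cause of X and Y
  # f_X: X = U_XY xor U_X, unless X is fixed by an intervention
  x <- if (is.null(do_x)) {
    as.integer(xor(u_xy, u_x))
  } else {
    rep(as.integer(do_x), n)
  }
  # f_Y: Y = (X and U_XY) xor U_Y; evaluated after X, which is
  # possible because the model is recursive
  y <- as.integer(xor(x & u_xy, u_y))
  data.frame(X = x, Y = y)
}

set.seed(1)
obs <- simulate_scm(1e5)            # samples from P(x, y)
int <- simulate_scm(1e5, do_x = 1)  # samples with X forced to 1

mean(obs$Y[obs$X == 1])  # estimates P(Y = 1 | X = 1); analytically 0.82
mean(int$Y)              # estimates P(Y = 1 | do(X = 1)); analytically 0.5
```

In this toy model the two estimates differ because the latent common cause makes conditioning on $X = 1$ informative about $U_{XY}$, whereas forcing $X$ to the value 1 is not.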
A counterfactual variable $Y_{\mathbf{x}}$ denotes the variable $Y$ in the submodel $M_{\mathbf{x}}$ obtained from $M$ by
forcing the random variables $\mathbf{X}$ to take the values $\mathbf{x}$ (often denoted by the do-operator as
$do(\mathbf{X} = \mathbf{x})$ or simply $do(\mathbf{x})$). The distribution of $Y_{\mathbf{x}}$ in the submodel $M_{\mathbf{x}}$ is called the
interventional distribution of $Y$ and it is denoted by $P_{\mathbf{x}}(y)$. However, if we wish to consider multiple counterfactual variables that
originate from different interventions, we must extend our notation to counterfactual conjunctions.
Counterfactual conjunctions are constructed from value assignments of counterfactual variables, and