Sample-Specific Root Causal Inference with Latent Variables
Eric V. Strobl, Thomas A. Lasko
Abstract
Root causal analysis seeks to identify the set of initial perturbations that induce an unwanted
outcome. In prior work, we defined sample-specific root causes of disease using exogenous error terms
that predict a diagnosis in a structural equation model. We rigorously quantified predictivity using
Shapley values. However, the associated algorithms for inferring root causes assume no latent
confounding. We relax this assumption by permitting confounding among the predictors. We then
introduce a corresponding procedure called Extract Errors with Latents (EEL) for recovering the
error terms up to contamination by vertices on certain paths under the linear non-Gaussian acyclic
model. EEL also identifies the smallest sets of dependent errors for fast computation of the
Shapley values. The algorithm bypasses the hard problem of estimating the underlying causal graph
in both cases. Experiments highlight the superior accuracy and robustness of EEL relative to its
predecessors.
Keywords: causal inference, root cause, confounding, LiNGAM
1. Introduction
Causal inference refers to the process of inferring causal relations from data. Most scientists identify
causal relations by conducting randomized controlled trials (RCTs). RCTs can nevertheless fail to
distinguish between a cause and a root cause of disease, that is, the initial perturbation to an otherwise
healthy system that ultimately induces a diagnostic label. Identifying root causes is critical for
(a) understanding disease mechanisms and (b) discovering drug targets that eliminate disease at its
onset in a biological pathway.
Consider for example the directed graph in Figure 1 (a), where vertices in X represent random
variables and directed edges their direct causal relations; we have Xi → Xj when Xi directly causes
Xj. The lightning bolt in the figure denotes an exogenous perturbation of the root cause X2, such
as a virus, mutation or physical injury. This perturbation in turn affects many downstream variables,
such as {X3, X4}, ultimately causing symptoms {X5, X6} and physicians to label a patient with
a diagnosis D = 1 indicating disease. The causes of D include X1, . . . , X6, but we only seek to
identify the root cause X2 that may lie arbitrarily far upstream from D in the general case.
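The propagation described above can be illustrated with a minimal simulation sketch. It assumes a linear structural equation model X = BX + E with non-Gaussian errors. The edge set (X1 → X2, X2 → X3, X2 → X4, X3 → X5, X4 → X6), the edge weights, the uniform error distribution and the function name `simulate` are all hypothetical choices made for illustration; they are not taken from Figure 1 (a). The diagnosis D is omitted for brevity.

```python
import numpy as np

def simulate(n, shift_e2=0.0, seed=0):
    """Draw n samples from a hypothetical linear SEM X = B X + E.

    shift_e2 is added to the error term of X2 and plays the role of the
    exogenous perturbation (the "lightning bolt") on the root cause.
    """
    rng = np.random.default_rng(seed)
    # B[i, j] is the assumed direct effect of X_{j+1} on X_{i+1}.
    B = np.zeros((6, 6))
    B[1, 0] = 0.8  # X1 -> X2
    B[2, 1] = 0.9  # X2 -> X3
    B[3, 1] = 0.7  # X2 -> X4
    B[4, 2] = 0.6  # X3 -> X5
    B[5, 3] = 0.5  # X4 -> X6
    # Non-Gaussian (uniform) errors, one column per variable.
    E = rng.uniform(-1.0, 1.0, size=(n, 6))
    E[:, 1] += shift_e2  # perturb only the error term of X2
    # Solve x = B x + e for each sample: x = (I - B)^{-1} e.
    X = E @ np.linalg.inv(np.eye(6) - B).T
    return X, E

# Same seed, with and without the perturbation of X2's error term.
X_base, _ = simulate(5, shift_e2=0.0, seed=1)
X_pert, _ = simulate(5, shift_e2=2.0, seed=1)
# Per-variable change caused by the perturbation (identical in every row).
delta = (X_pert - X_base)[0]
```

Under these assumed weights, `delta` is zero for X1, which lies upstream of X2, and nonzero for X2 through X6: the single perturbed error term changes every downstream variable, even though only X2 was perturbed directly.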
Identifying root causes is further complicated by the existence of complex disease, where each
patient may have multiple root causes, and root causes may differ between patients even within the
same diagnostic category. The disease may also only affect certain tissues or cells in the body. We
therefore more specifically seek to identify sample-specific root causes, where a sample may denote
an arbitrary unit of granularity such as a patient, tissue or cell. Identifying sample-specific root
causes has the potential to help experimentalists rapidly identify interventions that target the very
beginnings of disease unique to each patient.
The above intuitive idea of a sample-specific root cause nevertheless lacks a rigorous
mathematical definition. This in turn hinders the development of principled algorithms for the
automated detection of such root causes. As a result, we explicitly defined sample-specific root causes of disease as
the error terms in a structural equation model that predict a diagnostic label in prior work (Strobl
arXiv:2210.15340v1 [stat.ML] 27 Oct 2022