CausalVAE and CCGM focus on causal discovery concurrently with simulation (i.e., reconstruction error-based training). But in many real-world applications, a causal model is available or readily hypothesized. It is often of interest to test various causal model hypotheses not only for in-distribution (ID) test-data performance, but also for generalization to out-of-distribution (OOD) test data. Thus we propose CSHTEST and CSVHTEST, causally constrained architectures that forgo structural causal discovery (but not the functional approximation) in favor of causal hypothesis testing. Combined with comprehensive non-random dataset splits that test generalization to non-overlapping distributions, they provide a systematic way to test structural causal hypotheses and to use those models to generate synthetic data outside the training distribution.
2 Background
2.1 Causality and Model Hypothesis Testing
The causality literature has detailed the benefits of interventions and counterfactual modeling once a causal model is known. Given a structural prior, a causal model can tell us which parameters are identifiable from observational data alone, subject to no-confounder and conditioning criteria determined by d-separation rules [1]. Because our structural priors are not known to be ground truth, we assume a more deterministic functional form and can make no assumptions about identifiability [8]. Instead, we rely on deep neural networks to approximate the functional relationships and use empirical results to demonstrate that this method reliably compares structural hypotheses in low-data environments.
Structural causal priors are primarily about the ordering of, and the absence of, connections between variables. It is the absence of an edge that prevents information flow, reducing the likelihood that spurious connections are learned within the training dataset distribution. Thus, when comparing our architecture to traditional deep-learning prediction and generative models, we show how hypothesized causal models may perform worse when tested within the same distribution as the training data, but generalize drastically better when the test and train distributions are split to have less overlap. This effect is most pronounced in small datasets, where traditional deep-learning methods, absent causal priors, can "memorize" spurious patterns in the data and vastly overfit the training distribution [9].
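To make such a non-random split concrete, a minimal sketch (with hypothetical names; any covariate and quantile could be chosen) holds out every sample beyond a quantile of one covariate, so the test set lies outside the training support along that axis:

```python
import numpy as np

def covariate_shift_split(X, col, q=0.8):
    """Split rows so the test set lies beyond the q-quantile of one column.

    Unlike a random split, the resulting test distribution does not
    overlap the training distribution along the chosen covariate.
    """
    threshold = np.quantile(X[:, col], q)
    train_mask = X[:, col] <= threshold
    return X[train_mask], X[~train_mask]

# Example: 1,000 samples of 4 tabular variables, split on variable 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X_train, X_test = covariate_shift_split(X, col=0, q=0.8)
```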
Our architectures explore the use of the causal layer, provided with priors, as a hypothesis-testing space. Both CSHTEST and CSVHTEST accept non-parametric (structural only, with no functional form or parameters) causal priors as a binary Structural Causal Model (SCM) and use deep learning to approximate the functional relationships that minimize a mean-squared reconstruction error (MSE). Our empirical results show the benefits of testing structural priors with these architectures to establish a baseline for comparison where stronger causal assumptions cannot be satisfied.
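As an illustration of such a non-parametric prior, consider two competing structural hypotheses over three variables (a hypothetical toy system, not one of the paper's datasets):

```python
import numpy as np

# Hypothesis H1: a chain x0 -> x1 -> x2.
A1 = np.array([[0, 1, 0],
               [0, 0, 1],
               [0, 0, 0]])

# Hypothesis H2: a fork x0 -> x1, x0 -> x2.
A2 = np.array([[0, 1, 1],
               [0, 0, 0],
               [0, 0, 0]])

# Each binary matrix fixes only which edges exist; the functional
# form of every edge is left to learned MLPs, so either hypothesis
# can be scored by its held-out reconstruction MSE.
```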
3 Causal Hypothesis Gen and Variational Model
3.1 Causal Hypothesis Testing with CSHTEST
Our model CSHTEST uses a causal layer similar to that in both CCGM and CausalVAE [6, 7]. The causal layer consists of a structural prior matrix $S$ followed by non-linear functions defined by MLPs. We define the structural prior $S \in \{0,1\}^{d \times d}$ so that $S$ is the sum of a DAG term and a diagonal term:
\[
S = \underbrace{A}_{\text{DAG}} + \underbrace{D}_{\text{diag.}} \tag{1}
\]
$A$ represents a DAG adjacency matrix, usually referred to as the causal structural model in the literature, and $D$ has a 1 on the diagonal for exogenous variables and a 0 for endogenous ones. Then, given tabular inputs $\mathbf{x} \in \mathbb{R}^d$, $S_{ij}$ is an indicator determining whether variable $i$ is a parent of variable $j$.
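Continuing the toy chain hypothesis above (purely illustrative), $x_0$ has no parents and is exogenous, so $D$ carries it through while $A$ routes the remaining variables through their parents:

```python
import numpy as np

A = np.array([[0, 1, 0],   # x0 -> x1
              [0, 0, 1],   # x1 -> x2
              [0, 0, 0]])
D = np.diag([1, 0, 0])     # only x0 is exogenous
S = A + D

# With S[i, j] = 1 iff variable i is a parent of variable j, column j
# of S lists the inputs visible to output j: x0 sees only itself,
# x1 sees x0, and x2 sees x1.
```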
From the structural prior $S$, the input variables are "selected" as the parents of each output variable through a Hadamard product with the features $\mathbf{x}$. For each output variable, its parents are passed through a non-linear, fully connected neural network $\eta$. The $\eta$ networks are trained as general function approximators, learning to approximate the relationships between parent and child nodes:
\[
\hat{\mathbf{x}}_i = \eta_i(S_i \circ \mathbf{x}) \tag{2}
\]
where $S_i$ is the $i$-th column of $S$ and $\circ$ denotes the Hadamard product.
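A minimal PyTorch sketch of this layer (our illustrative reading of Eq. (2), not the authors' released code; layer sizes are arbitrary) masks the features with the column of $S$ belonging to each output variable and feeds the result to that variable's $\eta$ network:

```python
import torch
import torch.nn as nn

class CausalLayer(nn.Module):
    """Reconstruct each variable from its hypothesized parents.

    S is a fixed binary (d x d) structural prior with S[i, j] = 1
    iff variable i is a parent of variable j (plus the exogenous
    diagonal), so column j masks the inputs visible to output j.
    """

    def __init__(self, S, hidden=16):
        super().__init__()
        d = S.shape[0]
        self.register_buffer("S", S.float())
        # One eta network per output variable, as in Eq. (2).
        self.etas = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(d)
        )

    def forward(self, x):                      # x: (batch, d)
        outs = [eta(x * self.S[:, j])          # Hadamard mask selects parents of j
                for j, eta in enumerate(self.etas)]
        return torch.cat(outs, dim=1)          # x_hat: (batch, d)

# Usage sketch: fit on the training split by minimizing MSE, then
# score the structural hypothesis on the shifted test split.
S = torch.tensor([[1, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])                  # chain prior with x0 exogenous
layer = CausalLayer(S)
x = torch.randn(8, 3)
loss = nn.functional.mse_loss(layer(x), x)
```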