
3 Method
Framework.
Figure 1 gives an overview of the DAG-DB framework. Suppose we wish to learn a DAG with $d$ nodes from a dataset $X \in \mathbb{R}^{n \times d}$ of $n$ data points $x \in \mathbb{R}^d$. Let $\mathcal{R} \subset \mathbb{R}^{d \times d}$ and $\mathcal{B} \subset \{0,1\}^{d \times d}$ be the sets of zero-diagonal, respectively real and binary, matrices. A learnable parameter $\Theta \in \mathcal{R}$ parameterises an exponential family distribution $p(Z;\Theta)$,
\[
p(Z;\Theta) = \exp\!\big(\langle Z,\Theta\rangle_F/\tau - A(\Theta)\big), \qquad (2)
\]
where $Z \in \mathcal{Z} \subseteq \mathcal{B}$ is a discrete matrix, $\tau > 0$ is a temperature, and $A(\Theta)$ normalizes $p(Z;\Theta)$ to sum to one. Recognising the matrix form of $\Theta$ and $Z$, eq. (2) uses the Frobenius scalar product,
\[
\langle Z,\Theta\rangle_F := \operatorname{tr} Z^\top \Theta = \sum_{ij} Z_{ij}\,\Theta_{ij}. \qquad (3)
\]
The matrix $Z$ is interpreted as the adjacency matrix of a directed graph on $d$ nodes which, because of its zero diagonal, has no self-loops. In the forward pass, samples $Z^{(s)}$, $s = 1, \dots, S$, from $p(Z;\Theta)$ are taken using the P&M sampling outlined in section 2. For $\mathcal{Z} = \mathcal{B}$, the MAP solver sets matrix elements of $\mathrm{MAP}(\Theta)$ to one if the corresponding element of $\Theta$ is positive and to zero otherwise. We found empirically that it is best to use the standard logistic distribution to generate the noise $\Psi$ in eq. (1). This can also be theoretically justified as a good choice because of the binary nature of $Z$'s matrix elements [32].
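For concreteness, this forward pass can be sketched in Python/PyTorch as below. The sketch assumes eq. (1) takes the common perturb-and-MAP form $Z = \mathrm{MAP}(\Theta + \tau\Psi)$; the function name and tensor shapes are illustrative rather than those of the DAG-DB code.

```python
import torch

def sample_adjacency(theta: torch.Tensor, tau: float = 1.0, n_samples: int = 1) -> torch.Tensor:
    """Illustrative perturb-and-MAP sampling: Z = MAP(Theta + tau * Psi), with
    standard-logistic noise Psi (assumed form of eq. (1))."""
    d = theta.shape[-1]
    u = torch.rand(n_samples, d, d)
    psi = torch.log(u) - torch.log1p(-u)        # standard logistic noise via inverse CDF
    perturbed = theta.unsqueeze(0) + tau * psi
    z = (perturbed > 0).float()                 # MAP for Z = B: entry is 1 iff parameter > 0
    z = z * (1.0 - torch.eye(d)).unsqueeze(0)   # zero diagonal: no self-loops
    return z                                    # shape (n_samples, d, d)
```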
Maximum DAG.
To go from the parameter $\Theta$ and a digraph $Z$ to a DAG, we can consider the digraph edges as weighted by the corresponding matrix elements of $\Theta$, and then solve the associated maximum directed acyclic subgraph problem. Solving this well-known “maximum DAG” problem involves identifying the directed acyclic subgraph with the maximum sum of edge weights. To find the maximum DAG from $\Theta$ and $Z$, we devise both exact and approximate solvers. For exactly solving the maximum DAG problem, we cast it as a constrained combinatorial optimisation problem, which we solve using MiniZinc [33, 34], a free and open-source constraint modelling language. We found it most efficient, and necessary as the number of digraph nodes rose toward 100, to use an approximate maximum DAG solver, “Greedy Feedback Arc Set” [GFAS, 35], our Python implementation being adapted from (more optimised) Java code [36, 37]. Tests for 30 nodes and fewer, for which we could use the exact MiniZinc solver, suggested that the solutions of our approximate GFAS solver were in practice exact, or close to exact, most likely because $Z$ is regularized (see below) towards being a DAG. Whichever maximum DAG solver is chosen, it can be applied either directly as part of the MAP solver, or subsequent to the MAP solver. We adopt the latter option, which has the advantage of allowing an approach in which training is done without the maximum DAG solver, deploying it only at evaluation. In all but one part of an experiment, that ‘evaluation only’ approach is the one we use.
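A minimal Python sketch of a greedy heuristic of this kind is given below. It orders nodes by weighted out-degree minus in-degree and then discards back-edges, which captures the spirit of GFAS but is an illustrative simplification of, not a substitute for, the implementation adapted from [36, 37].

```python
import numpy as np

def greedy_max_dag(theta: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Illustrative greedy heuristic for the maximum DAG problem.

    Edges are those with z[i, j] == 1, weighted by theta[i, j].  Nodes are
    ordered greedily by weighted out-degree minus in-degree on the remaining
    subgraph; edges pointing backwards in that order are removed, so the
    returned adjacency matrix is acyclic."""
    d = theta.shape[0]
    w = np.where(z > 0, theta, 0.0)                 # Theta-weighted adjacency of the digraph
    remaining = set(range(d))
    order = []
    while remaining:
        idx = list(remaining)
        sub = w[np.ix_(idx, idx)]
        score = sub.sum(axis=1) - sub.sum(axis=0)   # out-weight minus in-weight
        chosen = idx[int(np.argmax(score))]
        order.append(chosen)
        remaining.remove(chosen)
    pos = {node: k for k, node in enumerate(order)}
    dag = np.zeros_like(z)
    for i in range(d):
        for j in range(d):
            if z[i, j] and pos[i] < pos[j]:         # keep only forward edges
                dag[i, j] = 1
    return dag
```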
Learning.
As is common for continuous optimisation methods, learning proceeds by solving the problem of predicting the values of the components $x_j$ of $x$ from its values $x_i$ at ‘parent’ nodes, i.e. nodes with edges $i \to j$. As set out in appendix C, this is ensured by applying the graphification operation to $x$ and $Z$ in the context of the linear map $f_\Phi$. Elsewhere, this technique usually employs a weighted adjacency matrix with real elements (see, e.g., [2, 6, 8]); we instead use the binary adjacency matrix $Z$ without such a relaxation. To the extent that we relax the problem, it is by treating the digraph probabilistically via the distribution $p(Z;\Theta)$, which, in practice, becomes concentrated around the most likely adjacency matrix.
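To illustrate this prediction step, the sketch below assumes the linear map $f_\Phi$ acts through a weight matrix $\Phi$ that is masked elementwise by the sampled binary adjacency $Z$, so that each $\tilde{x}_j$ depends only on the parents of node $j$; the precise graphification used in DAG-DB is given in appendix C.

```python
import torch

def predict_from_parents(x: torch.Tensor, z: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
    """Illustrative linear prediction: x_tilde[:, j] = sum_i x[:, i] * Z[i, j] * Phi[i, j].

    x:   (batch, d) data points
    z:   (d, d) sampled binary adjacency matrix (zero diagonal)
    phi: (d, d) weights of the linear map f_Phi (assumed form)"""
    return x @ (z * phi)
```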
The parameters $\Theta$ and $\Phi$ are trained by feeding data points $x \in X$ into the model batch-wise. The mean-squared error between $\tilde{x}$ and $x$ is added to a regularizing function $r(Z)$ to give a loss $\ell$ associated with the sample $Z$. The empirical mean of these losses, over the samples $Z$ and the batch members $x$, gives the batch loss $L$. Our learnt parameters $\Phi$ and $\Theta$ are then updated by backpropagation: while for $\Phi$ this is standard backpropagation, a technique needs to be chosen for discrete backpropagation from $Z$ to $\Theta$. We found I-MLE [1] and straight-through estimation [26, 27] to be useful techniques for such backpropagation; we had less success here with score-function estimation [28] and black-box differentiation [29]. Appendix B gives details of I-MLE and straight-through estimation as used in DAG-DB.
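The batch loss can be sketched as follows, reusing the illustrative helpers above; the straight-through-style pass-through shown for the $Z \to \Theta$ gradient is a simplified stand-in for the estimators detailed in appendix B, and `regulariser` is an assumed placeholder for $r(Z)$.

```python
import torch

def batch_loss(x, theta, phi, tau, n_samples, regulariser):
    """Illustrative batch loss: MSE between x and its parent-based prediction,
    plus a regulariser r(Z), averaged over P&M samples (and the batch via .mean())."""
    losses = []
    for _ in range(n_samples):
        z = sample_adjacency(theta, tau)[0]    # forward pass: discrete sample Z
        # Straight-through-style surrogate: the forward value stays exactly Z, while
        # gradients flow to Theta as if Z depended smoothly on it (cf. appendix B).
        z = z + theta - theta.detach()
        x_tilde = predict_from_parents(x, z, phi)
        mse = ((x_tilde - x) ** 2).mean()      # mean-squared error over the batch
        losses.append(mse + regulariser(z))    # per-sample loss: MSE + r(Z)
    return torch.stack(losses).mean()          # batch loss L
```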
Regularization.
DAG-DB can employ both functional and constraint regularization. Functional regularization is based on that of NOTEARS [2] and provided by the function
\[
r(Z) = \rho_{\mathrm{DAG}}\, r_{\mathrm{DAG}}(Z) + \rho_{\mathrm{sp}}\, r_{\mathrm{sp}}(Z), \qquad (4)
\]