Learning Discrete Directed Acyclic Graphs
via Backpropagation
Andrew J. Wren¹, Pasquale Minervini¹,², Luca Franceschi⁴*, Valentina Zantedeschi⁵,¹,³
¹University College London  ²University of Edinburgh  ³Inria London  ⁴Amazon Web Services  ⁵ServiceNow
*Work done while at UCL, and prior to joining Amazon.
andrew.wren@ntlworld.com
Abstract
Recently, continuous relaxations have been proposed in order to learn Directed Acyclic Graphs (DAGs) from data by backpropagation, instead of using combinatorial optimization. However, a number of techniques for fully discrete backpropagation could instead be applied. In this paper, we explore that direction and propose DAG-DB, a framework for learning DAGs by Discrete Backpropagation. Based on the architecture of Implicit Maximum Likelihood Estimation [I-MLE, 1], DAG-DB adopts a probabilistic approach to the problem, sampling binary adjacency matrices from an implicit probability distribution. DAG-DB learns a parameter for the distribution from the loss incurred by each sample, performing competitively using either of two fully discrete backpropagation techniques, namely I-MLE and Straight-Through Estimation.
1 Introduction
Aim. Directed Acyclic Graphs (DAGs) occur in a wide range of contexts, including project management, version control systems, evolutionary biology, and Bayesian networks. Bayesian networks have been a particularly popular subject for machine learning, including for the problem considered in this paper, learning a Bayesian network's DAG structure from data. This paper aims to learn DAGs using fully-discrete backpropagation, i.e. avoiding the continuous relaxations which are usually used for learning DAGs from data (see, for example, [2–14]).
Bayesian networks. A Bayesian network associates a DAG with a random variable which has components indexed by the DAG nodes. A directed edge i → j between nodes i and j represents a dependency of the random variable's j component on its i component. Discussion of Bayesian networks and DAGs may be found in textbooks (e.g., see [15–17]). To establish our terminology, appendix A gives a very brief outline.
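As a small illustration (our own toy example, not from the paper), a three-node DAG in which the last node has the other two as parents can be written as a binary adjacency matrix:

```python
import numpy as np

# Adjacency matrix of a 3-node DAG: z[i, j] = 1 encodes a directed edge i -> j,
# i.e. component j of the random variable depends on component i.
z = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])

parents_of_last_node = np.flatnonzero(z[:, 2])   # array([0, 1])
```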
Contribution. Our contribution is to show that backpropagation methods which retain the fully-discrete nature of a DAG (i.e., that do not rely on continuous relaxations) can be used to predict DAGs from data. We avoid the common approach of "relaxing" digraph adjacency matrices from binary to real matrices. Instead, we sample discrete adjacency matrices probabilistically and use backpropagation methods that do not relax those variables to be continuous.
2 Related work and background
Learning DAGs from data. Most ways of learning DAGs from data can be classed as combinatoric methods (see [18] for a review) or continuous optimisation methods (reviewed in [19]).
[Figure 1: Overall DAG-DB architecture, including learnable parameters and loss calculation.]
Combinatoric methods may themselves be labelled as constraint-based, identifying the graph from conditional independence testing, and score-based, searching the space of possible graphs using a score function to evaluate search results [18]. The PC algorithm [20] is a well-known constraint-based method, with a number of variant algorithms, such as PC-Stable [21], which reduces dependency on arbitrary node ordering. An example of a score-based approach is the Fast Greedy Equivalence Search (FGES) algorithm [22], which is an optimised version of the Greedy Equivalence Search (GES) [23, 24]. The development of NOTEARS in 2018 [2] for linearly-generated data signalled the start of widespread development of continuous optimisation approaches, including the GOLEM method [8], which is generally even more successful than NOTEARS for linear data with, for example, Gaussian noise. Many other continuous methods have recently been developed [for example, 3–7, 9–14], and the method presented in this paper, which uses gradient descent, may be viewed as related to these. However, continuous methods generally adopt some approach of 'relaxing' the discrete property of an edge being present or absent into a continuous variable, whereas our DAG-DB approach maintains edges as fully discrete, binary objects, even during our method's training phase. Our approach might be termed a 'probabilistic relaxation', and another example of this can be found in the SDI [25] framework, which focuses on contexts where data may be generated by interventions during training, whereas, here, our focus is on pre-generated 'observational' data.
Discrete backpropagation. A number of methods have also been developed to allow backpropagation without transforming discrete variables to be continuous in training. We take four such methods as examples. Straight-through estimation [STE, 26, 27] ignores the derivative of a discrete function during backpropagation and passes on the incoming gradient as if the function were the identity function; this simple approach, as we will see, can be surprisingly effective. Score-function estimation [SFE, 28], also referred to as REINFORCE, rewrites the gradient of an expectation as an expectation using the log-derivative trick, which can then be estimated using Monte Carlo methods. Black-box differentiation [BB, 29] is a way of adjusting continuous inputs to a combinatorial solver in order to mimic gradient descent for the discrete variable. Implicit maximum likelihood estimation [I-MLE, 1] incorporates noise into BB to handle discrete random variables. This allows combinatorial solvers to be used for maximum likelihood estimation. I-MLE also makes use of both STE and a seminal result by Domke [30]. We note that SDI [25], highlighted above, uses a particular Bernoulli-based SFE-like form of backpropagation [31].
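To make the straight-through idea concrete, here is a minimal PyTorch sketch (our illustration, not the implementation of [26, 27] or of DAG-DB): a thresholding operation is given an identity backward pass, so gradients reach the continuous parameter even though the forward pass is discrete.

```python
import torch

class StraightThroughThreshold(torch.autograd.Function):
    """Binarise in the forward pass; pretend to be the identity in the backward
    pass, so the incoming gradient reaches the continuous parameter unchanged."""

    @staticmethod
    def forward(ctx, theta):
        return (theta > 0).float()          # discrete output

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                  # straight-through: treat d(out)/d(theta) as identity

theta = torch.randn(5, 5, requires_grad=True)
z = StraightThroughThreshold.apply(theta)   # binary matrix in the forward pass
z.sum().backward()                          # gradients still flow back to theta
```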
Perturb-and-MAP sampling. We use the technique of perturb-and-MAP (P&M) sampling of random variables [32]. This approximates a matrix distribution, say p(Z; Θ) with parameter Θ, as

    p(Z; Θ) ≈ MAP(Θ + τΨ),    (1)

where each element of the noise matrix Ψ is i.i.d. and sampled from a standard one-dimensional distribution, τ > 0 is a temperature, and MAP(Θ + τΨ) denotes the most likely value of Z with respect to p(Z; Θ + τΨ). P&M is useful when sampling p(Z; Θ) directly is intractable or expensive but the MAP solver is cheap; it also facilitates I-MLE's approach to backpropagation.
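A minimal sketch of eq. (1) in Python (the function name and the generic map_solver argument are ours, for illustration only):

```python
import numpy as np

def perturb_and_map_sample(theta, map_solver, tau=1.0, rng=None):
    """Approximate a sample from p(Z; theta) via perturb-and-MAP, eq. (1):
    perturb the parameter with i.i.d. noise and take the MAP state of the
    perturbed distribution.  Standard logistic noise is used, as in DAG-DB."""
    rng = np.random.default_rng() if rng is None else rng
    psi = rng.logistic(loc=0.0, scale=1.0, size=theta.shape)  # noise matrix Psi
    return map_solver(theta + tau * psi)
```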
3 Method
Framework. Figure 1 gives an overview of the DAG-DB framework. Suppose we wish to learn a DAG with d nodes from a dataset X ∈ ℝ^{n×d} of n data points x ∈ ℝ^d. Let R ⊂ ℝ^{d×d} and B ⊂ {0,1}^{d×d} be the sets of zero-diagonal real and binary matrices, respectively. A learnable vector Θ ∈ R parameterises an exponential family distribution p(Z; Θ),

    p(Z; Θ) = exp(⟨Z, Θ⟩_F − A(Θ)),    (2)

where Z ∈ 𝒵 ⊆ B is a discrete matrix, τ > 0 is a temperature, and A(Θ) normalizes p(Z; Θ) to sum to one. Recognising the matrix form of Θ and Z, eq. (2) uses the Frobenius scalar product,

    ⟨Z, Θ⟩_F := tr(ZᵀΘ) = Σ_{ij} Z_{ij} Θ_{ij}.    (3)

The matrix Z is interpreted as the adjacency matrix of a directed graph on d nodes, with, because of its zero diagonal, no self-loops. In the forward pass, samples Z^(s), s = 1, …, S, from p(Z; Θ) are taken using the P&M sampling outlined in section 2. For 𝒵 = B, the MAP solver sets the matrix elements of MAP(Θ) to one where Θ_{ij} > 0 and to zero otherwise. We found empirically that it is best to use the standard logistic distribution to generate the noise Ψ in eq. (1). This can also be theoretically justified as a good choice because of the binary nature of Z's matrix elements [32].
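For 𝒵 = B, the MAP solver and the quantities in eqs. (2)–(3) are simple to write down. The following sketch (our code, illustrative only, with self-contained sampling rather than the paper's implementation) shows the thresholding MAP solver, the Frobenius inner product, and a forward pass drawing S samples Z^(s) with logistic perturbations as in eq. (1):

```python
import numpy as np

def map_solver_binary(theta):
    """MAP state over B: include edge (i, j) exactly where theta[i, j] > 0,
    and force a zero diagonal so the digraph has no self-loops."""
    z = (theta > 0).astype(float)
    np.fill_diagonal(z, 0.0)
    return z

def frobenius_inner(z, theta):
    """<Z, Theta>_F of eq. (3); subtracting A(Theta) would give log p(Z; Theta)
    as in eq. (2)."""
    return float(np.sum(z * theta))

# Forward pass: S perturb-and-MAP samples Z^(s), eq. (1), with standard logistic noise.
rng = np.random.default_rng(0)
d, S, tau = 10, 7, 1.0
theta = rng.standard_normal((d, d))
samples = [map_solver_binary(theta + tau * rng.logistic(size=(d, d))) for _ in range(S)]
```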
Maximum DAG. To go from the parameter Θ and a digraph Z to a DAG, we can consider the digraph edges as weighted by the corresponding matrix elements of Θ, and then solve the associated maximum directed acyclic subgraph problem. Solving this well-known "maximum DAG" problem involves identifying the directed acyclic subgraph with the maximum sum of weights. To find the maximum DAG from Θ and Z, we devise both exact and approximate solvers. For exactly solving the maximum DAG problem, we cast it as a constrained combinatorial optimisation problem, which we solve using MiniZinc [33, 34], a free and open-source constraint modelling language. We found it most efficient, and necessary as the number of digraph nodes rose toward 100, to use an approximate maximum DAG solver, "Greedy Feedback Arc Set" [GFAS, 35], our Python implementation being adapted from (more optimised) Java code [36, 37]. Tests for 30 nodes and fewer, for which we could use the exact MiniZinc solver, suggested that the solutions of our approximate GFAS solver were in practice exact, or close to exact, most likely because Z is regularized towards being a DAG (see below). Whatever maximum DAG solver is chosen, it can be applied either directly as part of the MAP solver or subsequent to the MAP solver. We adopt the latter option, which has the advantage of allowing an approach in which training is done without the maximum DAG solver, deploying it only at evaluation. In all but one part of an experiment, that 'evaluation only' approach is the one we will use.
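The sketch below (ours, deliberately simplified; not the MiniZinc model or the Java-derived GFAS implementation used in DAG-DB) illustrates the greedy idea: order the nodes by weighted out-degree minus in-degree, then keep only the edges of Z that point forward in that order, which is necessarily acyclic.

```python
import numpy as np

def greedy_node_order(w):
    """Greedily order nodes by weighted out-degree minus weighted in-degree,
    so that heavy edges tend to point 'forward' in the returned order."""
    remaining = list(range(w.shape[0]))
    order = []
    while remaining:
        sub = w[np.ix_(remaining, remaining)]
        delta = sub.sum(axis=1) - sub.sum(axis=0)   # out-weight minus in-weight
        order.append(remaining.pop(int(np.argmax(delta))))
    return order

def approximate_max_dag(theta, z):
    """Keep only the edges of the sampled digraph z that point forward in the
    greedy order computed from theta-weighted edges; the result is acyclic
    because every kept edge respects a single total order of the nodes."""
    w = np.maximum(theta, 0.0) * z                  # positive part of theta as edge weights
    order = greedy_node_order(w)
    rank = {node: k for k, node in enumerate(order)}
    dag = np.zeros_like(z)
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            if z[i, j] and rank[i] < rank[j]:       # forward edges only
                dag[i, j] = 1
    return dag
```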
Learning. As is common for continuous optimisation methods, learning proceeds by solving the problem of predicting the value of each component x_j of x from its values x_i at 'parent' nodes with edges i → j. As set out in appendix C, this is ensured by the graphification operation combining x and Z in the context of the linear map f_Φ. Elsewhere, this technique usually employs a weighted adjacency matrix with real elements (see, e.g., [2, 6, 8]); but we use the binary adjacency matrix Z without such a relaxation. To the extent that we relax the problem, it is by treating the digraph probabilistically via the distribution p(Z; Θ), which, in practice, becomes concentrated around the most likely adjacency matrix.
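Appendix C defines the graphification operation precisely; purely for illustration, one plausible linear instantiation (our guess at the shape of the computation, with Φ taken to be a d × d weight matrix masked by Z) is:

```python
import torch

def predict_from_parents(x, z, phi):
    """Hypothetical masked-linear prediction: each x~_j is built only from
    parents i with z[i, j] = 1.  Shapes: x (batch, d), z (d, d) binary, phi (d, d)."""
    return x @ (z * phi)       # x_tilde[:, j] = sum_i x[:, i] * z[i, j] * phi[i, j]

batch, d = 32, 10
x = torch.randn(batch, d)
phi = torch.randn(d, d, requires_grad=True)
z = (torch.randn(d, d) > 0).float().fill_diagonal_(0)   # stand-in sampled adjacency
x_tilde = predict_from_parents(x, z, phi)
mse = torch.mean((x - x_tilde) ** 2)                    # MSE(x, x_tilde), as in Figure 1
```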
The parameters Θ and Φ are trained by feeding data points x ∈ X into the model batch-wise. The mean-squared error between x̃ and x is added to a regularizing function r(Z) to give a loss ℓ associated with the sample Z. The empirical mean of these losses, over the samples Z and the batch members x, gives the batch loss L. Our learnt parameters Φ and Θ are then updated by backpropagation: while for Φ this is standard backpropagation, a technique needs to be chosen for discrete backpropagation from Z to Θ. We found I-MLE [1] and straight-through estimation [26, 27] to be useful techniques for such backpropagation; we had less success here with score-function estimation [28] and black-box differentiation [29]. Appendix B gives details of I-MLE and straight-through estimation as used in DAG-DB.
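Putting the pieces together, a hedged sketch of the batch loss L (assuming the masked-linear form of f_Φ sketched above and leaving the regulariser r abstract; in DAG-DB the samples Z^(s) would carry gradients back to Θ via I-MLE or straight-through estimation, e.g. the StraightThroughThreshold sketch in section 2):

```python
import torch

def batch_loss(x_batch, samples, phi, r):
    """Empirical mean, over samples Z^(s) and batch members x, of
    MSE(x, x_tilde) + r(Z).  `samples` is a list of (d, d) adjacency matrices
    and `r` the regularising function of eq. (4)."""
    per_sample = []
    for z in samples:
        x_tilde = x_batch @ (z * phi)                       # masked linear f_Phi
        mse = torch.mean((x_batch - x_tilde) ** 2, dim=1)   # loss per data point
        per_sample.append(mse + r(z))                       # ell = MSE + r(Z)
    return torch.mean(torch.stack(per_sample))              # batch loss L
```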
Regularization. DAG-DB can employ both functional and constraint regularization. Functional regularization is based on that of NOTEARS [2] and provided by the function

    r(Z) = ρ_DAG r_DAG(Z) + ρ_sp r_sp(Z),    (4)