Learning Discrete Directed Acyclic Graphs
via Backpropagation
Andrew J. Wren¹, Pasquale Minervini¹,², Luca Franceschi⁴*, Valentina Zantedeschi⁵,¹,³
¹University College London  ²University of Edinburgh  ³Inria London  ⁴Amazon Web Services  ⁵ServiceNow
*Work done while at UCL, and prior to joining Amazon.
andrew.wren@ntlworld.com
Abstract
Recently, continuous relaxations have been proposed in order to learn Directed Acyclic Graphs (DAGs) from data by backpropagation, instead of using combinatorial optimization. However, a number of techniques for fully discrete backpropagation could instead be applied. In this paper, we explore that direction and propose DAG-DB, a framework for learning DAGs by Discrete Backpropagation. Based on the architecture of Implicit Maximum Likelihood Estimation [I-MLE, 1], DAG-DB adopts a probabilistic approach to the problem, sampling binary adjacency matrices from an implicit probability distribution. DAG-DB learns a parameter for the distribution from the loss incurred by each sample, performing competitively using either of two fully discrete backpropagation techniques, namely I-MLE and Straight-Through Estimation.
1 Introduction
Aim. Directed Acyclic Graphs (DAGs) occur in a wide range of contexts, including project management, version control systems, evolutionary biology, and Bayesian networks. Bayesian networks have been a particularly popular subject for machine learning, including for the problem considered in this paper, learning a Bayesian network's DAG structure from data. This paper aims to learn DAGs using fully-discrete backpropagation, i.e. avoiding the continuous relaxations which are usually used for learning DAGs from data (see, for example, [2–14]).
Bayesian networks. A Bayesian network associates a DAG with a random variable which has components indexed by the DAG nodes. A directed edge i → j between nodes i and j represents a dependency of the random variable's j component on its i component. Discussion of Bayesian networks and DAGs may be found in textbooks (e.g., see [15–17]). To establish our terminology, appendix A gives a very brief outline.
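As a small illustration (our own toy example, not from the paper), a three-node DAG in which the last node has the other two as parents can be written as a binary adjacency matrix:

```python
import numpy as np

# Adjacency matrix of a 3-node DAG: z[i, j] = 1 encodes a directed edge i -> j,
# i.e. component j of the random variable depends on component i.
z = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])

parents_of_last_node = np.flatnonzero(z[:, 2])   # array([0, 1])
```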
Contribution. Our contribution is to show that backpropagation methods which retain the fully-discrete nature of a DAG (i.e., that do not rely on continuous relaxations) can be used to predict DAGs from data. We avoid the common approach of "relaxing" digraph adjacency matrices from binary to real matrices. Instead, we sample discrete adjacency matrices probabilistically and use backpropagation methods that do not relax those variables to be continuous.
2 Related work and background
Learning DAGs from data. Most ways of learning DAGs from data can be classed as combinatoric methods (see [18] for a review) or continuous optimisation methods (reviewed in [19]).
[Figure 1: Overall DAG-DB architecture, including learnable parameters and loss calculation.]
Combinatoric methods may themselves be labelled as constraint-based, identifying the graph from conditional independence testing, and score-based, searching the space of possible graphs using a score function to evaluate search results [18]. The PC algorithm [20] is a well-known constraint-based method, with a number of variant algorithms, such as PC-Stable [21], which reduces dependency on arbitrary node ordering. An example of a score-based approach is the Fast Greedy Equivalence Search (FGES) algorithm [22], which is an optimised version of the Greedy Equivalence Search (GES) [23, 24]. The development of NOTEARS in 2018 [2] for linearly-generated data signalled the start of widespread development of continuous optimisation approaches, including the GOLEM method [8], which is generally even more successful than NOTEARS for linear data with, for example, Gaussian noise. Many other continuous methods have recently been developed [for example, 3–7, 9–14], and the method presented in this paper, which uses gradient descent, may be viewed as related to these. However, continuous methods generally adopt some approach of 'relaxing' the discrete property of an edge being present or absent into a continuous variable, whereas our DAG-DB approach maintains edges as fully discrete, binary objects, even during our method's training phase. Our approach might be termed a 'probabilistic relaxation', and another example of this can be found in the SDI [25] framework, which focuses on contexts where data may be generated by interventions during training, whereas, here, our focus is on pre-generated 'observational' data.
Discrete backpropagation. A number of methods have also been developed to allow backpropagation without transforming discrete variables to be continuous in training. We take four such methods as examples. Straight-through estimation [STE, 26, 27] ignores the derivative of a discrete function during backpropagation and passes on the incoming gradient as if the function were the identity function; this simple approach, as we will see, can be surprisingly effective. Score-function estimation [SFE, 28], also referred to as REINFORCE, rewrites the gradient of an expectation as an expectation using the log-derivative trick, which can then be estimated using Monte Carlo methods. Black-box differentiation [BB, 29] is a way of adjusting continuous inputs to a combinatorial solver in order to mimic gradient descent for the discrete variable. Implicit maximum likelihood estimation [I-MLE, 1] incorporates noise into BB to handle discrete random variables. This allows combinatorial solvers to be used for maximum likelihood estimation. I-MLE also makes use of both STE and a seminal result by Domke [30]. We note that SDI [25], highlighted above, uses a particular Bernoulli-based SFE-like form of backpropagation [31].
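To make the straight-through idea concrete, here is a minimal PyTorch sketch (our illustration, not the implementation of [26, 27] or of DAG-DB): a thresholding operation is given an identity backward pass, so gradients reach the continuous parameter even though the forward pass is discrete.

```python
import torch

class StraightThroughThreshold(torch.autograd.Function):
    """Binarise in the forward pass; pretend to be the identity in the backward
    pass, so the incoming gradient reaches the continuous parameter unchanged."""

    @staticmethod
    def forward(ctx, theta):
        return (theta > 0).float()          # discrete output

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                  # straight-through: treat d(out)/d(theta) as identity

theta = torch.randn(5, 5, requires_grad=True)
z = StraightThroughThreshold.apply(theta)   # binary matrix in the forward pass
z.sum().backward()                          # gradients still flow back to theta
```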
Perturb-and-MAP sampling. We use the technique of perturb-and-MAP (P&M) sampling of random variables [32]. This approximates a matrix distribution, say p(Z; Θ) with parameter Θ, as

    p(Z; Θ) ≈ MAP(Θ + τΨ),    (1)

where each element of the noise matrix Ψ is i.i.d. and sampled from a standard one-dimensional distribution, τ > 0 is a temperature, and MAP(Θ + τΨ) denotes the most likely value of Z with respect to p(Z; Θ + τΨ). P&M is useful when sampling p(Z; Θ) directly is intractable or expensive but the MAP solver is cheap; it also facilitates I-MLE's approach to backpropagation.
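A minimal sketch of eq. (1) in Python (the function name and the generic map_solver argument are ours, for illustration only):

```python
import numpy as np

def perturb_and_map_sample(theta, map_solver, tau=1.0, rng=None):
    """Approximate a sample from p(Z; theta) via perturb-and-MAP, eq. (1):
    perturb the parameter with i.i.d. noise and take the MAP state of the
    perturbed distribution.  Standard logistic noise is used, as in DAG-DB."""
    rng = np.random.default_rng() if rng is None else rng
    psi = rng.logistic(loc=0.0, scale=1.0, size=theta.shape)  # noise matrix Psi
    return map_solver(theta + tau * psi)
```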
3 Method
Framework. Figure 1 gives an overview of the DAG-DB framework. Suppose we wish to learn a DAG with d nodes from a dataset X ∈ ℝ^{n×d} of n data points x ∈ ℝ^d. Let R ⊂ ℝ^{d×d} and B ⊂ {0,1}^{d×d} be the sets of zero-diagonal real and binary matrices, respectively. A learnable vector Θ ∈ R parameterises an exponential family distribution p(Z; Θ),

    p(Z; Θ) = exp(⟨Z, Θ⟩_F − A(Θ)),    (2)

where Z ∈ 𝒵 ⊆ B is a discrete matrix, τ > 0 is a temperature, and A(Θ) normalizes p(Z; Θ) to sum to one. Recognising the matrix form of Θ and Z, eq. (2) uses the Frobenius scalar product,

    ⟨Z, Θ⟩_F := tr(ZᵀΘ) = Σ_{ij} Z_{ij} Θ_{ij}.    (3)

The matrix Z is interpreted as the adjacency matrix of a directed graph on d nodes, with, because of its zero diagonal, no self-loops. In the forward pass, samples Z^(s), s = 1, …, S, from p(Z; Θ) are taken using the P&M sampling outlined in section 2. For 𝒵 = B, the MAP solver sets the matrix elements of MAP(Θ) to one where Θ_{ij} > 0 and to zero otherwise. We found empirically that it is best to use the standard logistic distribution to generate the noise Ψ in eq. (1). This can also be theoretically justified as a good choice because of the binary nature of Z's matrix elements [32].
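For 𝒵 = B, the MAP solver and the quantities in eqs. (2)–(3) are simple to write down. The following sketch (our code, illustrative only, with self-contained sampling rather than the paper's implementation) shows the thresholding MAP solver, the Frobenius inner product, and a forward pass drawing S samples Z^(s) with logistic perturbations as in eq. (1):

```python
import numpy as np

def map_solver_binary(theta):
    """MAP state over B: include edge (i, j) exactly where theta[i, j] > 0,
    and force a zero diagonal so the digraph has no self-loops."""
    z = (theta > 0).astype(float)
    np.fill_diagonal(z, 0.0)
    return z

def frobenius_inner(z, theta):
    """<Z, Theta>_F of eq. (3); subtracting A(Theta) would give log p(Z; Theta)
    as in eq. (2)."""
    return float(np.sum(z * theta))

# Forward pass: S perturb-and-MAP samples Z^(s), eq. (1), with standard logistic noise.
rng = np.random.default_rng(0)
d, S, tau = 10, 7, 1.0
theta = rng.standard_normal((d, d))
samples = [map_solver_binary(theta + tau * rng.logistic(size=(d, d))) for _ in range(S)]
```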
Maximum DAG. To go from the parameter Θ and a digraph Z to a DAG, we can consider the digraph edges as weighted by the corresponding matrix elements of Θ, and then solve the associated maximum directed acyclic subgraph problem. Solving this well-known "maximum DAG" problem involves identifying the directed acyclic subgraph with the maximum sum of weights. To find the maximum DAG from Θ and Z, we devise both exact and approximate solvers. For exactly solving the maximum DAG problem, we cast it as a constrained combinatorial optimisation problem, which we solve using MiniZinc [33, 34], a free and open-source constraint modelling language. We found it most efficient, and necessary as the number of digraph nodes rose toward 100, to use an approximate maximum DAG solver, "Greedy Feedback Arc Set" [GFAS, 35], our Python implementation being adapted from (more optimised) Java code [36, 37]. Tests for 30 nodes and fewer, for which we could use the exact MiniZinc solver, suggested that the solutions of our approximate GFAS solver were in practice exact, or close to exact, most likely because Z is regularized towards being a DAG (see below). Whatever maximum DAG solver is chosen, it can be applied either directly as part of the MAP solver or subsequent to the MAP solver. We adopt the latter option, which has the advantage of allowing an approach in which training is done without the maximum DAG solver, deploying it only at evaluation. In all but one part of an experiment, that 'evaluation only' approach is the one we will use.
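The sketch below (ours, deliberately simplified; not the MiniZinc model or the Java-derived GFAS implementation used in DAG-DB) illustrates the greedy idea: order the nodes by weighted out-degree minus in-degree, then keep only the edges of Z that point forward in that order, which is necessarily acyclic.

```python
import numpy as np

def greedy_node_order(w):
    """Greedily order nodes by weighted out-degree minus weighted in-degree,
    so that heavy edges tend to point 'forward' in the returned order."""
    remaining = list(range(w.shape[0]))
    order = []
    while remaining:
        sub = w[np.ix_(remaining, remaining)]
        delta = sub.sum(axis=1) - sub.sum(axis=0)   # out-weight minus in-weight
        order.append(remaining.pop(int(np.argmax(delta))))
    return order

def approximate_max_dag(theta, z):
    """Keep only the edges of the sampled digraph z that point forward in the
    greedy order computed from theta-weighted edges; the result is acyclic
    because every kept edge respects a single total order of the nodes."""
    w = np.maximum(theta, 0.0) * z                  # positive part of theta as edge weights
    order = greedy_node_order(w)
    rank = {node: k for k, node in enumerate(order)}
    dag = np.zeros_like(z)
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            if z[i, j] and rank[i] < rank[j]:       # forward edges only
                dag[i, j] = 1
    return dag
```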
Learning. As is common for continuous optimisation methods, learning proceeds by solving the problem of predicting the value of each component x_j of x from its values x_i at 'parent' nodes with edges i → j. As set out in appendix C, this is ensured by the graphification operation combining x and Z in the context of the linear map f_Φ. Elsewhere, this technique usually employs a weighted adjacency matrix with real elements (see, e.g., [2, 6, 8]); but we use the binary adjacency matrix Z without such a relaxation. To the extent that we relax the problem, it is by treating the digraph probabilistically via the distribution p(Z; Θ), which, in practice, becomes concentrated around the most likely adjacency matrix.
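Appendix C defines the graphification operation precisely; purely for illustration, one plausible linear instantiation (our guess at the shape of the computation, with Φ taken to be a d × d weight matrix masked by Z) is:

```python
import torch

def predict_from_parents(x, z, phi):
    """Hypothetical masked-linear prediction: each x~_j is built only from
    parents i with z[i, j] = 1.  Shapes: x (batch, d), z (d, d) binary, phi (d, d)."""
    return x @ (z * phi)       # x_tilde[:, j] = sum_i x[:, i] * z[i, j] * phi[i, j]

batch, d = 32, 10
x = torch.randn(batch, d)
phi = torch.randn(d, d, requires_grad=True)
z = (torch.randn(d, d) > 0).float().fill_diagonal_(0)   # stand-in sampled adjacency
x_tilde = predict_from_parents(x, z, phi)
mse = torch.mean((x - x_tilde) ** 2)                    # MSE(x, x_tilde), as in Figure 1
```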
The parameters Θ and Φ are trained by feeding data points x ∈ X into the model batch-wise. The mean-squared error between x̃ and x is added to a regularizing function r(Z) to give a loss ℓ associated with the sample Z. The empirical mean of these losses, over the samples Z and the batch members x, gives the batch loss L. Our learnt parameters Φ and Θ are then updated by backpropagation: while for Φ this is standard backpropagation, a technique needs to be chosen for discrete backpropagation from Z to Θ. We found I-MLE [1] and straight-through estimation [26, 27] to be useful techniques for such backpropagation; we had less success here with score-function estimation [28] and black-box differentiation [29]. Appendix B gives details of I-MLE and straight-through estimation as used in DAG-DB.
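Putting the pieces together, a hedged sketch of the batch loss L (assuming the masked-linear form of f_Φ sketched above and leaving the regulariser r abstract; in DAG-DB the samples Z^(s) would carry gradients back to Θ via I-MLE or straight-through estimation, e.g. the StraightThroughThreshold sketch in section 2):

```python
import torch

def batch_loss(x_batch, samples, phi, r):
    """Empirical mean, over samples Z^(s) and batch members x, of
    MSE(x, x_tilde) + r(Z).  `samples` is a list of (d, d) adjacency matrices
    and `r` the regularising function of eq. (4)."""
    per_sample = []
    for z in samples:
        x_tilde = x_batch @ (z * phi)                       # masked linear f_Phi
        mse = torch.mean((x_batch - x_tilde) ** 2, dim=1)   # loss per data point
        per_sample.append(mse + r(z))                       # ell = MSE + r(Z)
    return torch.mean(torch.stack(per_sample))              # batch loss L
```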
Regularization. DAG-DB can employ both functional and constraint regularization. Functional regularization is based on that of NOTEARS [2] and provided by the function

    r(Z) = ρ_DAG r_DAG(Z) + ρ_sp r_sp(Z),    (4)