
in reinforcement learning settings. Lee et al. (2020) introduced “meta-dropout”, which involves an additional global term shared across all data points during inference to improve generalization. Xie et al. (2019) replaced the hard dropout mask following a Bernoulli distribution with a soft mask following a beta distribution, and controlled the dropout rate by optimizing with a stochastic gradient variational Bayesian algorithm. Boluki et al. (2020) combined a model-agnostic dropout scheme with variational auto-encoders (VAEs), resulting in semi-implicit VAE models. Instead of using a mean-field family for variational inference, Nguyen et al. (2021) utilized a structured representation of multiplicative Gaussian noise for better posterior estimation. More recently, Fan et al. (2021) developed “contextual dropout”, which optimizes variational objectives in a sample-dependent manner and, to the best of our knowledge, is the closest approach to GFlowOut in the literature.
GFlowOut differs from contextual dropout in several aspects. First, while both methods use trainable priors, GFlowOut additionally considers priors that depend on the input covariate of each data point. Second, the variational posterior of contextual dropout depends only on the input covariate $x$, while in GFlowOut the variational posterior is also conditioned on the label $y$, which provides more information for training. Third, within each neural network layer, the mask of contextual dropout is conditioned on previous masks only implicitly, whereas the mask of GFlowOut is conditioned on previous masks explicitly, by feeding the previous masks directly as inputs into the generator, which improves training. Finally, instead of the REINFORCE-based gradient estimator used to train contextual dropout, GFlowOut employs GFlowNets to learn the variational posterior.
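As an illustration of this conditioning structure, the sketch below shows a per-layer mask generator that takes the input covariate $x$, a vector encoding of the label $y$, and the previously sampled masks as explicit inputs. This is our own minimal example, not the architecture used in the paper; all module and dimension names are hypothetical.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Hypothetical sketch of a per-layer mask posterior q(z_l | x, y, z_{<l}).

    Unlike a posterior conditioned on x alone, this generator also receives
    the label encoding y and the previously sampled masks z_{<l} explicitly.
    """

    def __init__(self, x_dim, y_dim, prev_mask_dim, mask_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim + prev_mask_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mask_dim),
        )

    def forward(self, x, y, prev_masks):
        # Concatenate input covariate, label encoding, and previous masks.
        logits = self.net(torch.cat([x, y, prev_masks], dim=-1))
        probs = torch.sigmoid(logits)
        # Sample a binary dropout mask for the current layer.
        z_l = torch.bernoulli(probs)
        return z_l, probs
```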
2.2. Generative flow networks
Generative flow networks (GFlowNets) (Bengio et al., 2021a;b) are a family of probabilistic models that amortize the sampling of discrete compositional objects in proportion to a given unnormalized density function. GFlowNets learn a stochastic policy that constructs objects through a sequence of actions, akin to deep reinforcement learning (Sutton & Barto, 2018). They are trained so that the likelihood of reaching a terminating state is proportional to its reward. Recent work has shown close connections of GFlowNets to other generative models (Zhang et al., 2022a) and to hierarchical variational inference (Malkin et al., 2023). GFlowNets have achieved empirical success in learning energy-based models (Zhang et al., 2022b), small-molecule generation (Bengio et al., 2021a; Nica et al., 2022; Malkin et al., 2022; Madan et al., 2023; Pan et al., 2023), biological sequence generation (Malkin et al., 2022; Jain et al., 2022; Madan et al., 2023), and structure learning (Deleu et al., 2022). Several training objectives have been proposed for GFlowNets, including Flow Matching (FM) (Bengio et al., 2021a), Detailed Balance (DB) (Bengio et al., 2021b), Trajectory Balance (TB) (Malkin et al., 2022), and the more recent Sub-Trajectory Balance (SubTB) (Madan et al., 2023). In this work, we use the Trajectory Balance (TB) objective.
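For concreteness, the snippet below is a minimal sketch of the TB loss for a single trajectory, assuming the per-step forward and backward log-probabilities have already been collected; the function and argument names are ours, not from the paper.

```python
import torch

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Trajectory Balance loss for one trajectory (Malkin et al., 2022).

    log_Z:      learned estimate of the log partition function (scalar tensor)
    log_pf:     log P_F(s_i | s_{i-1}) for each step of the trajectory
    log_pb:     log P_B(s_{i-1} | s_i) for each step of the trajectory
    log_reward: log R(z) of the terminal state reached by the trajectory
    """
    return (log_Z + log_pf.sum() - log_reward - log_pb.sum()) ** 2
```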
3. Method
In this section, we define the problem setting and mathematical notation used in this study, and describe the proposed method, GFlowOut, for dropout mask generation in detail.
3.1. Background and notation
Dropout. In a vanilla feed-forward neural network (MLP) with $L$ layers, layer $l$ has weight matrix $w_l$ and bias vector $b_l$. It takes as input the activations $h_{l-1}$ of the previous layer and computes as output $h_l = \sigma(w_l h_{l-1} + b_l)$, where $\sigma$ is a non-linear activation function. Dropout consists of dropping out units from the output of a layer. Formally, this can be described as applying a sampled binary mask $z_l \sim p(z_l)$ to the output of each layer, $h_l = z_l \circ \sigma(w_l h_{l-1} + b_l)$. In regular random dropout, $z_l$ is a collection of i.i.d. $\mathrm{Bernoulli}(r)$ variables, where $r$ is a fixed parameter shared by all layers. Recently, several approaches have been proposed to learn $p(z_l)$ along with the model parameters. In these approaches, $z$ is viewed either as a latent variable or as part of the model parameters. We consider two variants of our proposed method: GFlowOut, where the dropout masks $z$ are viewed as sample-dependent latent variables, and ID-GFlowOut, which generates masks in a sample-independent manner, where $z$ is viewed as part of the model parameters shared across all samples. Next, we briefly introduce GFlowNets and describe how they model the dropout masks $z$ given the data.
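The following is a minimal sketch of a single masked MLP layer implementing $h_l = z_l \circ \sigma(w_l h_{l-1} + b_l)$; when no mask is supplied it falls back to regular random dropout with i.i.d. $\mathrm{Bernoulli}(r)$ masks. The class and argument names are ours and serve only to illustrate the notation.

```python
import torch
import torch.nn as nn

class MaskedLayer(nn.Module):
    """One MLP layer whose output is gated by a binary dropout mask:
    h_l = z_l * relu(w_l @ h_{l-1} + b_l)."""

    def __init__(self, in_dim, out_dim, r=0.5):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.r = r  # probability that a unit is kept, fixed across layers

    def forward(self, h_prev, z=None):
        h = torch.relu(self.linear(h_prev))  # sigma(w_l h_{l-1} + b_l)
        if z is None:
            # Regular random dropout: z_l ~ i.i.d. Bernoulli(r).
            z = torch.bernoulli(torch.full_like(h, self.r))
        # Learned approaches (e.g., GFlowOut) would instead pass in a mask z
        # sampled from a learned distribution p(z_l).
        return z * h
```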
GFlowNets. Let $G = (\mathcal{S}, \mathcal{A})$ be a directed acyclic graph (DAG) whose vertices $s \in \mathcal{S}$ are states, including a special initial state $s_0$ with no incoming edges, and whose directed edges $(s \rightarrow s') \in \mathcal{A}$ are actions. $\mathcal{X} \subseteq \mathcal{S}$ denotes the set of terminal states, which have no outgoing edges. A complete trajectory $\tau = (s_0 \rightarrow \dots \rightarrow s_{i-1} \rightarrow s_i \rightarrow \dots \rightarrow z) \in \mathcal{T}$ in $G$ is a sequence of states starting at $s_0$ and terminating at some $z \in \mathcal{X}$, where each $(s_{i-1} \rightarrow s_i) \in \mathcal{A}$. The forward policy $P_F(- \mid s)$ is a collection of distributions over the children of each non-terminal node $s \in \mathcal{S}$ and defines a distribution over complete trajectories, $P_F(\tau) = \prod_{(s_{i-1} \rightarrow s_i) \in \tau} P_F(s_i \mid s_{i-1})$. We can sample terminal states $z \in \mathcal{X}$ by sampling trajectories following $P_F$. Let $\pi(z)$ denote the marginal likelihood of sampling terminal state $z$, i.e., $\pi(z) = \sum_{\tau = (s_0 \rightarrow \dots \rightarrow z) \in \mathcal{T}} P_F(\tau)$. Given a non-negative reward function $R : \mathcal{X} \rightarrow \mathbb{R}^+$, the learning problem tackled in GFlowNets is to estimate $P_F$ such that $\pi(z) \propto R(z)$ for all $z \in \mathcal{X}$.
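As a toy illustration of how a forward policy induces a distribution over terminal states, the sketch below constructs a binary mask one unit at a time, accumulating $\log P_F(\tau) = \sum_i \log P_F(s_i \mid s_{i-1})$ along the way. This is our own example, under the assumption that the policy is a network mapping a partially built mask to per-unit probabilities; it is not the paper's implementation.

```python
import torch
from torch.distributions import Bernoulli

def sample_mask_trajectory(policy_net, mask_dim):
    """Sample a complete trajectory s_0 -> ... -> z that builds a binary mask.

    The state is the partially built mask (-1 marks undecided positions);
    each action sets the next position to 0 or 1; the terminal state z is
    the completed mask.  Returns z and log P_F(tau)."""
    state = -torch.ones(mask_dim)                    # s_0: nothing decided yet
    log_pf = torch.zeros(())
    for i in range(mask_dim):
        p_one = torch.sigmoid(policy_net(state))[i]  # P_F of setting unit i to 1
        dist = Bernoulli(probs=p_one)
        bit = dist.sample()
        log_pf = log_pf + dist.log_prob(bit)
        state = state.clone()
        state[i] = bit                               # move to the child state
    return state, log_pf                             # terminal state z and log P_F(tau)
```

With, for instance, policy_net = torch.nn.Linear(mask_dim, mask_dim), training would then adjust $P_F$ so that the marginal $\pi(z)$ over such trajectories becomes proportional to the reward $R(z)$.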
We refer the reader to Bengio