GFlowOut: Dropout with Generative Flow Networks
Dianbo Liu 1 2  Moksh Jain 1 3  Bonaventure F. P. Dossou 1 4 5  Qianli Shen 6  Salem Lahlou 1 3  Anirudh Goyal 7  Nikolay Malkin 1 3  Chris C. Emezue 1 8  Dinghuai Zhang 1 3  Nadhir Hassen 1 3  Xu Ji 1 3  Kenji Kawaguchi 6  Yoshua Bengio 1 3 9
Abstract
Bayesian inference offers principled tools to
tackle many critical problems with modern neural
networks such as poor calibration and general-
ization, and data inefficiency. However, scaling
Bayesian inference to large architectures is chal-
lenging and requires restrictive approximations.
Monte Carlo Dropout has been widely used as a
relatively cheap way to approximate inference and
estimate uncertainty with deep neural networks.
Traditionally, the dropout mask is sampled inde-
pendently from a fixed distribution. Recent re-
search shows that the dropout mask can be seen
as a latent variable, which can be inferred with
variational inference. These methods face two im-
portant challenges: (a) the posterior distribution
over masks can be highly multi-modal which can
be difficult to approximate with standard varia-
tional inference and (b) it is not trivial to fully
utilize sample-dependent information and correla-
tion among dropout masks to improve posterior
estimation. In this work, we propose GFlowOut
to address these issues. GFlowOut leverages the
recently proposed probabilistic framework of Gen-
erative Flow Networks (GFlowNets) to learn the
posterior distribution over dropout masks. We
empirically demonstrate that GFlowOut results in
predictive distributions that generalize better to
out-of-distribution data and provide uncertainty
estimates which lead to better performance in
downstream tasks.
1 Mila Quebec AI Institute  2 Broad Institute of MIT and Harvard  3 University of Montreal  4 McGill University  5 Lelapa AI  6 National University of Singapore  7 Google DeepMind  8 Technical University of Munich  9 CIFAR AI Chair. Correspondence to: Dianbo Liu <dianbo.liu@mila.quebec>, Moksh Jain <moksh.jain@mila.quebec>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
1. Introduction
A key shortcoming of modern deep neural networks is that they are often overconfident about their predictions, especially when there is a distributional shift between the training and test datasets (Daxberger et al., 2021; Nguyen et al., 2015; Guo et al., 2017). In risk-sensitive scenarios such as clinical practice and drug discovery, where mistakes can be extremely costly, it is important that models provide predictions with reliable uncertainty estimates (Bhatt et al., 2021).
Bayesian inference offers principled tools to model the parameters of neural networks as random variables, placing a prior on them and inferring their posterior given some observed data (MacKay, 1992; Neal, 2012). The posterior captures the uncertainty in the predictions of the model and also serves as an effective regularization strategy, resulting in improved generalization (Wilson & Izmailov, 2020; Lotfi et al., 2022). In practice, exact Bayesian inference is often intractable, and existing Bayesian deep learning methods rely on assumptions that result in posteriors that are less expressive and can provide poorly calibrated uncertainty estimates (Ovadia et al., 2019; Fort et al., 2019; Foong et al., 2020; Daxberger et al., 2021). In addition, even with several approximations, Bayesian deep learning methods are often significantly more computationally expensive and slower to train compared to non-Bayesian methods (Kuleshov et al., 2018; Boluki et al., 2020).
Gal and Ghahramani (2016) show that deep neural networks with dropout perform approximate Bayesian inference and approximate the posterior of a deep Gaussian process (Damianou & Lawrence, 2013). One can obtain samples from this predictive distribution by taking multiple forward passes through the neural network with independently sampled dropout masks. Due to its simplicity and minimal computational overhead, dropout has since been used as a method to estimate uncertainty and improve robustness in neural networks. Different variants of dropout have been proposed and can be interpreted as different variational approximations to model the posterior over the neural network parameters (Ba & Frey, 2013; Kingma et al., 2015; Gal et al., 2017; Ghiasi et al., 2018; Fan et al., 2021; Pham & Le, 2021).
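As an illustration of this procedure, the sketch below keeps dropout active at prediction time and aggregates several stochastic forward passes into a predictive mean and a simple spread-based uncertainty estimate. The architecture, layer sizes, and number of samples are arbitrary choices for the example, not values used in this paper.

```python
import torch
import torch.nn as nn

# Hypothetical network: sizes are placeholders, not the models used in the paper.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # mask resampled i.i.d. Bernoulli on every forward pass
    nn.Linear(64, 10),
)

def mc_dropout_predict(model, x, num_samples=20):
    """Monte Carlo dropout: aggregate predictions over stochastic forward passes."""
    model.train()  # keep dropout layers stochastic even at prediction time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(num_samples)])
    # Predictive mean and a simple spread-based uncertainty estimate per output.
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(8, 32)                 # a batch of 8 inputs
mean, uncertainty = mc_dropout_predict(model, x)
```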
Figure 1. In this work, we propose a Generative Flow Network (GFlowNet) based binary dropout mask generator, which we refer to as GFlowOut. Purple squares are GFlowNet-based dropout mask generators parameterized as multi-layer perceptrons. $z_{i,l}$ refers to the dropout mask for the data point indexed by $i$ at layer $l$ of the model. $h_{i,l}$ refers to the activations of the model at layer $l$ given input $x_i$. $q(\cdot)$ are auxiliary variational functions used and adapted only during model training, in which the posterior distribution over dropout masks is conditioned implicitly on the input covariates ($x_i$) and directly on the label ($y_i$) of the data point to make the estimation easier. $p(\cdot)$ are mask generation functions used at test time, which are conditioned only on $x_i$ and trained by minimizing the Kullback-Leibler (KL) divergence with $q(\cdot)$. In addition, both $q(\cdot)$ and $p(\cdot)$ condition explicitly on the dropout masks of all previous layers.
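To make the relationship between $q(\cdot)$ and $p(\cdot)$ in Figure 1 concrete, the sketch below computes the KL divergence between two factorized Bernoulli mask distributions. The factorized form, the KL direction, and the tensor shapes are illustrative assumptions rather than the exact objective used in the paper.

```python
import torch

def bernoulli_mask_kl(q_probs, p_probs, eps=1e-8):
    """KL(q || p) between factorized Bernoulli mask distributions.

    q_probs: per-unit keep-probabilities from the training-time generator q(. | x, y)
    p_probs: per-unit keep-probabilities from the test-time generator  p(. | x)
    Both tensors share the same shape, e.g. (batch, num_units).
    """
    q = q_probs.clamp(eps, 1 - eps)
    p = p_probs.clamp(eps, 1 - eps)
    kl_per_unit = q * torch.log(q / p) + (1 - q) * torch.log((1 - q) / (1 - p))
    return kl_per_unit.sum(dim=-1)   # one KL value per data point

# Example with random probabilities standing in for the two generators' outputs.
q_probs = torch.rand(8, 64)
p_probs = torch.rand(8, 64)
kl = bernoulli_mask_kl(q_probs, p_probs)
```

Minimizing such a quantity with respect to the parameters that produce `p_probs` is one way to realize the caption's statement that $p(\cdot)$ is trained by minimizing its KL divergence with $q(\cdot)$.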
There are a few major challenges in approximating the Bayesian posterior over model parameters using dropout: (1) the multimodal nature of the posterior distribution makes it difficult to approximate with standard variational inference (Gal & Ghahramani, 2016; Le Folgoc et al., 2021), which assumes factorized priors; (2) dropout masks are discrete objects, making gradient-based optimization difficult (Boluki et al., 2020); (3) variational inference methods can suffer from high gradient variance, resulting in optimization instability (Kingma et al., 2015); (4) modeling dependence between dropout masks from different layers is non-trivial.
The recently proposed Generative Flow Networks (GFlowNets) (Bengio et al., 2021a;b) frame the problem of generating discrete objects as a control problem based on the sequential construction of discrete components. GFlowNets learn probabilistic policies that sample objects proportionally to a reward function (or exp(-energy)). They have demonstrated better generalization to multimodal distributions (Nica et al., 2022) and have lower gradient variance compared with policy-gradient-based variational methods (Malkin et al., 2023), making them an interesting choice for posterior inference over dropout masks.
Contributions. In this work, to address the limitations
of standard variational inference, we develop a GFlowNet-
based binary dropout mask generator which we refer to as
GFlowOut, to estimate the posterior distribution of binary
dropout masks. GFlowOut generates dropout masks for a
layer, conditioned on masks generated for the previous layer,
therefore accounting for inter-layer dropout dependence.
Furthermore, the GFlowOut estimator can be conditioned on the data point: GFlowOut improves posterior estimation by utilizing both the input covariates and the labels in the training set of supervised learning tasks via an auxiliary variational function. To investigate the quality of the poste-
rior distribution learned by GFlowOut, we design empirical
experiments, including evaluating robustness to distribution
shift during inference, detecting out-of-distribution exam-
ples with uncertainty estimates, and transfer learning, using
both benchmark datasets and a real-world clinical dataset.
2. Related work
2.1. Dropout as a Bayesian approximation
Deep learning has shown tremendous power in a wide range of applications. However, traditional deep learning models lack mechanisms to capture uncertainty, which is of crucial importance in many fields. Uncertainty quantification (UQ) is therefore studied extensively as a fundamental problem in deep learning, and a large number of Bayesian deep learning tools have emerged in recent years. For example, Gal & Ghahramani (2016) showed that training deep learning models with dropout can be cast as approximate Bayesian inference in deep Gaussian processes, allowing uncertainty estimation without extra computational cost. Kingma et al.
(2015) proposed variational dropout, where a dropout pos-
terior over parameters is learned by treating dropout reg-
ularization as approximate inference in deep models. Gal
et al. (2017) developed a continuous relaxation of discrete
dropout masks to improve uncertainty estimation, especially
in reinforcement learning settings. Lee et al. (2020) intro-
duced “meta-dropout”, which involves an additional global
term shared across all data points during inference to im-
prove generalization. Xie et al. (2019) replaced the hard dropout mask, which follows a Bernoulli distribution, with a soft mask following a beta distribution, and conducted the optimization using a stochastic gradient variational Bayes algorithm to control the dropout rate. Boluki et al. (2020) combined a model-agnostic dropout scheme with variational auto-encoders (VAEs), resulting in semi-implicit VAE models. Instead of using the mean-field family for variational inference, Nguyen et al. (2021) utilized a structured representation of multiplicative Gaussian noise for better posterior
estimation. More recently, Fan et al. (2021) developed “con-
textual dropout”, which optimizes variational objectives in
a sample-dependent manner and, to the best of our knowl-
edge, is the closest approach to GFlowOut in the litera-
ture. GFlowOut differs from contextual dropout in several
aspects. First, both methods take trainable priors into ac-
count, but GFlowNet also takes into account priors that
depend on the input covariate of each data point. Second,
the variational posterior of contextual dropout only depends
on the input covariate (
x
), while in GFlowOut, the varia-
tional posterior is also conditioned on the label
y
, which
provides more information for training. Third, within each
neural network layer, the mask of contextual dropout is con-
ditioned on previous masks implicitly, while the mask of
GFlowOut is conditioned on previous masks explicitly by
directly feeding previous masks as inputs into the generator,
which improves the training process. Finally, instead of a
REINFORCE-based gradient estimator used for contextual
dropout training, GFlowOut employs powerful GFlowNets
for the variational posterior.
2.2. Generative flow networks
Generative flow networks (GFlowNets) (Bengio et al., 2021a;b) are a family of probabilistic models that amortize the sampling of discrete compositional objects in proportion to a given unnormalized density function. GFlowNets learn a stochastic policy to construct objects through a sequence of actions, akin to deep reinforcement learning (Sutton & Barto, 2018). GFlowNets are trained so as to make the likelihood of reaching a terminating state proportional to the reward. Recent works have shown close connections of GFlowNets to other generative models (Zhang et al., 2022a) and to hierarchical variational inference (Malkin et al., 2023). GFlowNets have achieved great empirical success in learning energy-based models (Zhang et al., 2022b), small-molecule generation (Bengio et al., 2021a; Nica et al., 2022; Malkin et al., 2022; Madan et al., 2023; Pan et al., 2023), biological sequence generation (Malkin et al., 2022; Jain et al., 2022; Madan et al., 2023), and structure learning (Deleu et al., 2022). Several training objectives have been proposed for GFlowNets, including Flow Matching (FM) (Bengio et al., 2021a), Detailed Balance (DB) (Bengio et al., 2021b), Trajectory Balance (TB) (Malkin et al., 2022), and the more recent Sub-Trajectory Balance (SubTB) (Madan et al., 2023). In this work, we use the Trajectory Balance (TB) objective.
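For reference, the Trajectory Balance objective compares the forward-policy probability of a complete trajectory, scaled by a learned partition function $Z$, against the reward of the terminal object times a backward-policy probability (Malkin et al., 2022). A minimal per-trajectory sketch is given below; the inputs are placeholders for whatever parameterization a particular model uses.

```python
import torch

def trajectory_balance_loss(log_pf, log_pb, log_reward, log_Z):
    """Per-trajectory Trajectory Balance loss (Malkin et al., 2022), sketched
    with placeholder inputs.

    log_pf:     per-step log P_F(s_i | s_{i-1}) along one complete trajectory
    log_pb:     per-step log P_B(s_{i-1} | s_i) (often uniform over parents)
    log_reward: scalar log R(z) of the terminal object
    log_Z:      learned scalar log-partition-function parameter
    """
    forward = log_Z + log_pf.sum()
    backward = log_reward + log_pb.sum()
    return (forward - backward) ** 2
```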
3. Method
In this section, we define the problem setting and the mathematical notation used in this study, and describe the proposed method, GFlowOut, for dropout mask generation in detail.
3.1. Background and notation
Dropout. In a vanilla feed-forward neural network (MLP) with $L$ layers, each layer of the model has a weight matrix $w_l$ and a bias vector $b_l$. It takes as input the activations $h_{l-1}$ from the previous layer, with layer index $l-1$, and computes as output $h_l = \sigma(w_l h_{l-1} + b_l)$, where $\sigma$ is a non-linear activation function. Dropout consists of dropping out units from the output of a layer. Formally, this can be described as applying a sampled binary mask $z_l \sim p(z_l)$ to the output of the layer, $h_l = z_l \odot \sigma(w_l h_{l-1} + b_l)$, at each layer in the model. In regular random dropout, $z_l$ is a collection of i.i.d. $\mathrm{Bernoulli}(r)$ variables, where $r$ is a fixed parameter for all the layers. Recently, several approaches have been proposed to learn $p(z_l)$ along with the model parameters. In these approaches, $z$ is viewed either as a latent variable or as part of the model parameters. We consider two variants of our proposed method: GFlowOut, where the dropout masks $z$ are viewed as sample-dependent latent variables, and ID-GFlowOut, which generates masks in a sample-independent manner where $z$ is viewed as a part of the model parameters shared across all samples. Next, we briefly introduce GFlowNets and describe how they model the dropout masks $z$ given the data.
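As a minimal sketch of this notation, the layer below computes $h_l = z_l \odot \sigma(w_l h_{l-1} + b_l)$ with a sampled Bernoulli mask whose keep-probabilities are learnable parameters. This illustrates the generic setup of learning $p(z_l)$ alongside the model, not GFlowOut itself; the use of ReLU for $\sigma$ and the per-unit parameterization are assumptions made for the example.

```python
import torch
import torch.nn as nn

class LearnedDropoutLayer(nn.Module):
    """One MLP layer computing h_l = z_l * sigma(W_l h_{l-1} + b_l),
    where z_l ~ Bernoulli(r_l) and the keep-probabilities r_l are learnable
    per-unit parameters. A generic sketch, not GFlowOut itself."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Unconstrained logits; a sigmoid maps them to keep-probabilities r_l.
        self.keep_logits = nn.Parameter(torch.zeros(out_features))

    def forward(self, h_prev):
        pre_mask = torch.relu(self.linear(h_prev))   # sigma(W_l h_{l-1} + b_l), ReLU assumed
        r = torch.sigmoid(self.keep_logits)          # per-unit keep-probabilities
        z = torch.bernoulli(r.expand_as(pre_mask))   # sampled binary mask z_l
        return z * pre_mask
```

Note that sampling $z_l$ with `torch.bernoulli` blocks gradients to the keep-probabilities, which is exactly the discreteness issue listed in Section 1; different methods work around it with continuous relaxations, score-function estimators, or, as proposed here, GFlowNets.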
GFlowNets. Let $G = (\mathcal{S}, \mathcal{A})$ be a directed acyclic graph (DAG) where the vertices $s \in \mathcal{S}$ are states, including a special initial state $s_0$ with no incoming edges, and the directed edges $(s \rightarrow s') \in \mathcal{A}$ are actions. $\mathcal{X} \subseteq \mathcal{S}$ denotes the set of terminal states, which have no outgoing edges. A complete trajectory $\tau = (s_0 \rightarrow \dots \rightarrow s_{i-1} \rightarrow s_i \rightarrow \dots \rightarrow z) \in \mathcal{T}$ in $G$ is a sequence of states starting at $s_0$ and terminating at $z \in \mathcal{X}$, where each $(s_{i-1} \rightarrow s_i) \in \mathcal{A}$. The forward policy $P_F(\cdot \mid s)$ is a collection of distributions over the children of each non-terminal node $s \in \mathcal{S}$ and defines a distribution over complete trajectories, $P_F(\tau) = \prod_{(s_{i-1} \rightarrow s_i) \in \tau} P_F(s_i \mid s_{i-1})$. We can sample terminal states $z \in \mathcal{X}$ by sampling trajectories following $P_F$. Let $\pi(z)$ be the marginal likelihood of sampling the terminal state $z$, $\pi(z) = \sum_{\tau = (s_0 \rightarrow \dots \rightarrow z) \in \mathcal{T}} P_F(\tau)$. Given a non-negative reward function $R : \mathcal{X} \rightarrow \mathbb{R}_+$, the learning problem tackled in GFlowNets is to estimate $P_F$ such that $\pi(z) \propto R(z)$ for all $z \in \mathcal{X}$.
We refer the reader to Bengio et al. (2021b) for a more detailed treatment of GFlowNets.
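To make the trajectory notation concrete, the sketch below builds a binary dropout mask one unit at a time with a forward policy, which is one natural way to cast mask generation as sequential construction over a DAG of partial masks. The state encoding, network sizes, and per-coordinate action ordering are illustrative assumptions, not the architecture used in GFlowOut.

```python
import torch
import torch.nn as nn

class MaskForwardPolicy(nn.Module):
    """Illustrative forward policy P_F that constructs a binary mask z of length
    `num_units` one coordinate at a time, so a complete trajectory is
    s_0 -> s_1 -> ... -> z. The state is the partial mask plus a one-hot marker
    of the coordinate currently being decided."""

    def __init__(self, num_units, hidden=64):
        super().__init__()
        self.num_units = num_units
        self.net = nn.Sequential(
            nn.Linear(2 * num_units, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def sample(self):
        mask = torch.zeros(self.num_units)     # partially constructed mask (state)
        log_pf = torch.tensor(0.0)
        for i in range(self.num_units):        # one action per unit: keep (1) or drop (0)
            pos = torch.zeros(self.num_units)
            pos[i] = 1.0
            p_keep = torch.sigmoid(self.net(torch.cat([mask, pos]))).squeeze()
            z_i = torch.bernoulli(p_keep)
            log_pf = log_pf + z_i * torch.log(p_keep + 1e-8) \
                            + (1 - z_i) * torch.log(1 - p_keep + 1e-8)
            mask[i] = z_i
        return mask, log_pf                    # terminal mask z and log P_F(tau)
```

The returned $\log P_F(\tau)$ is what an objective such as Trajectory Balance would combine with a reward and a learned $\log Z$; in GFlowOut, the analogous policy is additionally conditioned on the input $x_i$ (and, during training, the label $y_i$) as well as on the masks of previous layers (Figure 1).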