
in reinforcement learning settings. Lee et al. (2020) introduced “meta-dropout”, which involves an additional global term shared across all data points during inference to improve generalization. Xie et al. (2019) replaced the hard dropout mask following a Bernoulli distribution with a soft mask following a beta distribution, and controlled the dropout rate by optimizing with a stochastic gradient variational Bayesian algorithm. Boluki et al. (2020) combined a model-agnostic dropout scheme with variational auto-encoders (VAEs), resulting in semi-implicit VAE models. Instead of using a mean-field family for variational inference, Nguyen et al. (2021) utilized a structured representation of multiplicative Gaussian noise for better posterior estimation. More recently, Fan et al. (2021) developed “contextual dropout”, which optimizes variational objectives in a sample-dependent manner and, to the best of our knowledge, is the closest approach to GFlowOut in the literature.
GFlowOut differs from contextual dropout in several aspects. First, while both methods use trainable priors, GFlowOut additionally considers priors that depend on the input covariate of each data point. Second, the variational posterior of contextual dropout depends only on the input covariate $x$, while in GFlowOut the variational posterior is also conditioned on the label $y$, which provides more information for training. Third, within each neural network layer, the mask of contextual dropout is conditioned on previous masks only implicitly, whereas the mask of GFlowOut is conditioned on previous masks explicitly, by feeding the previous masks directly as inputs into the generator, which improves training. Finally, instead of the REINFORCE-based gradient estimator used to train contextual dropout, GFlowOut employs GFlowNets to learn the variational posterior.
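As an illustration of this conditioning structure, the sketch below shows a per-layer mask generator that takes the input covariate $x$, a vector encoding of the label $y$, and the previously sampled masks as explicit inputs. This is our own minimal example, not the architecture used in the paper; all module and dimension names are hypothetical.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Hypothetical sketch of a per-layer mask posterior q(z_l | x, y, z_{<l}).

    Unlike a posterior conditioned on x alone, this generator also receives
    the label encoding y and the previously sampled masks z_{<l} explicitly.
    """

    def __init__(self, x_dim, y_dim, prev_mask_dim, mask_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim + prev_mask_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mask_dim),
        )

    def forward(self, x, y, prev_masks):
        # Concatenate input covariate, label encoding, and previous masks.
        logits = self.net(torch.cat([x, y, prev_masks], dim=-1))
        probs = torch.sigmoid(logits)
        # Sample a binary dropout mask for the current layer.
        z_l = torch.bernoulli(probs)
        return z_l, probs
```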
2.2. Generative flow networks
Generative flow networks (GFlowNets) (Bengio et al., 2021a;b) are a family of probabilistic models that amortize the sampling of discrete compositional objects in proportion to a given unnormalized density function. GFlowNets learn a stochastic policy that constructs objects through a sequence of actions, akin to deep reinforcement learning (Sutton & Barto, 2018). They are trained so that the likelihood of reaching a terminating state is proportional to its reward. Recent work has shown close connections of GFlowNets to other generative models (Zhang et al., 2022a) and to hierarchical variational inference (Malkin et al., 2023). GFlowNets have achieved empirical success in learning energy-based models (Zhang et al., 2022b), small-molecule generation (Bengio et al., 2021a; Nica et al., 2022; Malkin et al., 2022; Madan et al., 2023; Pan et al., 2023), biological sequence generation (Malkin et al., 2022; Jain et al., 2022; Madan et al., 2023), and structure learning (Deleu et al., 2022). Several training objectives have been proposed for GFlowNets, including Flow Matching (FM) (Bengio et al., 2021a), Detailed Balance (DB) (Bengio et al., 2021b), Trajectory Balance (TB) (Malkin et al., 2022), and the more recent Sub-Trajectory Balance (SubTB) (Madan et al., 2023). In this work, we use the Trajectory Balance (TB) objective.
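For concreteness, the snippet below is a minimal sketch of the TB loss for a single trajectory, assuming the per-step forward and backward log-probabilities have already been collected; the function and argument names are ours, not from the paper.

```python
import torch

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Trajectory Balance loss for one trajectory (Malkin et al., 2022).

    log_Z:      learned estimate of the log partition function (scalar tensor)
    log_pf:     log P_F(s_i | s_{i-1}) for each step of the trajectory
    log_pb:     log P_B(s_{i-1} | s_i) for each step of the trajectory
    log_reward: log R(z) of the terminal state reached by the trajectory
    """
    return (log_Z + log_pf.sum() - log_reward - log_pb.sum()) ** 2
```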
3. Method
In this section, we define the problem setting and mathematical notation used in this study, and describe the proposed method, GFlowOut, for dropout mask generation in detail.
3.1. Background and notation
Dropout. In a vanilla feed-forward neural network (MLP) with $L$ layers, layer $l$ has weight matrix $w_l$ and bias vector $b_l$. It takes as input the activations $h_{l-1}$ of the previous layer and computes as output $h_l = \sigma(w_l h_{l-1} + b_l)$, where $\sigma$ is a non-linear activation function. Dropout consists of dropping out units from the output of a layer. Formally, this can be described as applying a sampled binary mask $z_l \sim p(z_l)$ to the output of each layer, $h_l = z_l \circ \sigma(w_l h_{l-1} + b_l)$. In regular random dropout, $z_l$ is a collection of i.i.d. $\mathrm{Bernoulli}(r)$ variables, where $r$ is a fixed parameter shared by all layers. Recently, several approaches have been proposed to learn $p(z_l)$ along with the model parameters. In these approaches, $z$ is viewed either as a latent variable or as part of the model parameters. We consider two variants of our proposed method: GFlowOut, where the dropout masks $z$ are viewed as sample-dependent latent variables, and ID-GFlowOut, which generates masks in a sample-independent manner, where $z$ is viewed as part of the model parameters shared across all samples. Next, we briefly introduce GFlowNets and describe how they model the dropout masks $z$ given the data.
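The following is a minimal sketch of a single masked MLP layer implementing $h_l = z_l \circ \sigma(w_l h_{l-1} + b_l)$; when no mask is supplied it falls back to regular random dropout with i.i.d. $\mathrm{Bernoulli}(r)$ masks. The class and argument names are ours and serve only to illustrate the notation.

```python
import torch
import torch.nn as nn

class MaskedLayer(nn.Module):
    """One MLP layer whose output is gated by a binary dropout mask:
    h_l = z_l * relu(w_l @ h_{l-1} + b_l)."""

    def __init__(self, in_dim, out_dim, r=0.5):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.r = r  # probability that a unit is kept, fixed across layers

    def forward(self, h_prev, z=None):
        h = torch.relu(self.linear(h_prev))  # sigma(w_l h_{l-1} + b_l)
        if z is None:
            # Regular random dropout: z_l ~ i.i.d. Bernoulli(r).
            z = torch.bernoulli(torch.full_like(h, self.r))
        # Learned approaches (e.g., GFlowOut) would instead pass in a mask z
        # sampled from a learned distribution p(z_l).
        return z * h
```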
GFlowNets. Let $G = (\mathcal{S}, \mathcal{A})$ be a directed acyclic graph (DAG) whose vertices $s \in \mathcal{S}$ are states, including a special initial state $s_0$ with no incoming edges, and whose directed edges $(s \rightarrow s') \in \mathcal{A}$ are actions. $\mathcal{X} \subseteq \mathcal{S}$ denotes the set of terminal states, which have no outgoing edges. A complete trajectory $\tau = (s_0 \rightarrow \dots \rightarrow s_{i-1} \rightarrow s_i \rightarrow \dots \rightarrow z) \in \mathcal{T}$ in $G$ is a sequence of states starting at $s_0$ and terminating at some $z \in \mathcal{X}$, where each $(s_{i-1} \rightarrow s_i) \in \mathcal{A}$. The forward policy $P_F(- \mid s)$ is a collection of distributions over the children of each non-terminal node $s \in \mathcal{S}$ and defines a distribution over complete trajectories, $P_F(\tau) = \prod_{(s_{i-1} \rightarrow s_i) \in \tau} P_F(s_i \mid s_{i-1})$. We can sample terminal states $z \in \mathcal{X}$ by sampling trajectories following $P_F$. Let $\pi(z)$ denote the marginal likelihood of sampling terminal state $z$, i.e., $\pi(z) = \sum_{\tau = (s_0 \rightarrow \dots \rightarrow z) \in \mathcal{T}} P_F(\tau)$. Given a non-negative reward function $R : \mathcal{X} \rightarrow \mathbb{R}^+$, the learning problem tackled in GFlowNets is to estimate $P_F$ such that $\pi(z) \propto R(z)$ for all $z \in \mathcal{X}$.
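As a toy illustration of how a forward policy induces a distribution over terminal states, the sketch below constructs a binary mask one unit at a time, accumulating $\log P_F(\tau) = \sum_i \log P_F(s_i \mid s_{i-1})$ along the way. This is our own example, under the assumption that the policy is a network mapping a partially built mask to per-unit probabilities; it is not the paper's implementation.

```python
import torch
from torch.distributions import Bernoulli

def sample_mask_trajectory(policy_net, mask_dim):
    """Sample a complete trajectory s_0 -> ... -> z that builds a binary mask.

    The state is the partially built mask (-1 marks undecided positions);
    each action sets the next position to 0 or 1; the terminal state z is
    the completed mask.  Returns z and log P_F(tau)."""
    state = -torch.ones(mask_dim)                    # s_0: nothing decided yet
    log_pf = torch.zeros(())
    for i in range(mask_dim):
        p_one = torch.sigmoid(policy_net(state))[i]  # P_F of setting unit i to 1
        dist = Bernoulli(probs=p_one)
        bit = dist.sample()
        log_pf = log_pf + dist.log_prob(bit)
        state = state.clone()
        state[i] = bit                               # move to the child state
    return state, log_pf                             # terminal state z and log P_F(tau)
```

With, for instance, policy_net = torch.nn.Linear(mask_dim, mask_dim), training would then adjust $P_F$ so that the marginal $\pi(z)$ over such trajectories becomes proportional to the reward $R(z)$.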
We refer the reader to Bengio