Generative Augmented Flow Networks
Ling Pan¹  Dinghuai Zhang¹  Aaron Courville¹  Longbo Huang²  Yoshua Bengio¹
¹Mila, Université de Montréal   ²Tsinghua University
Abstract
The Generative Flow Network (Bengio et al.,2021b, GFlowNet) is a probabilistic framework where an agent
learns a stochastic policy for object generation, such that the probability of generating an object is proportional
to a given reward function. Its effectiveness has been shown in discovering high-quality and diverse solutions,
compared to reward-maximizing reinforcement learning-based methods. Nonetheless, GFlowNets only learn from rewards of the terminal states, which can limit their applicability. Indeed, intermediate rewards play a critical role in learning; for example, intrinsic motivation can provide intermediate feedback even in particularly challenging sparse reward tasks. Inspired by this, we propose Generative Augmented Flow
Networks (GAFlowNets), a novel learning framework to incorporate intermediate rewards into GFlowNets.
We specify intermediate rewards by intrinsic motivation to tackle the exploration problem in sparse reward
environments. GAFlowNets can leverage edge-based and state-based intrinsic rewards in a joint way to
improve exploration. Based on extensive experiments on the GridWorld task, we demonstrate the effectiveness
and efficiency of GAFlowNet in terms of convergence, performance, and diversity of solutions. We further
show that GAFlowNet is scalable to a more complex and large-scale molecule generation domain, where it
achieves consistent and significant performance improvement.
1 Introduction
Deep reinforcement learning (RL) has achieved significant progress in recent years with particular
success in games (Mnih et al.,2015,Silver et al.,2016,Vinyals et al.,2019). RL methods applied to
the setting where a reward is only given at the end (i.e., terminal states) typically aim at maximizing
that reward function for learning the optimal policy. However, diversity of the generated states is
desirable in a wide range of practical scenarios including molecule generation (Bengio et al.,2021a),
biological sequence design (Jain et al.,2022b), recommender systems (Kunaver and Požrl,2017),
dialogue systems (Zhang et al.,2020), etc. For example, in molecule generation, the reward function
used in in-silico simulations can be uncertain and imperfect itself (compared to the more expensive
in-vivo experiments). Therefore, it is not sufficient to search only for the solution that maximizes the return. Instead, it is desirable to sample many high-reward candidates, which can be achieved by sampling them proportionally to the reward of each terminal state.
Interestingly, GFlowNets (Bengio et al.,2021a,b) learn a stochastic policy to sample compos-
ite objects x∈ X with probability proportional to the return R(x). The learning paradigm of
GFlowNets is different from other RL methods, as it is explicitly aiming at modeling the diversity
in the target distribution, i.e., all the modes of the reward function. This makes it natural for prac-
tical applications where the model should discover objects that are both interesting and diverse,
which is a focus of previous GFlowNet works (Bengio et al.,2021a,b,Jain et al.,2022b,Malkin et al.,
2022).
Correspondence: penny.ling.pan@gmail.com
Yet, GFlowNets only learn from the reward of the terminal state, and do not consider intermediate rewards, which can limit their applicability, especially in more general RL settings. Rewards
play a critical role in learning (Silver et al.,2021). The tremendous success of RL largely depends
on the reward signals that provide intermediate feedback. Even in environments with sparse re-
wards, RL agents can motivate themselves for efficient exploration by intrinsic motivation, which
augments the sparse extrinsic learning signal with a dense intrinsic reward at each step. Our focus
in this paper is precisely on introducing such intermediate intrinsic rewards in GFlowNets, since
they can be applied even in settings where the extrinsic reward is sparse (say non-zero only on a
few terminal states).
Inspired by this missing element of GFlowNets, we propose a new GFlowNet learning frame-
work that takes intermediate feedback signals into account to provide an exploration incentive
during training. The notion of flow in GFlowNets (Bengio et al.,2021a,b) refers to a marginalized
quantity that sums rewards over all downstream terminal states following a given state, while shar-
ing that reward with other states leading to the same terminal states. Apart from the existing flows
in the network, we introduce augmented flows as intermediate rewards. By using intrinsic motivation as intermediate rewards, our new framework is well suited for sparse reward tasks, where GFlowNet training can get trapped in a few modes since it may be difficult to discover new modes beyond those already visited (Bengio et al., 2021b).
We first propose an edge-based augmented flow, based on the incorporation of an intrinsic
reward at each transition. However, we find that although it improves learning efficiency, it only
performs local exploration and still lacks sufficient exploration ability to drive the agent to visit
solutions with zero rewards. On the other hand, we find that incorporating intermediate rewards
in a state-based manner (Bengio et al.,2021b) can result in slower convergence and large bias em-
pirically, although it can explore more broadly. Therefore, we propose a joint way to take both
edge-based and state-based augmented flows into account. Our method can improve the diversity
of solutions and learning efficiency by reaping the best from both worlds. Extensive experiments on
the GridWorld and molecule domains that are already used to benchmark GFlowNets corroborate
the effectiveness of our proposed framework.
The main contributions of this paper are summarized as follows:
• We propose a novel GFlowNet learning framework, dubbed Generative Augmented Flow Networks (GAFlowNet), to incorporate intermediate rewards, which are represented by augmented flows in the flow network.
• We specify intermediate rewards by intrinsic motivation to deal with the exploration of state space for GFlowNets in sparse reward tasks. We theoretically prove that our augmented objective asymptotically yields an unbiased solution to the original formulation.
• We conduct extensive experiments on the GridWorld domain, demonstrating the effectiveness of our method in terms of convergence, diversity, and performance. Our method is also general, being applicable to different types of GFlowNets. We further extend our method to the larger-scale and more challenging molecule generation task, where our method achieves consistent and substantial improvements over strong baselines.
2 Background
Consider a directed acyclic graph (DAG) $G = (\mathcal{S}, \mathcal{A})$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}$ represents the action space, which is a subset of $\mathcal{S} \times \mathcal{S}$. We denote the vertex $s_0 \in \mathcal{S}$ with no incoming edges as the initial state, while the vertex $s_f$ without outgoing edges is called the sink state,
and state-action pairs correspond to edges. The goal for GFlowNets is to learn a stochastic policy $\pi$ that constructs discrete objects $x \in \mathcal{X}$ with probability proportional to the reward function $R: \mathcal{X} \to \mathbb{R}_{\geq 0}$, i.e., $\pi(x) \propto R(x)$. GFlowNets construct objects sequentially, where each step adds an element to the construction. We call the resulting sequence of state transitions from the initial state to a terminal state $\tau = (s_0 \to \cdots \to s_n)$ a trajectory, where $\tau \in \mathcal{T}$ with $\mathcal{T}$ denoting the set of trajectories. Bengio et al. (2021a) define a trajectory flow $F: \mathcal{T} \to \mathbb{R}_{\geq 0}$. Let $F(s) = \sum_{\tau \ni s} F(\tau)$ define the state flow for any state $s$, and let $F(s \to s') = \sum_{\tau \ni s \to s'} F(\tau)$ define the edge flow for any edge $s \to s'$. The trajectory flow induces a probability measure $P(\tau) = \frac{F(\tau)}{Z}$, where $Z = \sum_{\tau \in \mathcal{T}} F(\tau)$ denotes the total flow. We then define the corresponding forward policy $P_F(s'|s) = \frac{F(s \to s')}{F(s)}$ and the backward policy $P_B(s|s') = \frac{F(s \to s')}{F(s')}$. The flows can be considered as the amount of water flowing through edges (like pipes) or states (like tees connecting pipes) (Malkin et al., 2022), with $R(x)$ the amount of water through terminal state $x$, and $P_F(s'|s)$ the relative amount of water flowing in edges outgoing from $s$.
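To make these definitions concrete, here is a small illustrative sketch (not from the paper) that derives the forward and backward policies from hand-specified edge flows on a toy DAG; the graph and the flow values are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge flows F(s -> s') on a tiny DAG with initial state "s0"
# and terminal states "x1", "x2" (values chosen only for illustration).
edge_flow = {
    ("s0", "a"): 3.0, ("s0", "b"): 1.0,
    ("a", "x1"): 2.0, ("a", "x2"): 1.0,
    ("b", "x2"): 1.0,
}

# State flow F(s): sum of incoming (or, for a consistent flow, outgoing) edge flows.
out_flow, in_flow = defaultdict(float), defaultdict(float)
for (s, s2), f in edge_flow.items():
    out_flow[s] += f
    in_flow[s2] += f

# Forward policy P_F(s'|s) = F(s->s') / F(s); backward policy P_B(s|s') = F(s->s') / F(s').
P_F = {(s, s2): f / out_flow[s] for (s, s2), f in edge_flow.items()}
P_B = {(s, s2): f / in_flow[s2] for (s, s2), f in edge_flow.items()}

print(P_F[("s0", "a")])  # 0.75: three quarters of the flow leaving s0 goes to a
print(P_B[("a", "x2")])  # 0.5: half of the flow entering x2 comes from a
```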
2.1 GFlowNets training criterion
We call a flow consistent if it satisfies the flow matching constraint for all internal states $s$, i.e., $\sum_{s'' \to s} F(s'' \to s) = F(s) = \sum_{s \to s'} F(s \to s')$, which means that the incoming flows equal the outgoing flows. Bengio et al. (2021a) prove that for a consistent flow $F$ where the terminal flow is set to be the reward, the forward policy samples objects $x$ with probability proportional to $R(x)$.
Flow matching (FM). Bengio et al. (2021a) propose to approximate the edge flow by a model $F_\theta(s, s')$ parameterized by $\theta$ following the FM objective, i.e., $\mathcal{L}_{FM}(s) = \left(\log \sum_{(s'' \to s) \in \mathcal{A}} F_\theta(s'', s) - \log \sum_{(s \to s') \in \mathcal{A}} F_\theta(s, s')\right)^2$ for non-terminal states. At terminal states, a similar objective encourages the incoming flow to match the corresponding reward. The objective is optimized using trajectories sampled from a training policy $\pi$ with full support, such as a tempered version of $P_{F_\theta}$ or a mixture of $P_{F_\theta}$ with a uniform policy $U$, i.e., $\pi_\theta = (1 - \epsilon) P_{F_\theta} + \epsilon \cdot U$. This is similar to $\epsilon$-greedy and entropy-regularized strategies in RL to improve exploration. Bengio et al. (2021a) prove that if we reach a global minimum of the expected loss function and the training policy $\pi_\theta$ has full support, then GFlowNet samples from the target distribution.
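As an illustration, here is a minimal PyTorch-style sketch of the FM loss for a single non-terminal state; the tensors of parent and child log edge-flow predictions are assumed to be produced elsewhere by a model $F_\theta$, and the names and example values are hypothetical.

```python
import torch

def flow_matching_loss(log_inflow_terms: torch.Tensor,
                       log_outflow_terms: torch.Tensor) -> torch.Tensor:
    """Squared difference between log total inflow and log total outflow of a state.

    log_inflow_terms:  log F_theta(s'', s) for every parent edge (s'' -> s)
    log_outflow_terms: log F_theta(s, s')  for every child edge  (s -> s')
    """
    log_in = torch.logsumexp(log_inflow_terms, dim=-1)
    log_out = torch.logsumexp(log_outflow_terms, dim=-1)
    return (log_in - log_out) ** 2

# Hypothetical predicted log edge flows for one internal state.
loss = flow_matching_loss(torch.log(torch.tensor([2.0, 1.0])),
                          torch.log(torch.tensor([1.5, 1.5])))
print(loss)  # ~0, since total inflow (3.0) matches total outflow (3.0)
```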
Detailed balance (DB). Bengio et al. (2021b) propose the DB objective to avoid the computationally expensive summing operation over the parents or children of states. For learning based on DB, we train a neural network with a state flow model $F_\theta$, a forward policy model $P_{F_\theta}(\cdot|s)$, and a backward policy model $P_{B_\theta}(\cdot|s)$ parameterized by $\theta$. The optimization objective is to minimize $\mathcal{L}_{DB}(s, s') = \left(\log\left(F_\theta(s) P_{F_\theta}(s'|s)\right) - \log\left(F_\theta(s') P_{B_\theta}(s|s')\right)\right)^2$. It also samples from the target distribution if a global minimum of the expected loss is reached and $\pi_\theta$ has full support.
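A corresponding minimal sketch of the DB loss for a single transition $s \to s'$; the inputs (log state flows and log policy probabilities) are assumed to come from the parameterized models above, and the function name is hypothetical.

```python
import torch

def detailed_balance_loss(log_F_s: torch.Tensor, log_PF_s_to_s2: torch.Tensor,
                          log_F_s2: torch.Tensor, log_PB_s2_to_s: torch.Tensor) -> torch.Tensor:
    """Squared log-ratio between the two decompositions of the edge flow of s -> s'."""
    forward_term = log_F_s + log_PF_s_to_s2    # log( F(s)  * P_F(s'|s) )
    backward_term = log_F_s2 + log_PB_s2_to_s  # log( F(s') * P_B(s|s') )
    return (forward_term - backward_term) ** 2
```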
Trajectory balance (TB). Malkin et al. (2022) propose the TB objective for faster credit assignment and learning over longer trajectories. The loss function for TB is $\mathcal{L}_{TB}(\tau) = \left(\log\left(Z_\theta \prod_{t=0}^{n-1} P_{F_\theta}(s_{t+1}|s_t)\right) - \log\left(R(x) \prod_{t=0}^{n-1} P_B(s_t|s_{t+1})\right)\right)^2$, where $Z_\theta$ is a learnable parameter.
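And a sketch of the TB loss over a full trajectory, computed in log space for numerical stability; the learnable log $Z_\theta$, the per-step log forward/backward probabilities, and the terminal log-reward are assumed to be provided by the surrounding training code.

```python
import torch

def trajectory_balance_loss(log_Z: torch.Tensor,
                            log_PF_steps: torch.Tensor,  # log P_F(s_{t+1}|s_t), t = 0..n-1
                            log_PB_steps: torch.Tensor,  # log P_B(s_t|s_{t+1}), t = 0..n-1
                            log_reward: torch.Tensor) -> torch.Tensor:
    """Squared difference between the forward and backward trajectory log-flows."""
    forward = log_Z + log_PF_steps.sum()
    backward = log_reward + log_PB_steps.sum()
    return (forward - backward) ** 2
```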
3 Related Work
GFlowNets. Since the proposal of GFlowNets (Bengio et al.,2021a), there has been an increasing
interest in improving (Bengio et al.,2021b,Malkin et al.,2022), understanding, and applying this
framework to practical scenarios. It is a general-purpose, high-level probabilistic inference framework that has enabled fruitful applications (Deleu et al., 2022, Jain et al., 2022a, Zhang et al., 2022a,b).
However, previous works only consider learning based on the terminal reward, which can make
it difficult to provide a good training signal for intermediate states, especially when the reward is
sparse (i.e., significantly non-zero in only a tiny fraction of the terminal states).
Reinforcement learning (RL). Different from GFlowNets that aim to sample proportionally
to the reward function, RL learns a reward-maximization policy. Although introducing entropy
regularization to RL (Attias,2003,Haarnoja et al.,2017,2018,Ziebart,2010) can improve diversity,
this is limited to tree-structured DAGs. This is because it can only sample a terminal state x in proportion to the sum of rewards over all trajectories leading to x. It can fail on general (non-tree) DAGs (Bengio et al., 2021a) for which the same terminal state x can be obtained through a potentially large number of trajectories (and a very different number of trajectories for different x's).
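To make the trajectory-counting argument concrete, here is a small illustrative sketch (not from the paper): it counts the distinct construction paths to each cell of a tiny grid-like DAG in which every step moves right or down from (0, 0). Different terminal states are reached by very different numbers of trajectories, which is exactly what skews a sampler that weights whole trajectories.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_trajectories(cell):
    """Number of distinct right/down paths from (0, 0) to the given cell."""
    x, y = cell
    if (x, y) == (0, 0):
        return 1
    total = 0
    if x > 0:
        total += num_trajectories((x - 1, y))  # last step was a "right" move
    if y > 0:
        total += num_trajectories((x, y - 1))  # last step was a "down" move
    return total

for cell in [(2, 0), (1, 1), (2, 2)]:
    print(cell, num_trajectories(cell))  # (2, 0): 1, (1, 1): 2, (2, 2): 6
```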
Intrinsic motivation. There has been a line of research to incorporate intrinsic motiva-
tion (Burda et al.,2018,Pathak et al.,2017,Zhang et al.,2021) for improving exploration in RL.
Yet, such ideas have not been explored with GFlowNets because the current mathematical frame-
work of GFlowNets only allows for terminal rewards, unlike the standard RL frameworks. This
deficiency as well as the potential of introducing intrinsic intermediate rewards motivates this pa-
per.
4 Generative Augmented Flow Networks
Figure 1: Comparison of GFlowNet and our augmented (GAFlowNet) method in GridWorld with sparse rewards.
The potential difficulty in learning only from the terminal reward is re-
lated to the challenge of sparse rewards in RL, where most states do
not provide an informative reward. We demonstrate the sparse reward
problem for GFlowNets and reveal interesting findings based on the
GridWorld task (as shown in Figure 4) with sparse rewards. Specifically, the agent receives a reward of +1 only when it reaches one of the 3 goals located around the corners of an H×H world (with H ∈ {64, 128}), excluding the corner of the starting state, and the reward is 0 otherwise. A more detailed description of the task can be found
in Section 5.1. We evaluate the number of modes discovered by the
GFlowNet trained with TB, following Bengio et al. (2021a). As sum-
marized in Figure 1, GFlowNet training can get trapped in a subset of
the modes. Therefore, it remains a critical challenge for GFlowNets to
efficiently learn when the reward signal is sparse and non-informative.
On the other hand, there has been recent progress with intrinsic
motivation methods (Burda et al.,2018,Pathak et al.,2017) to improve exploration of RL algorithms,
where the agent learns from both a sparse extrinsic reward and a dense intrinsic bonus at each
step. Building on this, we aim to address the exploration challenge of GFlowNets by enabling intermediate rewards, and thus intrinsic rewards, in GFlowNets.
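As one common instantiation of intrinsic motivation consistent with the cited work (Burda et al., 2018), here is a minimal random network distillation (RND) style bonus sketch. Whether GAFlowNet uses exactly this form is an assumption here, and the network sizes and names are hypothetical.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """RND-style intrinsic reward: the prediction error of a trained network against a
    fixed random target network, which tends to be large for rarely visited states."""

    def __init__(self, state_dim: int, feature_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feature_dim))
        self.predictor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feature_dim))
        for p in self.target.parameters():  # the random target network stays fixed
            p.requires_grad_(False)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            target_feat = self.target(states)
        error = (self.predictor(states) - target_feat).pow(2).mean(dim=-1)
        return error  # used as the intrinsic reward; also minimized to train the predictor
```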
We now propose our learning framework, which is dubbed Generative Augmented Flow Net-
work (GAFlowNet), to take intermediate rewards into consideration.
4.1 Edge-based intermediate reward augmentation
We start our derivation from the flow matching consistency constraint, to take advantage of the
insights brought by the water flow metaphor as discussed in Section 2. By incorporating intermediate rewards $r(s_t \to s_{t+1})$ for transitions from states $s_t$ to $s_{t+1}$ into the flow matching constraint,