Generative Augmented Flow Networks
Ling Pan¹  Dinghuai Zhang¹  Aaron Courville¹  Longbo Huang²  Yoshua Bengio¹
¹Mila, Université de Montréal   ²Tsinghua University
Abstract
The Generative Flow Network (Bengio et al.,2021b, GFlowNet) is a probabilistic framework where an agent
learns a stochastic policy for object generation, such that the probability of generating an object is proportional
to a given reward function. Its effectiveness has been shown in discovering high-quality and diverse solutions,
compared to reward-maximizing reinforcement learning-based methods. Nonetheless, GFlowNets only learn from rewards of the terminal states, which can limit their applicability. Indeed, intermediate rewards play a critical role in learning; for example, intrinsic motivation can provide intermediate feedback even in particularly challenging sparse reward tasks. Inspired by this, we propose Generative Augmented Flow
Networks (GAFlowNets), a novel learning framework to incorporate intermediate rewards into GFlowNets.
We specify intermediate rewards by intrinsic motivation to tackle the exploration problem in sparse reward
environments. GAFlowNets can leverage edge-based and state-based intrinsic rewards in a joint way to
improve exploration. Based on extensive experiments on the GridWorld task, we demonstrate the effectiveness
and efficiency of GAFlowNet in terms of convergence, performance, and diversity of solutions. We further
show that GAFlowNet is scalable to a more complex and large-scale molecule generation domain, where it
achieves consistent and significant performance improvement.
1 Introduction
Deep reinforcement learning (RL) has achieved significant progress in recent years with particular
success in games (Mnih et al.,2015,Silver et al.,2016,Vinyals et al.,2019). RL methods applied to
the setting where a reward is only given at the end (i.e., terminal states) typically aim at maximizing
that reward function for learning the optimal policy. However, diversity of the generated states is
desirable in a wide range of practical scenarios including molecule generation (Bengio et al.,2021a),
biological sequence design (Jain et al.,2022b), recommender systems (Kunaver and Požrl,2017),
dialogue systems (Zhang et al.,2020), etc. For example, in molecule generation, the reward function
used in in-silico simulations can be uncertain and imperfect itself (compared to the more expensive
in-vivo experiments). Therefore, it is not sufficient to search only for the solution that maximizes the return. Instead, it is desirable to sample many high-reward candidates, which can be achieved by sampling them proportionally to the reward of each terminal state.
Interestingly, GFlowNets (Bengio et al.,2021a,b) learn a stochastic policy to sample compos-
ite objects x∈ X with probability proportional to the return R(x). The learning paradigm of
GFlowNets is different from other RL methods, as it is explicitly aiming at modeling the diversity
in the target distribution, i.e., all the modes of the reward function. This makes it natural for prac-
tical applications where the model should discover objects that are both interesting and diverse,
which is a focus of previous GFlowNet works (Bengio et al.,2021a,b,Jain et al.,2022b,Malkin et al.,
2022).
Correspondence: penny.ling.pan@gmail.com
Yet, GFlowNets only learn from the reward of the terminal state, and do not consider intermediate rewards, which can limit their applicability, especially in more general RL settings. Rewards
play a critical role in learning (Silver et al.,2021). The tremendous success of RL largely depends
on the reward signals that provide intermediate feedback. Even in environments with sparse re-
wards, RL agents can motivate themselves for efficient exploration by intrinsic motivation, which
augments the sparse extrinsic learning signal with a dense intrinsic reward at each step. Our focus
in this paper is precisely on introducing such intermediate intrinsic rewards in GFlowNets, since
they can be applied even in settings where the extrinsic reward is sparse (say non-zero only on a
few terminal states).
Inspired by this missing element of GFlowNets, we propose a new GFlowNet learning frame-
work that takes intermediate feedback signals into account to provide an exploration incentive
during training. The notion of flow in GFlowNets (Bengio et al.,2021a,b) refers to a marginalized
quantity that sums rewards over all downstream terminal states following a given state, while shar-
ing that reward with other states leading to the same terminal states. Apart from the existing flows
in the network, we introduce augmented flows as intermediate rewards. By using intrinsic motivation as intermediate rewards, our new framework is well suited for sparse reward tasks, where GFlowNet training can get trapped in a few modes since it may be difficult to discover new modes beyond those already visited (Bengio et al., 2021b).
We first propose an edge-based augmented flow, based on the incorporation of an intrinsic
reward at each transition. However, we find that although it improves learning efficiency, it only
performs local exploration and still lacks sufficient exploration ability to drive the agent to visit
solutions with zero rewards. On the other hand, we find that incorporating intermediate rewards
in a state-based manner (Bengio et al.,2021b) can result in slower convergence and large bias em-
pirically, although it can explore more broadly. Therefore, we propose a joint way to take both
edge-based and state-based augmented flows into account. Our method can improve the diversity
of solutions and learning efficiency by reaping the best from both worlds. Extensive experiments on
the GridWorld and molecule domains that are already used to benchmark GFlowNets corroborate
the effectiveness of our proposed framework.
The main contributions of this paper are summarized as follows:
• We propose a novel GFlowNet learning framework, dubbed Generative Augmented Flow Networks (GAFlowNet), to incorporate intermediate rewards, which are represented by augmented flows in the flow network.
• We specify intermediate rewards by intrinsic motivation to deal with the exploration of state space for GFlowNets in sparse reward tasks. We theoretically prove that our augmented objective asymptotically yields an unbiased solution to the original formulation.
• We conduct extensive experiments on the GridWorld domain, demonstrating the effectiveness of our method in terms of convergence, diversity, and performance. Our method is also general, being applicable to different types of GFlowNets. We further extend our method to the larger-scale and more challenging molecule generation task, where our method achieves consistent and substantial improvements over strong baselines.
2 Background
Consider a directed acyclic graph (DAG) $G = (\mathcal{S}, \mathcal{A})$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}$ represents the action space, which is a subset of $\mathcal{S} \times \mathcal{S}$. We denote the vertex $s_0 \in \mathcal{S}$ with no incoming edges as the initial state, while the vertex $s_f$ without outgoing edges is called the sink state,
and state-action pairs correspond to edges. The goal for GFlowNets is to learn a stochastic policy $\pi$ that constructs discrete objects $x \in \mathcal{X}$ with probability proportional to the reward function $R: \mathcal{X} \to \mathbb{R}_{\geq 0}$, i.e., $\pi(x) \propto R(x)$. GFlowNets construct objects sequentially, where each step adds an element to the construction. We call the resulting sequence of state transitions from the initial state to a terminal state $\tau = (s_0 \to \cdots \to s_n)$ a trajectory, where $\tau \in \mathcal{T}$ with $\mathcal{T}$ denoting the set of trajectories. Bengio et al. (2021a) define a trajectory flow $F: \mathcal{T} \to \mathbb{R}_{\geq 0}$. Let $F(s) = \sum_{\tau \ni s} F(\tau)$ define the state flow for any state $s$, and let $F(s \to s') = \sum_{\tau \ni s \to s'} F(\tau)$ define the edge flow for any edge $s \to s'$. The trajectory flow induces a probability measure $P(\tau) = \frac{F(\tau)}{Z}$, where $Z = \sum_{\tau \in \mathcal{T}} F(\tau)$ denotes the total flow. We then define the corresponding forward policy $P_F(s'|s) = \frac{F(s \to s')}{F(s)}$ and the backward policy $P_B(s|s') = \frac{F(s \to s')}{F(s')}$. The flows can be considered as the amount of water flowing through edges (like pipes) or states (like tees connecting pipes) (Malkin et al., 2022), with $R(x)$ the amount of water through terminal state $x$, and $P_F(s'|s)$ the relative amount of water flowing in edges outgoing from $s$.
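To make these definitions concrete, here is a small illustrative sketch (not from the paper) that derives the forward and backward policies from hand-specified edge flows on a toy DAG; the graph and the flow values are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge flows F(s -> s') on a tiny DAG with initial state "s0"
# and terminal states "x1", "x2" (values chosen only for illustration).
edge_flow = {
    ("s0", "a"): 3.0, ("s0", "b"): 1.0,
    ("a", "x1"): 2.0, ("a", "x2"): 1.0,
    ("b", "x2"): 1.0,
}

# State flow F(s): sum of incoming (or, for a consistent flow, outgoing) edge flows.
out_flow, in_flow = defaultdict(float), defaultdict(float)
for (s, s2), f in edge_flow.items():
    out_flow[s] += f
    in_flow[s2] += f

# Forward policy P_F(s'|s) = F(s->s') / F(s); backward policy P_B(s|s') = F(s->s') / F(s').
P_F = {(s, s2): f / out_flow[s] for (s, s2), f in edge_flow.items()}
P_B = {(s, s2): f / in_flow[s2] for (s, s2), f in edge_flow.items()}

print(P_F[("s0", "a")])  # 0.75: three quarters of the flow leaving s0 goes to a
print(P_B[("a", "x2")])  # 0.5: half of the flow entering x2 comes from a
```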
2.1 GFlowNets training criterion
We call a flow consistent if it satisfies the flow matching constraint for all internal states $s$, i.e., $\sum_{s'' \to s} F(s'' \to s) = F(s) = \sum_{s \to s'} F(s \to s')$, which means that the incoming flows equal the outgoing flows. Bengio et al. (2021a) prove that for a consistent flow $F$ where the terminal flow is set to be the reward, the forward policy samples objects $x$ with probability proportional to $R(x)$.
Flow matching (FM). Bengio et al. (2021a) propose to approximate the edge flow by a model $F_\theta(s, s')$ parameterized by $\theta$ following the FM objective, i.e., $\mathcal{L}_{FM}(s) = \left(\log \sum_{(s'' \to s) \in \mathcal{A}} F_\theta(s'', s) - \log \sum_{(s \to s') \in \mathcal{A}} F_\theta(s, s')\right)^2$ for non-terminal states. At terminal states, a similar objective encourages the incoming flow to match the corresponding reward. The objective is optimized using trajectories sampled from a training policy $\pi$ with full support, such as a tempered version of $P_{F_\theta}$ or a mixture of $P_{F_\theta}$ with a uniform policy $U$, i.e., $\pi_\theta = (1 - \epsilon) P_{F_\theta} + \epsilon \cdot U$. This is similar to $\epsilon$-greedy and entropy-regularized strategies in RL to improve exploration. Bengio et al. (2021a) prove that if we reach a global minimum of the expected loss function and the training policy $\pi_\theta$ has full support, then GFlowNet samples from the target distribution.
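As an illustration, here is a minimal PyTorch-style sketch of the FM loss for a single non-terminal state; the tensors of parent and child log edge-flow predictions are assumed to be produced elsewhere by a model $F_\theta$, and the names and example values are hypothetical.

```python
import torch

def flow_matching_loss(log_inflow_terms: torch.Tensor,
                       log_outflow_terms: torch.Tensor) -> torch.Tensor:
    """Squared difference between log total inflow and log total outflow of a state.

    log_inflow_terms:  log F_theta(s'', s) for every parent edge (s'' -> s)
    log_outflow_terms: log F_theta(s, s')  for every child edge  (s -> s')
    """
    log_in = torch.logsumexp(log_inflow_terms, dim=-1)
    log_out = torch.logsumexp(log_outflow_terms, dim=-1)
    return (log_in - log_out) ** 2

# Hypothetical predicted log edge flows for one internal state.
loss = flow_matching_loss(torch.log(torch.tensor([2.0, 1.0])),
                          torch.log(torch.tensor([1.5, 1.5])))
print(loss)  # ~0, since total inflow (3.0) matches total outflow (3.0)
```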
Detailed balance (DB). Bengio et al. (2021b) propose the DB objective to avoid the computationally expensive summing operation over the parents or children of states. For learning based on DB, we train a neural network with a state flow model $F_\theta$, a forward policy model $P_{F_\theta}(\cdot|s)$, and a backward policy model $P_{B_\theta}(\cdot|s)$ parameterized by $\theta$. The optimization objective is to minimize $\mathcal{L}_{DB}(s, s') = \left(\log\left(F_\theta(s) P_{F_\theta}(s'|s)\right) - \log\left(F_\theta(s') P_{B_\theta}(s|s')\right)\right)^2$. It also samples from the target distribution if a global minimum of the expected loss is reached and $\pi_\theta$ has full support.
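A corresponding minimal sketch of the DB loss for a single transition $s \to s'$; the inputs (log state flows and log policy probabilities) are assumed to come from the parameterized models above, and the function name is hypothetical.

```python
import torch

def detailed_balance_loss(log_F_s: torch.Tensor, log_PF_s_to_s2: torch.Tensor,
                          log_F_s2: torch.Tensor, log_PB_s2_to_s: torch.Tensor) -> torch.Tensor:
    """Squared log-ratio between the two decompositions of the edge flow of s -> s'."""
    forward_term = log_F_s + log_PF_s_to_s2    # log( F(s)  * P_F(s'|s) )
    backward_term = log_F_s2 + log_PB_s2_to_s  # log( F(s') * P_B(s|s') )
    return (forward_term - backward_term) ** 2
```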
Trajectory balance (TB). Malkin et al. (2022) propose the TB objective for faster credit assignment and learning over longer trajectories. The loss function for TB is $\mathcal{L}_{TB}(\tau) = \left(\log\left(Z_\theta \prod_{t=0}^{n-1} P_{F_\theta}(s_{t+1}|s_t)\right) - \log\left(R(x) \prod_{t=0}^{n-1} P_B(s_t|s_{t+1})\right)\right)^2$, where $Z_\theta$ is a learnable parameter.
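And a sketch of the TB loss over a full trajectory, computed in log space for numerical stability; the learnable log $Z_\theta$, the per-step log forward/backward probabilities, and the terminal log-reward are assumed to be provided by the surrounding training code.

```python
import torch

def trajectory_balance_loss(log_Z: torch.Tensor,
                            log_PF_steps: torch.Tensor,  # log P_F(s_{t+1}|s_t), t = 0..n-1
                            log_PB_steps: torch.Tensor,  # log P_B(s_t|s_{t+1}), t = 0..n-1
                            log_reward: torch.Tensor) -> torch.Tensor:
    """Squared difference between the forward and backward trajectory log-flows."""
    forward = log_Z + log_PF_steps.sum()
    backward = log_reward + log_PB_steps.sum()
    return (forward - backward) ** 2
```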
3 Related Work
GFlowNets. Since the proposal of GFlowNets (Bengio et al.,2021a), there has been an increasing
interest in improving (Bengio et al.,2021b,Malkin et al.,2022), understanding, and applying this
framework to practical scenarios. It is a general-purpose, high-level probabilistic inference framework that has enabled fruitful applications (Deleu et al., 2022, Jain et al., 2022a, Zhang et al., 2022a,b).
However, previous works only consider learning based on the terminal reward, which can make
it difficult to provide a good training signal for intermediate states, especially when the reward is
sparse (i.e., significantly non-zero in only a tiny fraction of the terminal states).
Reinforcement learning (RL). Different from GFlowNets that aim to sample proportionally
to the reward function, RL learns a reward-maximization policy. Although introducing entropy
regularization to RL (Attias,2003,Haarnoja et al.,2017,2018,Ziebart,2010) can improve diversity,
this is limited to tree-structured DAGs. This is because it can only sample a terminal state x in proportion to the sum of rewards over all trajectories leading to x. It can fail on general (non-tree) DAGs (Bengio et al., 2021a) for which the same terminal state x can be obtained through a potentially large number of trajectories (and a very different number of trajectories for different x's).
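To make the trajectory-counting argument concrete, here is a small illustrative sketch (not from the paper): it counts the distinct construction paths to each cell of a tiny grid-like DAG in which every step moves right or down from (0, 0). Different terminal states are reached by very different numbers of trajectories, which is exactly what skews a sampler that weights whole trajectories.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_trajectories(cell):
    """Number of distinct right/down paths from (0, 0) to the given cell."""
    x, y = cell
    if (x, y) == (0, 0):
        return 1
    total = 0
    if x > 0:
        total += num_trajectories((x - 1, y))  # last step was a "right" move
    if y > 0:
        total += num_trajectories((x, y - 1))  # last step was a "down" move
    return total

for cell in [(2, 0), (1, 1), (2, 2)]:
    print(cell, num_trajectories(cell))  # (2, 0): 1, (1, 1): 2, (2, 2): 6
```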
Intrinsic motivation. There has been a line of research to incorporate intrinsic motiva-
tion (Burda et al.,2018,Pathak et al.,2017,Zhang et al.,2021) for improving exploration in RL.
Yet, such ideas have not been explored with GFlowNets because the current mathematical frame-
work of GFlowNets only allows for terminal rewards, unlike the standard RL frameworks. This
deficiency as well as the potential of introducing intrinsic intermediate rewards motivates this pa-
per.
4 Generative Augmented Flow Networks
Figure 1: Comparison of GFlowNet and our augmented (GAFlowNet) method in GridWorld with sparse rewards.
The potential difficulty in learning only from the terminal reward is re-
lated to the challenge of sparse rewards in RL, where most states do
not provide an informative reward. We demonstrate the sparse reward
problem for GFlowNets and reveal interesting findings based on the
GridWorld task (as shown in Figure 4) with sparse rewards. Specifically, the agent receives a reward of +1 only when it reaches one of the 3 goals located around the corners of an H×H world (with H ∈ {64, 128}), excluding the corner of the starting state, and the reward is 0 otherwise. A more detailed description of the task can be found
in Section 5.1. We evaluate the number of modes discovered by the
GFlowNet trained with TB, following Bengio et al. (2021a). As sum-
marized in Figure 1, GFlowNet training can get trapped in a subset of
the modes. Therefore, it remains a critical challenge for GFlowNets to
efficiently learn when the reward signal is sparse and non-informative.
On the other hand, there has been recent progress with intrinsic
motivation methods (Burda et al.,2018,Pathak et al.,2017) to improve exploration of RL algorithms,
where the agent learns from both a sparse extrinsic reward and a dense intrinsic bonus at each
step. Building on this, we aim to address the exploration challenge of GFlowNets by enabling intermediate rewards, and thus intrinsic rewards, in GFlowNets.
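As one common instantiation of intrinsic motivation consistent with the cited work (Burda et al., 2018), here is a minimal random network distillation (RND) style bonus sketch. Whether GAFlowNet uses exactly this form is an assumption here, and the network sizes and names are hypothetical.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """RND-style intrinsic reward: the prediction error of a trained network against a
    fixed random target network, which tends to be large for rarely visited states."""

    def __init__(self, state_dim: int, feature_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feature_dim))
        self.predictor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feature_dim))
        for p in self.target.parameters():  # the random target network stays fixed
            p.requires_grad_(False)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            target_feat = self.target(states)
        error = (self.predictor(states) - target_feat).pow(2).mean(dim=-1)
        return error  # used as the intrinsic reward; also minimized to train the predictor
```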
We now propose our learning framework, which is dubbed Generative Augmented Flow Net-
work (GAFlowNet), to take intermediate rewards into consideration.
4.1 Edge-based intermediate reward augmentation
We start our derivation from the flow matching consistency constraint, to take advantage of the
insights brought by the water flow metaphor as discussed in Section 2. By incorporating intermediate rewards $r(s_t \to s_{t+1})$ for transitions from states $s_t$ to $s_{t+1}$ into the flow matching constraint,