knowledge mapping. It can take any form, such as a lookup table of state-action pairs (demonstrations) [21], if-else-based programs, fuzzy logic [36], or neural networks [25, 27]. In addition, the knowledge keys are not ordered, so $\pi_{g_1}, \ldots, \pi_{g_n}$ in $\mathcal{G}$ and their corresponding $k_{g_1}, \ldots, k_{g_n}$ can be freely rearranged. Finally, since a knowledge policy is encoded as a key independent of the other knowledge keys in a joint embedding space, replacing a policy in $\mathcal{G}$ amounts to replacing a knowledge key in the embedding space. This replacement requires no changes to the rest of KIAN's architecture. Therefore, an agent can update $\mathcal{G}$ at any time without relearning a significant part of KIAN.
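As a minimal illustration of this modularity (an illustrative sketch, not the actual implementation), the knowledge set can be viewed as an unordered mapping from policy names to (policy, key) pairs, so swapping one entry leaves every other key and the rest of the architecture untouched:

```python
import numpy as np

d_k = 8  # dimensionality of the joint key embedding space (illustrative)

# Hypothetical knowledge set G: an unordered mapping from a policy name to a
# (knowledge policy, knowledge key) pair.  Entry order carries no meaning,
# so the pairs can be rearranged freely.
knowledge_set = {
    "demonstrations": (lambda s: np.array([0.7, 0.3]), np.random.randn(d_k)),
    "rule_based":     (lambda s: np.array([0.2, 0.8]), np.random.randn(d_k)),
}

def replace_policy(knowledge_set, name, new_policy, new_key):
    """Swap one external policy: only its own knowledge key changes; the
    query and all other knowledge keys remain untouched."""
    knowledge_set[name] = (new_policy, new_key)
```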
Query. The last component in KIAN, the query, is a function approximator that generates $d_k$-dimensional vectors for knowledge-policy fusion. The query is learnable with parameters $\phi$ and is state-dependent, so we denote it as $\Phi(\cdot;\phi): \mathcal{S} \to \mathbb{R}^{d_k}$. Given a state $s_t \in \mathcal{S}$, the query outputs a $d_k$-dimensional vector $u_t = \Phi(s_t;\phi) \in \mathbb{R}^{d_k}$, which will be used to perform an attention operation with all knowledge keys. This operation determines the weights of the policies when fusing them.
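Since the query is only assumed to be a learnable, state-dependent function approximator, a minimal sketch (a two-layer NumPy network with illustrative names, not the exact architecture used here) is:

```python
import numpy as np

def make_query(state_dim, d_k, hidden=64, rng=np.random.default_rng(0)):
    """A minimal state-dependent query Phi(.; phi): S -> R^{d_k}.
    phi collects all learnable weights and biases."""
    phi = {
        "W1": rng.normal(scale=0.1, size=(hidden, state_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(d_k, hidden)),
        "b2": np.zeros(d_k),
    }

    def query(s_t):
        h = np.tanh(phi["W1"] @ s_t + phi["b1"])
        return phi["W2"] @ h + phi["b2"]  # u_t = Phi(s_t; phi) in R^{d_k}

    return query, phi
```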
4.2 Embedding-Based Attentive Action Prediction
KIAN predicts an action with a set of external knowledge policies $\mathcal{G}$ in three steps: (1) calculating a weight for each knowledge policy using an embedding-based attention operation, (2) fusing the knowledge policies with these weights, and (3) sampling an action from the fused policy.
Embedding-Based Attention Operation. Given a state $s_t \in \mathcal{S}$, KIAN predicts a weight for each knowledge policy that indicates how likely the policy is to suggest a good action. These weights are computed from the dot products between the query and the knowledge keys:
\begin{equation}
w_{t,in} = \Phi(s_t;\phi) \cdot k_{in} / c_{t,in} \in \mathbb{R}, \qquad
w_{t,g_j} = \Phi(s_t;\phi) \cdot k_{g_j} / c_{t,g_j} \in \mathbb{R}, \quad \forall j \in \{1, \ldots, n\},
\tag{1}
\end{equation}
\begin{equation}
[\hat{w}_{t,in}, \hat{w}_{t,g_1}, \ldots, \hat{w}_{t,g_n}]^\top = \mathrm{softmax}\big([w_{t,in}, w_{t,g_1}, \ldots, w_{t,g_n}]^\top\big),
\tag{2}
\end{equation}
where $c_{t,in} \in \mathbb{R}$ and $c_{t,g_j} \in \mathbb{R}$ are normalization factors; for example, if $c_{t,g_j} = \lVert \Phi(s_t;\phi) \rVert_2 \, \lVert k_{g_j} \rVert_2$, then $w_{t,g_j}$ is the cosine similarity between $\Phi(s_t;\phi)$ and $k_{g_j}$. We refer to this operation as an embedding-based attention operation since the query evaluates each knowledge key (embedding) through equation (1) to determine how much attention an agent should pay to the corresponding knowledge policy. If $w_{t,in}$ is larger than $w_{t,g_j}$, the agent relies more on its self-learned knowledge policy $\pi_{in}$; otherwise, the agent depends more on the action suggested by the knowledge policy $\pi_{g_j}$. Note that the computation of one weight is independent of the other knowledge keys, so changing the number of knowledge policies does not affect the relations among the remaining knowledge keys.
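A minimal NumPy sketch of equations (1) and (2), assuming the cosine-similarity normalization mentioned above and illustrative variable names, is:

```python
import numpy as np

def attention_weights(u_t, k_in, k_g, eps=1e-8):
    """Embedding-based attention operation, equations (1)-(2).

    u_t  : query output Phi(s_t; phi), shape (d_k,)
    k_in : key of the agent's self-learned policy, shape (d_k,)
    k_g  : keys of the n external knowledge policies, shape (n, d_k)

    Uses c = ||u_t||_2 * ||k||_2 as the normalization factor, so each raw
    weight is the cosine similarity between the query and a key.
    """
    keys = np.vstack([k_in, k_g])                                    # (n + 1, d_k)
    norms = np.linalg.norm(u_t) * np.linalg.norm(keys, axis=1) + eps
    w = keys @ u_t / norms                                           # eq. (1)
    w_hat = np.exp(w - w.max())
    w_hat /= w_hat.sum()                                             # eq. (2): softmax
    return w_hat[0], w_hat[1:]   # hat{w}_{t,in} and [hat{w}_{t,g_1}, ..., hat{w}_{t,g_n}]
```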
Action Prediction for a Discrete Action Space. An MDP (or KGMDP) with a discrete action space usually involves choosing among $d_a \in \mathbb{N}$ different actions, so each knowledge policy maps a state to a $d_a$-dimensional probability simplex, $\pi_{in}: \mathcal{S} \to \Delta^{d_a}$ and $\pi_{g_j}: \mathcal{S} \to \Delta^{d_a}$, $\forall j = 1, \ldots, n$. When choosing an action given a state $s_t \in \mathcal{S}$, KIAN first predicts $\pi(\cdot \mid s_t) \in \Delta^{d_a} \subseteq \mathbb{R}^{d_a}$ with the weights $\hat{w}_{t,in}, \hat{w}_{t,g_1}, \ldots, \hat{w}_{t,g_n}$:
\begin{equation}
\pi(\cdot \mid s_t) = \hat{w}_{t,in} \, \pi_{in}(\cdot \mid s_t) + \sum_{j=1}^{n} \hat{w}_{t,g_j} \, \pi_{g_j}(\cdot \mid s_t).
\tag{3}
\end{equation}
The final action is sampled as $a_t \sim \pi(\cdot \mid s_t)$, where the $i$-th element of $\pi(\cdot \mid s_t)$ represents the probability of sampling the $i$-th action.
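Continuing the sketch, equation (3) and the final sampling step for a discrete action space can be written as follows (the names `p_in` and `p_g` for the policies' probability vectors are illustrative):

```python
import numpy as np

def fuse_and_sample_discrete(w_hat_in, w_hat_g, p_in, p_g,
                             rng=np.random.default_rng(0)):
    """Equation (3) for a discrete action space.

    w_hat_in : scalar weight of the self-learned policy
    w_hat_g  : weights of the external policies, shape (n,)
    p_in     : pi_in(.|s_t), shape (d_a,)
    p_g      : stacked pi_gj(.|s_t), shape (n, d_a)
    """
    pi = w_hat_in * p_in + w_hat_g @ p_g   # convex combination stays on the simplex
    a_t = rng.choice(len(pi), p=pi)        # a_t ~ pi(.|s_t)
    return pi, a_t
```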
Action Prediction for a Continuous Action Space. Each knowledge policy for a continuous action space is a probability distribution that suggests a $d_a$-dimensional action for an agent to apply to the task. Following prior work [25], we model each knowledge policy as a multivariate normal distribution, $\pi_{in}(\cdot \mid s_t) = \mathcal{N}(\mu_{t,in}, \sigma^2_{t,in})$ and $\pi_{g_j}(\cdot \mid s_t) = \mathcal{N}(\mu_{t,g_j}, \sigma^2_{t,g_j})$, $\forall j \in \{1, \ldots, n\}$, where $\mu_{t,in} \in \mathbb{R}^{d_a}$ and $\mu_{t,g_j} \in \mathbb{R}^{d_a}$ are the means, and $\sigma^2_{t,in} \in \mathbb{R}^{d_a}_{\geq 0}$ and $\sigma^2_{t,g_j} \in \mathbb{R}^{d_a}_{\geq 0}$ are the diagonals of the covariance matrices. Note that we assume the random variables within an action are independent of one another.
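As a minimal sketch of this parameterization (illustrative names, not the actual implementation), the independence assumption lets each action dimension be sampled from its own univariate normal distribution:

```python
import numpy as np

def sample_diagonal_gaussian(mu, sigma2, rng=np.random.default_rng(0)):
    """Sample an action from N(mu, diag(sigma2)).

    mu     : mean of the policy, shape (d_a,)
    sigma2 : per-dimension variances (diagonal of the covariance), shape (d_a,)
    """
    return mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)
```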
A continuous policy fused as in equation (3) becomes a mixture of normal distributions. To sample an action from this mixture of distributions without losing the important information provided by each distribution, we choose only one knowledge policy according to the weights and sample an action from it. We first sample an element from the set