Flexible Attention-Based Multi-Policy Fusion for Efficient Deep Reinforcement Learning

Zih-Yun Chiu1∗, Yi-Lin Tuan2∗, William Yang Wang2, Michael C. Yip1
1University of California, San Diego   2University of California, Santa Barbara
Abstract
Reinforcement learning (RL) agents have long sought to approach the efficiency
of human learning. Humans are great observers who can learn by aggregating
external knowledge from various sources, including observations of the policies
others use to attempt a task. Prior studies in RL have incorporated external
knowledge policies to help agents improve sample efficiency. However, it remains
non-trivial to perform arbitrary combinations and replacements of those policies,
an essential feature for generalization and transferability. In this work, we present
Knowledge-Grounded RL (KGRL), an RL paradigm fusing multiple knowledge
policies and aiming for human-like efficiency and flexibility. We propose a new
actor architecture for KGRL, Knowledge-Inclusive Attention Network (KIAN),
which allows free knowledge rearrangement due to embedding-based attentive
action prediction. KIAN also addresses entropy imbalance, a problem arising
in maximum entropy KGRL that hinders an agent from efficiently exploring the
environment, through a new design of policy distributions. The experimental
results demonstrate that KIAN outperforms alternative methods incorporating
external knowledge policies and achieves efficient and flexible learning. Our
implementation is available at https://github.com/Pascalson/KGRL.git.
1 Introduction
Reinforcement learning (RL) has been effectively used in a variety of fields, including physics [7, 35] and robotics [15, 30]. This success can be attributed to RL's iterative process of interacting with the environment and learning a policy to obtain positive feedback. Despite being influenced by the learning process of infants [32], the RL process can require a large number of samples to solve a task [1], indicating that the learning efficiency of RL agents is still far behind that of humans.
What learning capabilities do humans possess that RL agents still lack? Studies in social learning [4] have demonstrated that humans often observe the behavior of others in diverse situations and utilize those strategies as external knowledge to accelerate their own exploration of the solution space. This type of learning is very flexible for humans since they can freely reuse and update the knowledge they already possess. The following five properties (the last four have been mentioned in [14]) summarize the efficiency and flexibility of human learning. [Knowledge-Acquirable]: Humans can develop their own strategies by observing others. [Sample-Efficient]: Humans require fewer interactions with the environment to solve a task by learning from external knowledge. [Generalizable]: Humans can apply previously observed strategies, whether developed internally or provided externally, to unseen tasks. [Compositional]: Humans can combine strategies from multiple sources to form their knowledge set. [Incremental]: Humans do not need to relearn how to navigate the entire knowledge set from scratch when they remove outdated strategies or add new ones.
∗ indicates equal contribution. The corresponding emails are zchiu@ucsd.edu and ytuan@cs.ucsb.edu.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.03729v2 [cs.LG] 9 Oct 2023
Figure 1: An illustration of knowledge-acquirable, compositional, and incremental properties in
KGRL. Joy first learns to ride a motorcycle by observing Amy skateboarding and Jack biking. Then
Joy learns to drive a car with the knowledge set expanded by Joy’s developed strategy of motorcycling.
Possessing all five learning properties remains challenging for RL agents. Previous work has endowed an RL agent with the ability to learn from external knowledge (knowledge-acquirable) and mitigate sample inefficiency [21, 25, 27, 36], where the knowledge considered in this paper consists of state-action mappings (full definition in Section 3), including pre-collected demonstrations or policies. Among those methods, some have also allowed agents to combine policies in different forms to predict optimal actions (compositional) [25, 27]. However, these approaches may not be suitable for incremental learning, in which an agent learns a sequence of tasks using one expandable knowledge set. In such a case, whenever the knowledge set is updated by adding or replacing policies, prior methods, e.g., [27, 36], require relearning the entire multi-policy fusion process, even if the current task is similar to the previous one. This is because their designs of knowledge representations are intertwined with the knowledge-fusing mechanism, which restricts changing the number of policies in the knowledge set.
To this end, our goal is to enhance RL grounded on external knowledge policies with more flexibility.
We first introduce Knowledge-Grounded Reinforcement Learning (KGRL), an RL paradigm that
seeks to find an optimal policy of a Markov Decision Process (MDP) given a set of external policies
as illustrated in Figure 1. We then formally define the knowledge-acquirable, sample-efficient,
generalizable, compositional, and incremental properties that a well-trained KGRL agent can possess.
We propose a simple yet effective actor model, Knowledge-Inclusive Attention Network (KIAN),
for KGRL. KIAN consists of three components: (1) an internal policy that learns a self-developed
strategy, (2) embeddings that represent each policy, and (3) a query that performs embedding-based
attentive action prediction to fuse the internal and external policies. The policy-embedding and query
design in KIAN is crucial, as it enables the model to be incremental by unifying policy representations
and separating them from the policy-fusing process. Consequently, updating or adding policies to
KIAN has minimal effect on its architecture and does not require retraining the entire network.
Additionally, KIAN addresses the problem of entropy imbalance in KGRL, where agents tend to
choose only a few sub-optimal policies from the knowledge set. We provide mathematical evidence
that entropy imbalance can prevent agents from exploring the environment with multiple policies.
Then we introduce a new approach for modeling external-policy distributions to mitigate this issue.
Through experiments on grid navigation [5] and robotic manipulation [24] tasks, KIAN outperforms alternative methods incorporating external policies in terms of sample efficiency as well as the ability to perform compositional and incremental learning. Furthermore, our analyses suggest that KIAN has better generalizability when applied to environments that are either simpler or more complex.
Our contributions are:
• We introduce KGRL, an RL paradigm studying how agents learn with external policies while being knowledge-acquirable, sample-efficient, generalizable, compositional, and incremental.
• We propose KIAN, an actor model for KGRL that fuses multiple knowledge policies with better flexibility and addresses entropy imbalance for more efficient exploration.
• We demonstrate in experiments that KIAN outperforms other methods incorporating external knowledge policies under different environmental setups.
2 Related Work
A popular line of research in RL is to improve sample efficiency with demonstrations (RL from demonstrations; RLfD). Demonstrations are examples of completing a task and are represented as state-action pairs. Previous work has leveraged demonstrations by introducing them into the policy-update steps of RL [8, 11, 21, 23, 28, 34]. For example, Nair et al. [21] add a buffer of demonstrations to the RL framework and use the data sampled from it to calculate a behavior-cloning loss. This loss is combined with the regular RL loss to make the policy simultaneously imitate demonstrations and maximize the expected return. RLfD methods necessitate an adequate supply of high-quality demonstrations to achieve sample-efficient learning, and collecting such demonstrations can be time-consuming. In addition, demonstrations are low-level representations of a policy. Consequently, if an agent fails to extract a high-level strategy from them, it will merely mimic the actions without acquiring a generalizable policy. In contrast, our proposed KIAN enables an agent to learn with external policies of arbitrary quality and fuse them by evaluating the importance of each policy to the task. Thus, the agent must understand the high-level strategy of each policy rather than only imitating its actions.
Another research direction in RL focuses on utilizing sub-optimal external policies instead of demonstrations to improve sample efficiency [25, 27, 36]. For instance, Zhang et al. [36] proposed the Knowledge-Guided Policy Network (KoGuN), which learns a neural network policy from fuzzy-rule controllers. The neural network concatenates a state and all actions suggested by the fuzzy-rule controllers as an input and outputs a refined action. While effective, this method restricts the representation of a policy to a fuzzy logic network. On the other hand, Rajendran et al. [27] presented A2T (Attend, Adapt, and Transfer), an attentive deep architecture that fuses multiple policies and does not restrict the form of a policy. These policies can be non-primitive, and a learnable internal policy is included. In A2T, an attention network takes a state as an input and outputs the weights of all policies. The agent then samples an action from the distribution fused with these weights. KoGuN and A2T are the methods most closely related to our work. Building on their success, KIAN further relaxes their requirement of retraining for incremental learning, since both methods depend on a preset number of policies. Additionally, our approach mitigates the entropy imbalance issue, which can lead to inefficient exploration and was not addressed by KoGuN or A2T.
There exist other RL frameworks, such as hierarchical RL (HRL), that tackle tasks involving multiple policies. However, these frameworks are less closely related to our work than the previously mentioned methods. HRL approaches aim to decompose a complex task into a hierarchy of sub-tasks and learn a sub-policy for each sub-task [2, 6, 13, 16–18, 20, 25, 31, 33]. On the other hand, KGRL methods, including KoGuN, A2T, and KIAN, aim to address a task by observing a given set of external policies. These policies may offer partial solutions, be overly intricate, or even have limited relevance to the task at hand. Furthermore, HRL methods typically apply only one sub-policy to the environment at each time step, chosen by the high-level policy that determines the sub-task the agent is currently addressing. In contrast, KGRL seeks to apply multiple policies simultaneously within a single time step by fusing them together.
3 Problem Formulation
Our goal is to investigate how RL can be grounded on any given set of external knowledge policies to achieve knowledge-acquirable, sample-efficient, generalizable, compositional, and incremental properties. We refer to this RL paradigm as Knowledge-Grounded Reinforcement Learning (KGRL). A KGRL problem is a sequential decision-making problem that involves an environment, an agent, and a set of external policies. It can be mathematically formulated as a Knowledge-Grounded Markov Decision Process (KGMDP), which is defined by a tuple $\mathcal{M}_k = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \rho, \gamma, \mathcal{G})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $R$ is the reward function, $\rho$ is the initial state distribution, $\gamma$ is the discount factor, and $\mathcal{G}$ is the set of external knowledge policies. An external knowledge set $\mathcal{G}$ contains $n$ knowledge policies, $\mathcal{G} = \{\pi_{g_1}, \dots, \pi_{g_n}\}$. Each knowledge policy is a function that maps from the state space to the action space, $\pi_{g_j}(\cdot|\cdot): \mathcal{S} \to \mathcal{A}, \forall j = 1, \dots, n$. A knowledge mapping is not necessarily designed for the original Markov Decision Process (MDP), which is defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \rho, \gamma)$. Therefore, applying $\pi_{g_j}$ to $\mathcal{M}$ may result in a poor expected return.
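To make this formulation concrete, the following Python sketch (our own illustration, not taken from the paper or its repository) represents an external knowledge set $\mathcal{G}$ as a plain list of state-to-action mappings of arbitrary quality; the names `KnowledgePolicy`, `scripted_reach`, and `random_prior` are hypothetical.

```python
from typing import Callable, List
import numpy as np

# A knowledge policy is any state -> action mapping. It need not be optimal for
# the current MDP; it may even have been written for a different task.
KnowledgePolicy = Callable[[np.ndarray], np.ndarray]

def scripted_reach(state: np.ndarray) -> np.ndarray:
    """Hypothetical external policy: move toward a goal stored in the last
    three state dimensions, with actions clipped to [-1, 1]."""
    position, goal = state[:3], state[-3:]
    return np.clip(goal - position, -1.0, 1.0)

def random_prior(state: np.ndarray) -> np.ndarray:
    """Hypothetical low-quality external policy: uniform random actions."""
    return np.random.uniform(-1.0, 1.0, size=3)

# The external knowledge set G = {pi_g1, ..., pi_gn}. Because each entry is just
# a state -> action mapping, policies can be added, removed, or reordered freely.
external_knowledge: List[KnowledgePolicy] = [scripted_reach, random_prior]
```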
The goal of KGRL is to find an optimal policy $\pi(\cdot|\cdot;\mathcal{G}): \mathcal{S} \to \mathcal{A}$ that maximizes the expected return $\mathbb{E}_{s_0 \sim \rho, \mathcal{T}}\left[\sum_{t=0}^{T} \gamma^t R_t\right]$. Note that $\mathcal{M}_k$ and $\mathcal{M}$ share the same optimal value function, $V^*(s) = \max_{\pi \in \Pi} \mathbb{E}_{\mathcal{T}}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid s_t = s\right]$, if they are provided with the same policy class $\Pi$.
A well-trained KGRL agent can possess the following properties: knowledge-acquirable, sample-efficient, generalizable, compositional, and incremental. Here we formally define these properties.
Definition 3.1 (Knowledge-Acquirable). An agent can acquire knowledge internally instead of only following $\mathcal{G}$. We refer to this internal knowledge as an inner policy and denote it as $\pi_{in}(\cdot|\cdot): \mathcal{S} \to \mathcal{A}$.
Definition 3.2 (Sample-Efficient). An agent requires fewer samples to solve $\mathcal{M}_k$ than to solve $\mathcal{M}$.
Definition 3.3 (Generalizable). A learned policy $\pi(\cdot|\cdot;\mathcal{G})$ can solve similar but different tasks.
Definition 3.4 (Compositional). Assume that other agents have solved $m$ KGMDPs, $\mathcal{M}_k^1, \dots, \mathcal{M}_k^m$, with external knowledge sets $\mathcal{G}^1, \dots, \mathcal{G}^m$ and inner policies $\pi_{in}^1, \dots, \pi_{in}^m$. An agent is compositional if it can learn to solve a KGMDP $\mathcal{M}_k$ with the external knowledge set $\mathcal{G} \subseteq \bigcup_{i=1}^{m} \mathcal{G}^i \cup \{\pi_{in}^1, \dots, \pi_{in}^m\}$.
Definition 3.5 (Incremental). An agent is incremental if it has the following two abilities: (1) given a KGMDP $\mathcal{M}_k$ to solve within $T$ timesteps, the agent can learn to solve $\mathcal{M}_k$ with the external knowledge sets $\mathcal{G}_1, \dots, \mathcal{G}_T$, where $\mathcal{G}_t$, $t \in \{1, \dots, T\}$, is the knowledge set at time step $t$, and the $\mathcal{G}_t$ can be different from one another; (2) given a sequence of KGMDPs $\mathcal{M}_k^1, \dots, \mathcal{M}_k^m$, the agent can solve them with external knowledge sets $\mathcal{G}^1, \dots, \mathcal{G}^m$, where $\mathcal{G}^i$, $i \in \{1, \dots, m\}$, is the knowledge set for task $i$, and the $\mathcal{G}^i$ can be different from one another.
4 Knowledge-Inclusive Attention Network
Figure 2: The model architecture of KIAN.
We propose Knowledge-Inclusive Attention Network (KIAN) as an actor for KGRL. KIAN can be end-to-end trained with various RL algorithms. Illustrated in Figure 2, KIAN comprises three components: an inner actor, knowledge keys, and a query. In this section, we first describe the architecture of KIAN and its action-prediction operation. Then we introduce entropy imbalance, a problem that emerges in maximum entropy KGRL, and propose modified policy distributions for KIAN to alleviate this issue.
4.1 Model Architecture
Inner Actor. The inner actor serves the same purpose as an actor in regular RL, representing the inner knowledge learned by the agent through interactions with the environment. In KIAN, the inner actor, denoted as $\pi_{in}(\cdot|\cdot;\theta): \mathcal{S} \to \mathcal{A}$, is a learnable function approximator with parameter $\theta$. The presence of the inner actor in KIAN is crucial for the agent to be capable of acquiring knowledge, as it allows the agent to develop its own strategies. Therefore, even if the external knowledge policies in $\mathcal{G}$ are unable to solve a particular task, the agent can still discover an optimal solution.
Knowledge Keys. In KIAN, we introduce a learnable embedding vector for each knowledge policy, including $\pi_{in}$ and $\pi_{g_1}, \dots, \pi_{g_n}$, in order to create a unified representation space for all knowledge policies. Specifically, for each knowledge mapping $\pi_{in}$ or $\pi_{g_j} \in \mathcal{G}$, we assign a learnable $d_k$-dimensional vector as its key (embedding): $k_{in} \in \mathbb{R}^{d_k}$ or $k_{g_j} \in \mathbb{R}^{d_k}, \forall j \in \{1, \dots, n\}$. It is important to note that each knowledge key $k_e$ represents the entire knowledge mapping $\pi_e$, $e \in \{in, g_1, \dots, g_n\}$. Thus, $k_e$ is independent of specific states or actions. These knowledge keys and the query will perform an attention operation to determine how an agent integrates all policies.
Our knowledge-key design is essential for an agent to be compositional and incremental. By unifying the representation of policies through knowledge keys, we remove restrictions on the form of a knowledge mapping. It can take any form, such as a lookup table of state-action pairs (demonstrations) [21], if-else-based programs, fuzzy logics [36], or neural networks [25, 27]. In addition, the knowledge keys are not ordered, so $\pi_{g_1}, \dots, \pi_{g_n}$ in $\mathcal{G}$ and their corresponding keys $k_{g_1}, \dots, k_{g_n}$ can be freely rearranged. Finally, since a knowledge policy is encoded as a key independent of other knowledge keys in a joint embedding space, replacing a policy in $\mathcal{G}$ simply means replacing a knowledge key in the embedding space. This replacement requires no changes to the other parts of KIAN's architecture. Therefore, an agent can update $\mathcal{G}$ at any time without relearning a significant part of KIAN.
Query. The last component of KIAN, the query, is a function approximator that generates $d_k$-dimensional vectors for knowledge-policy fusion. The query is learnable with parameter $\phi$ and is state-dependent, so we denote it as $\Phi(\cdot;\phi): \mathcal{S} \to \mathbb{R}^{d_k}$. Given a state $s_t \in \mathcal{S}$, the query outputs a $d_k$-dimensional vector $u_t = \Phi(s_t;\phi) \in \mathbb{R}^{d_k}$, which is used to perform an attention operation with all knowledge keys. This operation determines the weights of the policies when fusing them.
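To make the architecture concrete, here is a minimal PyTorch sketch of the three components (our own simplification, not the released implementation; the `KIANActor` class name, the MLP sizes, and the discrete-action inner actor are assumptions):

```python
import torch
import torch.nn as nn

class KIANActor(nn.Module):
    """Minimal sketch of KIAN's components; hyperparameters are illustrative."""

    def __init__(self, state_dim: int, action_dim: int, n_external: int, d_k: int = 16):
        super().__init__()
        # Inner actor pi_in(.|.; theta): here a small MLP producing action logits.
        self.inner_actor = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )
        # One learnable d_k-dimensional key per policy: row 0 for the inner actor,
        # rows 1..n for the external policies. Keys are state-independent.
        self.keys = nn.Parameter(torch.randn(n_external + 1, d_k))
        # Query Phi(.; phi): maps a state s_t to a d_k-dimensional vector u_t.
        self.query = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, d_k)
        )
```

Because the knowledge keys live in a standalone parameter matrix, adding or replacing an external policy only changes the number of rows in that matrix; the inner actor and the query are untouched.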
4.2 Embedding-Based Attentive Action Prediction
Predicting an action with KIAN and a set of external knowledge policies $\mathcal{G}$ involves three steps: (1) calculating a weight for each knowledge policy using an embedding-based attention operation, (2) fusing the knowledge policies with these weights, and (3) sampling an action from the fused policy.
Embedding-Based Attention Operation. Given a state $s_t \in \mathcal{S}$, KIAN predicts a weight for each knowledge policy that reflects how likely this policy is to suggest a good action. These weights are computed from the dot products between the query and the knowledge keys:

$$w_{t,in} = \Phi(s_t;\phi) \cdot k_{in} \,/\, c_{t,in} \in \mathbb{R}, \qquad w_{t,g_j} = \Phi(s_t;\phi) \cdot k_{g_j} \,/\, c_{t,g_j} \in \mathbb{R}, \quad \forall j \in \{1, \dots, n\}, \tag{1}$$

$$[\hat{w}_{t,in}, \hat{w}_{t,g_1}, \dots, \hat{w}_{t,g_n}] = \mathrm{softmax}([w_{t,in}, w_{t,g_1}, \dots, w_{t,g_n}]), \tag{2}$$

where $c_{t,in} \in \mathbb{R}$ and $c_{t,g_j} \in \mathbb{R}$ are normalization factors; for example, if $c_{t,g_j} = \|\Phi(s_t;\phi)\|_2 \|k_{g_j}\|_2$, then $w_{t,g_j}$ becomes the cosine similarity between $\Phi(s_t;\phi)$ and $k_{g_j}$. We refer to this operation as an embedding-based attention operation since the query evaluates each knowledge key (embedding) via equation (1) to determine how much attention an agent should pay to the corresponding knowledge policy. If $w_{t,in}$ is larger than $w_{t,g_j}$, the agent relies more on its self-learned knowledge policy $\pi_{in}$; otherwise, the agent depends more on the action suggested by the knowledge policy $\pi_{g_j}$. Note that the computation of each weight is independent of the other knowledge keys, so changing the number of knowledge policies does not affect the relations among the remaining knowledge keys.
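Continuing the hypothetical `KIANActor` sketch, the weights in equations (1)-(2) with the cosine-similarity normalization can be computed as follows; the standalone function and the batch layout are our assumptions:

```python
import torch
import torch.nn.functional as F

def attention_weights(actor: "KIANActor", state: torch.Tensor) -> torch.Tensor:
    """Return normalized weights [w_in, w_g1, ..., w_gn] for a batch of states.
    The dot product divided by the product of norms (Eq. 1 with
    c = ||Phi(s)|| * ||k||) is exactly the cosine similarity, followed by a
    softmax over all policies (Eq. 2)."""
    u = actor.query(state)                              # (B, d_k)
    sim = F.cosine_similarity(u.unsqueeze(1),           # (B, 1, d_k)
                              actor.keys.unsqueeze(0),  # (1, n+1, d_k)
                              dim=-1)                   # -> (B, n+1)
    return torch.softmax(sim, dim=-1)
```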
Action Prediction for a Discrete Action Space. An MDP (or KGMDP) with a discrete action space usually involves choosing from $d_a \in \mathbb{N}$ different actions, so each knowledge policy maps from a state to a $d_a$-dimensional probability simplex: $\pi_{in}: \mathcal{S} \to \Delta^{d_a}$, $\pi_{g_j}: \mathcal{S} \to \Delta^{d_a}, \forall j = 1, \dots, n$. When choosing an action given a state $s_t \in \mathcal{S}$, KIAN first predicts $\pi(\cdot|s_t) \in \Delta^{d_a} \subset \mathbb{R}^{d_a}$ with the weights $\hat{w}_{in}, \hat{w}_{g_1}, \dots, \hat{w}_{g_n}$:

$$\pi(\cdot|s_t) = \hat{w}_{in}\,\pi_{in}(\cdot|s_t) + \sum_{j=1}^{n} \hat{w}_{g_j}\,\pi_{g_j}(\cdot|s_t). \tag{3}$$

The final action is sampled as $a_t \sim \pi(\cdot|s_t)$, where the $i$-th element of $\pi(\cdot|s_t)$ represents the probability of sampling the $i$-th action.
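As a sketch of the discrete case (again continuing the hypothetical `KIANActor` and `attention_weights` above), equation (3) followed by categorical sampling might look as follows; the `external_probs` argument, stacking the probability vectors suggested by the external policies, is an assumed interface:

```python
import torch

def sample_discrete_action(actor: "KIANActor",
                           state: torch.Tensor,
                           external_probs: torch.Tensor) -> torch.Tensor:
    """Sample a_t ~ pi(.|s_t), where pi is the weighted sum of the inner-actor
    distribution and the n external distributions (Eq. 3).
    external_probs: (B, n, d_a) probability vectors from the external policies."""
    w = attention_weights(actor, state)                         # (B, n+1)
    inner = torch.softmax(actor.inner_actor(state), dim=-1)     # (B, d_a)
    probs = torch.cat([inner.unsqueeze(1), external_probs], 1)  # (B, n+1, d_a)
    fused = (w.unsqueeze(-1) * probs).sum(dim=1)                # (B, d_a), rows sum to 1
    return torch.distributions.Categorical(probs=fused).sample()
```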
Action Prediction for a Continuous Action Space. Each knowledge policy for a continuous action space is a probability distribution that suggests a $d_a$-dimensional action for the agent to apply to the task. Following prior work [25], we model each knowledge policy as a multivariate normal distribution: $\pi_{in}(\cdot|s_t) = \mathcal{N}(\mu_{t,in}, \sigma_{t,in}^2)$ and $\pi_{g_j}(\cdot|s_t) = \mathcal{N}(\mu_{t,g_j}, \sigma_{t,g_j}^2), \forall j \in \{1, \dots, n\}$, where $\mu_{t,in} \in \mathbb{R}^{d_a}$ and $\mu_{t,g_j} \in \mathbb{R}^{d_a}$ are the means, and $\sigma_{t,in}^2 \in \mathbb{R}_{\geq 0}^{d_a}$ and $\sigma_{t,g_j}^2 \in \mathbb{R}_{\geq 0}^{d_a}$ are the diagonals of the covariance matrices. Note that we assume each random variable in an action is independent of the others.
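A minimal, standalone sketch of this Gaussian parameterization (tensor shapes and the helper name are assumptions): each policy contributes a mean vector and a standard-deviation vector, and the diagonal covariance means every action dimension is an independent univariate normal.

```python
import torch
from torch.distributions import Normal

def diagonal_gaussian_policies(mus: torch.Tensor, sigmas: torch.Tensor):
    """Build one diagonal Gaussian N(mu, diag(sigma^2)) per knowledge policy.
    mus, sigmas: (n+1, d_a) stacked means and standard deviations for the inner
    actor (row 0) and the n external policies."""
    return [Normal(loc=mu, scale=sigma) for mu, sigma in zip(mus, sigmas)]

# Example with n = 2 external policies and d_a = 3 action dimensions.
mus = torch.zeros(3, 3)
sigmas = torch.ones(3, 3)
components = diagonal_gaussian_policies(mus, sigmas)
action = components[0].sample()  # a d_a-dimensional action from the inner actor
```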
A continuous policy fused as in equation (3) becomes a mixture of normal distributions. To sample an action from this mixture of distributions without losing the important information provided by each distribution, we choose only one knowledge policy according to the weights and sample an action from it. We first sample an element from the set