knowledge mapping. It can take any form, such as a lookup table of state-action pairs (demonstrations) [21], if-else-based programs, fuzzy logic [36], or neural networks [25, 27]. In addition, the knowledge keys are not ordered, so $\pi_{g_1}, \ldots, \pi_{g_n}$ in $\mathcal{G}$ and their corresponding $k_{g_1}, \ldots, k_{g_n}$ can be freely rearranged. Finally, since a knowledge policy is encoded as a key independent of the other knowledge keys in a joint embedding space, replacing a policy in $\mathcal{G}$ amounts to replacing a knowledge key in the embedding space. This replacement requires no changes to the rest of KIAN's architecture. Therefore, an agent can update $\mathcal{G}$ at any time without relearning a significant part of KIAN.
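As a minimal illustration of this modularity (an illustrative sketch, not the actual implementation), the knowledge set can be viewed as an unordered mapping from policy names to (policy, key) pairs, so swapping one entry leaves every other key and the rest of the architecture untouched:

```python
import numpy as np

d_k = 8  # dimensionality of the joint key embedding space (illustrative)

# Hypothetical knowledge set G: an unordered mapping from a policy name to a
# (knowledge policy, knowledge key) pair.  Entry order carries no meaning,
# so the pairs can be rearranged freely.
knowledge_set = {
    "demonstrations": (lambda s: np.array([0.7, 0.3]), np.random.randn(d_k)),
    "rule_based":     (lambda s: np.array([0.2, 0.8]), np.random.randn(d_k)),
}

def replace_policy(knowledge_set, name, new_policy, new_key):
    """Swap one external policy: only its own knowledge key changes; the
    query and all other knowledge keys remain untouched."""
    knowledge_set[name] = (new_policy, new_key)
```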
Query. The last component in KIAN, the query, is a function approximator that generates $d_k$-dimensional vectors for knowledge-policy fusion. The query is learnable with parameters $\phi$ and is state-dependent, so we denote it as $\Phi(\cdot;\phi): \mathcal{S} \to \mathbb{R}^{d_k}$. Given a state $s_t \in \mathcal{S}$, the query outputs a $d_k$-dimensional vector $u_t = \Phi(s_t;\phi) \in \mathbb{R}^{d_k}$, which will be used to perform an attention operation with all knowledge keys. This operation determines the weights of the policies when fusing them.
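Since the query is only assumed to be a learnable, state-dependent function approximator, a minimal sketch (a two-layer NumPy network with illustrative names, not the exact architecture used here) is:

```python
import numpy as np

def make_query(state_dim, d_k, hidden=64, rng=np.random.default_rng(0)):
    """A minimal state-dependent query Phi(.; phi): S -> R^{d_k}.
    phi collects all learnable weights and biases."""
    phi = {
        "W1": rng.normal(scale=0.1, size=(hidden, state_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(d_k, hidden)),
        "b2": np.zeros(d_k),
    }

    def query(s_t):
        h = np.tanh(phi["W1"] @ s_t + phi["b1"])
        return phi["W2"] @ h + phi["b2"]  # u_t = Phi(s_t; phi) in R^{d_k}

    return query, phi
```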
4.2 Embedding-Based Attentive Action Prediction
KIAN predicts an action with a set of external knowledge policies $\mathcal{G}$ in three steps: (1) calculating a weight for each knowledge policy using an embedding-based attention operation, (2) fusing the knowledge policies with these weights, and (3) sampling an action from the fused policy.
Embedding-Based Attention Operation. Given a state $s_t \in \mathcal{S}$, KIAN predicts a weight for each knowledge policy that indicates how likely the policy is to suggest a good action. These weights are computed from the dot products between the query and the knowledge keys:
\begin{equation}
w_{t,in} = \Phi(s_t;\phi) \cdot k_{in} / c_{t,in} \in \mathbb{R}, \qquad
w_{t,g_j} = \Phi(s_t;\phi) \cdot k_{g_j} / c_{t,g_j} \in \mathbb{R}, \quad \forall j \in \{1, \ldots, n\},
\tag{1}
\end{equation}
\begin{equation}
[\hat{w}_{t,in}, \hat{w}_{t,g_1}, \ldots, \hat{w}_{t,g_n}]^\top = \mathrm{softmax}\big([w_{t,in}, w_{t,g_1}, \ldots, w_{t,g_n}]^\top\big),
\tag{2}
\end{equation}
where $c_{t,in} \in \mathbb{R}$ and $c_{t,g_j} \in \mathbb{R}$ are normalization factors; for example, if $c_{t,g_j} = \lVert \Phi(s_t;\phi) \rVert_2 \, \lVert k_{g_j} \rVert_2$, then $w_{t,g_j}$ is the cosine similarity between $\Phi(s_t;\phi)$ and $k_{g_j}$. We refer to this operation as an embedding-based attention operation since the query evaluates each knowledge key (embedding) through equation (1) to determine how much attention an agent should pay to the corresponding knowledge policy. If $w_{t,in}$ is larger than $w_{t,g_j}$, the agent relies more on its self-learned knowledge policy $\pi_{in}$; otherwise, the agent depends more on the action suggested by the knowledge policy $\pi_{g_j}$. Note that the computation of one weight is independent of the other knowledge keys, so changing the number of knowledge policies does not affect the relations among the remaining knowledge keys.
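A minimal NumPy sketch of equations (1) and (2), assuming the cosine-similarity normalization mentioned above and illustrative variable names, is:

```python
import numpy as np

def attention_weights(u_t, k_in, k_g, eps=1e-8):
    """Embedding-based attention operation, equations (1)-(2).

    u_t  : query output Phi(s_t; phi), shape (d_k,)
    k_in : key of the agent's self-learned policy, shape (d_k,)
    k_g  : keys of the n external knowledge policies, shape (n, d_k)

    Uses c = ||u_t||_2 * ||k||_2 as the normalization factor, so each raw
    weight is the cosine similarity between the query and a key.
    """
    keys = np.vstack([k_in, k_g])                                    # (n + 1, d_k)
    norms = np.linalg.norm(u_t) * np.linalg.norm(keys, axis=1) + eps
    w = keys @ u_t / norms                                           # eq. (1)
    w_hat = np.exp(w - w.max())
    w_hat /= w_hat.sum()                                             # eq. (2): softmax
    return w_hat[0], w_hat[1:]   # hat{w}_{t,in} and [hat{w}_{t,g_1}, ..., hat{w}_{t,g_n}]
```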
Action Prediction for a Discrete Action Space. An MDP (or KGMDP) with a discrete action space usually involves choosing among $d_a \in \mathbb{N}$ different actions, so each knowledge policy maps a state to a $d_a$-dimensional probability simplex, $\pi_{in}: \mathcal{S} \to \Delta^{d_a}$ and $\pi_{g_j}: \mathcal{S} \to \Delta^{d_a}$, $\forall j = 1, \ldots, n$. When choosing an action given a state $s_t \in \mathcal{S}$, KIAN first predicts $\pi(\cdot \mid s_t) \in \Delta^{d_a} \subseteq \mathbb{R}^{d_a}$ with the weights $\hat{w}_{t,in}, \hat{w}_{t,g_1}, \ldots, \hat{w}_{t,g_n}$:
\begin{equation}
\pi(\cdot \mid s_t) = \hat{w}_{t,in} \, \pi_{in}(\cdot \mid s_t) + \sum_{j=1}^{n} \hat{w}_{t,g_j} \, \pi_{g_j}(\cdot \mid s_t).
\tag{3}
\end{equation}
The final action is sampled as $a_t \sim \pi(\cdot \mid s_t)$, where the $i$-th element of $\pi(\cdot \mid s_t)$ represents the probability of sampling the $i$-th action.
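Continuing the sketch, equation (3) and the final sampling step for a discrete action space can be written as follows (the names `p_in` and `p_g` for the policies' probability vectors are illustrative):

```python
import numpy as np

def fuse_and_sample_discrete(w_hat_in, w_hat_g, p_in, p_g,
                             rng=np.random.default_rng(0)):
    """Equation (3) for a discrete action space.

    w_hat_in : scalar weight of the self-learned policy
    w_hat_g  : weights of the external policies, shape (n,)
    p_in     : pi_in(.|s_t), shape (d_a,)
    p_g      : stacked pi_gj(.|s_t), shape (n, d_a)
    """
    pi = w_hat_in * p_in + w_hat_g @ p_g   # convex combination stays on the simplex
    a_t = rng.choice(len(pi), p=pi)        # a_t ~ pi(.|s_t)
    return pi, a_t
```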
Action Prediction for a Continuous Action Space. Each knowledge policy for a continuous action space is a probability distribution that suggests a $d_a$-dimensional action for an agent to apply to the task. Following prior work [25], we model each knowledge policy as a multivariate normal distribution, $\pi_{in}(\cdot \mid s_t) = \mathcal{N}(\mu_{t,in}, \sigma^2_{t,in})$ and $\pi_{g_j}(\cdot \mid s_t) = \mathcal{N}(\mu_{t,g_j}, \sigma^2_{t,g_j})$, $\forall j \in \{1, \ldots, n\}$, where $\mu_{t,in} \in \mathbb{R}^{d_a}$ and $\mu_{t,g_j} \in \mathbb{R}^{d_a}$ are the means, and $\sigma^2_{t,in} \in \mathbb{R}^{d_a}_{\geq 0}$ and $\sigma^2_{t,g_j} \in \mathbb{R}^{d_a}_{\geq 0}$ are the diagonals of the covariance matrices. Note that we assume the random variables within an action are independent of one another.
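As a minimal sketch of this parameterization (illustrative names, not the actual implementation), the independence assumption lets each action dimension be sampled from its own univariate normal distribution:

```python
import numpy as np

def sample_diagonal_gaussian(mu, sigma2, rng=np.random.default_rng(0)):
    """Sample an action from N(mu, diag(sigma2)).

    mu     : mean of the policy, shape (d_a,)
    sigma2 : per-dimension variances (diagonal of the covariance), shape (d_a,)
    """
    return mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)
```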
A continuous policy fused as in equation (3) becomes a mixture of normal distributions. To sample an action from this mixture of distributions without losing the important information provided by each distribution, we choose only one knowledge policy according to the weights and sample an action from it. We first sample an element from the set