
in symbolic terms, they could simply ask the system to
avoid any state where the fact in storage area might be true.
In this case, in storage area is a propositional fact that is
true in every state where the agent is inside the storage area.
To support such a symbolic specification, the AI system
should be able to correctly ground and thereby interpret
the potential concepts that the user wants to use. In this
case, PRESCA learns a classifier that can predict whether
in a given state the agent is in the storage area. Learning
this classifier requires the user to provide positive and
negative examples of the concept in storage area. In this
work, we make this data collection process more
feedback-efficient. To do so, we leverage the precedence relations
that in storage area has with other concepts that are already
known. Figure 2(b) illustrates the causal relationships
between various concepts in the domain. For example, the
concept in storage area precedes has broken ladder. This
causal link implies that the agent must be inside the storage
area to collect the broken ladder. Note that there can also be
concepts with multiple possible causes, such as the has ladder
concept. Now, if any of the descendants of the concept
in storage area is already known, then our proposed
method tries to use this causal knowledge to gather likely
positive examples of the concept in storage area.
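To make this intuition concrete, the following is a minimal sketch (illustrative only, not PRESCA's exact procedure) of how an already grounded descendant concept, such as has broken ladder, could be used to collect likely positive examples of its unknown cause. It assumes access to recorded (state, action, next state) trajectories; the function and variable names are assumptions made for readability.

```python
# Minimal sketch (illustrative names, not the paper's implementation): if a
# known descendant concept turns true on a transition, the preceding state is
# a likely positive example of its causal parent (e.g., the agent must be in
# the storage area before it can have the broken ladder).
def gather_likely_positives(trajectories, known_descendant):
    candidates = []
    for trajectory in trajectories:          # trajectory: list of (s, a, s') tuples
        for state, _action, next_state in trajectory:
            became_true = known_descendant(next_state) and not known_descendant(state)
            if became_true:
                candidates.append(state)     # likely satisfies the parent concept
    return candidates
```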
3. Problem Setting and Assumptions
We consider an RL problem in which an agent interacts with an unknown
environment (Sutton & Barto, 2018). The environment is modeled as a
Markov Decision Process (MDP). An MDP M can be formally defined as a
tuple M = ⟨S, A, T, R, γ, S₀⟩, where: S is the set of states in the
environment, A is the set of actions that the agent can take, T is the
transition function, where T(s, a, s′) gives the probability that the
agent will be in state s′ after taking action a in state s, R is the
reward function, where R(s, a, s′) gives the reward obtained for the
transition ⟨s, a, s′⟩, γ is the discount factor, and S₀ is the set of
all possible initial states. A policy π(a|s) gives the probability that
the agent takes action a ∈ A in state s ∈ S. The value of a state s
under a policy π, Vπ(s), is the expected cumulative discounted future
reward that the agent obtains when following π from state s. For an MDP
M, the optimal policy is the policy that maximizes the value of every
state. Our setting additionally considers the agent to be goal-directed.
This means that there is a set of goal states G and the agent tries to
reach one of them. We assume that the reward function R is set up such
that any optimal policy reaches one of the goal states with probability 1.
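As a reference for the notation above, here is a minimal sketch of the MDP tuple as a Python data structure; the field names are illustrative assumptions and not tied to any particular implementation in the paper.

```python
# Illustrative container for the MDP tuple M = <S, A, T, R, gamma, S_0>
# defined above, extended with the goal set G from the goal-directed setting.
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class MDP:
    states: Set[State]                                    # S
    actions: Set[Action]                                  # A
    transition: Callable[[State, Action, State], float]   # T(s, a, s')
    reward: Callable[[State, Action, State], float]       # R(s, a, s')
    gamma: float                                          # discount factor
    initial_states: Set[State]                            # S_0
    goal_states: Set[State]                               # G (goal-directed setting)
```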
In this work, we are interested in a human-AI interaction setting, where
the agent interacts with multiple users over its lifetime. In each
interaction, there is a human-in-the-loop who wants the agent to achieve
the goal subject to their preferences. Thus, we develop PRESCA, a system
that allows any user to communicate their preference to the agent. Now,
the state representation used in the model M may be inscrutable to a
user, i.e., the user cannot directly use it to specify their preference
(e.g., an image-based state representation). Therefore, we consider the
presence of a symbolic interface, which is a set FS of propositional
state variables or concepts that any user would understand. In any given
state s, each concept CZ ∈ FS may be true or false. In PRESCA, we
currently support specifying preferences that involve the agent avoiding
states where some specified concept is true. PRESCA can be easily
extended to support more complex preferences (Section 4.5).
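One simple way to realize such an avoidance preference, once the concept's grounding is available, is to penalize transitions into states where the concept holds. The sketch below uses a Gymnasium-style wrapper with an assumed penalty value; it illustrates the idea and is not necessarily how PRESCA trains the aligned policy.

```python
# Illustrative sketch: penalize the agent whenever the grounded concept
# classifier fires on the reached state. The wrapper design and penalty
# value are assumptions, not part of the paper's method.
import gymnasium as gym

class AvoidConceptWrapper(gym.Wrapper):
    def __init__(self, env, concept, penalty=-10.0):
        super().__init__(env)
        self.concept = concept      # grounded concept: state -> bool
        self.penalty = penalty

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.concept(obs):       # the state violates the user's preference
            reward += self.penalty
        return obs, reward, terminated, truncated, info
```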
PRESCA starts with a partial vocabulary, i.e., some concepts are already
present in FS and accurate groundings for these initial concepts are
available. This is not unlike other neuro-symbolic approaches that assume
known concepts (Lyu et al., 2019; Illanes et al., 2020; Icarte et al.,
2022). However, if in a specific interaction a user's preference cannot
be specified in terms of any existing concept in FS, then PRESCA supports
learning the relevant novel concept, CT. By learning a concept CT, we
mean learning some grounding for it in the state representation used by
the agent. In our case, this grounding takes the form of a binary
classifier that takes as input a state image s and outputs true if the
concept CT is true in s and false otherwise. From now on, we will use
the notation CZ to refer to both a concept and its grounding. Now, to
learn the classifier for the target concept CT, the system must collect
positive and negative examples of the concept from the user. For this,
our system presents the user with state queries in which they must
choose whether the concept is present or absent in the state. Once the
classifier CT has been learned, it is used to train an agent policy that
aligns with the user's preference. The classifier CT is also added to
the set FS to support preference specification by future users. We now
precisely define the objective of the PRESCA system for a single
interaction with a user in which some new concept needs to be learned.
Given the environment M, a user, a symbolic interface FS, and a target
concept CT, the system must: (a) minimize the number of queries made to
the user to learn CT, and (b) use CT to train an agent policy that
achieves the goal while aligning with the user's preference.
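For concreteness, a minimal sketch of what such a grounding and labelling loop could look like is given below; the network architecture and the ask_user oracle are illustrative assumptions, not PRESCA's actual design.

```python
# Minimal sketch: a binary classifier over state images as the grounding of a
# target concept, plus a labelling loop in which every user answer is one
# query. Architecture and ask_user are illustrative assumptions.
import torch.nn as nn

class ConceptClassifier(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),       # logit; > 0 means the concept is true
        )

    def forward(self, state_image):
        return self.net(state_image)

def label_candidates(candidate_states, ask_user):
    """Query the user on each candidate state; every answer costs one query."""
    positives, negatives = [], []
    for state in candidate_states:
        (positives if ask_user(state) else negatives).append(state)
    return positives, negatives
```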
3.1. Causal model semantics
The PRESCA system uses the causal relationship between the target concept
CT and some already known concept CK ∈ FS to gather candidate states that
most likely contain CT. The causal relationship is simply a partial causal
model of the domain. Intuitively, the causal model for a domain is a
directed graph with nodes representing concepts and edges representing
some causal association. Figure 2(b) shows an example causal model for the
Minecraft domain. In the