Towards customizable reinforcement learning agents: Enabling preference
specification through online vocabulary expansion
Utkarsh Soni 1Nupur Thakur 1Sarath Sreedharan 2Lin Guan 1Mudit Verma 1Matthew Marquez 1
Subbarao Kambhampati 1
Abstract
There is a growing interest in developing auto-
mated agents that can work alongside humans. In
addition to completing the assigned task, such an
agent will undoubtedly be expected to behave in
a manner that is preferred by the human. This
requires the human to communicate their prefer-
ences to the agent. To achieve this, the current
approaches either require the users to specify the
reward function or interactively learn the preference
from queries that ask the user to compare
behavior. The former approach can be challeng-
ing if the internal representation used by the agent
is inscrutable to the human while the latter is un-
necessarily cumbersome for the user if their pref-
erence can be specified more easily in symbolic
terms. In this work, we propose PRESCA (PREf-
erence Specification through Concept Acquisi-
tion), a system that allows users to specify their
preferences in terms of concepts that they under-
stand. PRESCA maintains a set of such concepts
in a shared vocabulary. If the relevant concept
is not in the shared vocabulary, then it is learned.
To make learning a new concept more feedback
efficient, PRESCA leverages causal associations
between the target concept and concepts that are
already known. In addition, we use a novel data
augmentation approach to further reduce required
feedback. We evaluate PRESCA by using it on a
Minecraft environment and show that it can effec-
tively align the agent with the user’s preference.
1. Introduction
With recent successes in AI, there is a great interest in de-
ploying autonomous agents into our day-to-day lives. In
order to cohabit successfully with humans, it is highly im-
portant that the AI agent behaves in a way that is aligned
1 Arizona State University, 2 Colorado State University. Correspondence to: Utkarsh Soni <usoni1@asu.edu>.
with the human preferences. Ideally, we want a system that
will enable everyday users to specify their preferences over
AI system behavior. In the reinforcement learning (RL) lit-
erature, the current go-to approach for specifying behavioral
preferences is through preference-based reinforcement learn-
ing techniques (Christiano et al., 2017; Lee et al., 2021)
that try to learn the human’s preference interactively through
trajectory comparisons. These techniques are useful for tacit
knowledge tasks. However, it would be highly inefficient
to use these techniques in scenarios where the preference
can simply be specified in symbolic terms. Another way
for specifying behavioral preferences is through modifying
rewards, but it can be fairly non-intuitive for a lay user to
come up with a reward structure that leads to the preferred
behavior (Hadfield-Menell et al.,2017). In addition, speci-
fying rewards becomes more challenging when the system
is operating over an inscrutable high-dimensional state rep-
resentation (like images). An alternative is to allow humans
to specify preferences in symbolic terms. These symbolic
concepts can be propositional state variables that the user
understands. Thus, a more suitable framework that lets users
specify their preferences would consist of a symbolic in-
terface made of such concepts that enables communication
with the user while the agent uses some inscrutable internal
representation for the task.
There already exists a line of work (Lyu et al., 2019;
Illanes et al., 2020; Icarte et al., 2022) that lets the user
specify task-related information to an RL agent in symbolic
terms. Unfortunately, these works assume that all concepts
relevant to the task information are already known, i.e., the
grounding of each of these concepts is available. However,
since each user’s preference can be unique, their preference
could involve concepts that are not present in the agent’s
vocabulary. In this work, we propose an AI system named
PRESCA (PREference Specification through Concept Acqui-
sition) that maintains a symbolic interface made of concepts
that the user can use to specify their preferences to the agent.
If the concept that is relevant to the user’s preference is miss-
ing from the interface, then PRESCA will try to learn this
concept online. Subsequently, the concept is also added to
the interface to support preference specifications by future
users. Thus, the cost of learning the concept gets amortized when future users make use of the concept.

Figure 1. Overview of PRESCA. (1) The user specifies their preference in terms of some symbolic concept. If the concept is not present in the symbolic interface, then the user provides its causal relationship to some known concept. (2) PRESCA then generates likely positive and negative examples of the concept and queries the user for their labels. (3) After the labels are obtained and the data is augmented, PRESCA learns a classifier for the target concept and (4) incorporates the user's preference into the agent's training. (5) Finally, the concept is added to the interface.

Once the
preference has been specified, PRESCA uses it to train the
agent to align with the user’s preferences. The focus of this
work is to propose a method that allows us to learn
human concepts effectively. A simple way for the system to
learn a concept is to learn a grounding from the concept to
system states. One could learn these groundings from a set
of positive and negative examples of the concept. Obtaining
these examples is a challenging problem as there is no clear
way for the user to generate these examples. One way to
obtain these examples is for the system to present the user
with states and ask them whether the concept is present in
each state. A naive way to generate these queries could be to
randomly sample states from the environment. However, if
the positive examples are sparse in the state space, then this
strategy would lead to a possibly large number of queries.
Our objective is to make the data collection process more
feedback efficient by automatically gathering likely positive
and negative examples of the concept and then querying the
user for their labels. To this end, we leverage the causal asso-
ciation the target concept has with the concepts that already
exist in the symbolic interface. This causal knowledge can
be automatically obtained if a symbolic PDDL-like model
is available (Geffner & Bonet, 2013; Helmert, 2004).
If such a domain model is not present, then we expect the
user of the system to have some understanding about the
task dynamics and provide the causal relationship between
the target concept and some known concept. In addition to
leveraging the causal relationship to reduce human labeling
effort, PRESCA also uses a novel data augmentation strategy
to reduce the number of labels needed from the user. Figure 1
provides the overall flow of the user's interaction
with the PRESCA system. In the following section, we first
introduce the planning domain that we use to illustrate and
evaluate our approach. This is followed by a formal descrip-
tion of the environment model and the symbolic interface
used by our AI system. We then provide a methodology that
efficiently learns a new concept and then uses the learned
concept to guide the agent’s training. In the evaluation sec-
tion (section 5), we show the performance of our approach
on a Minecraft domain given increasingly complex causal
relations. We follow that up with a discussion that compares
our approach to existing AI approaches that can also be
potentially used to incorporate the user's preferences. Finally,
we discuss the possible improvements in section 7.
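As a concrete illustration of the training step above, the sketch below shows one simple way an "avoid states where a concept holds" preference could be folded into the agent's training once the concept's grounding has been learned, namely by penalizing transitions that enter such states. This is only an illustrative sketch under stated assumptions, not necessarily the mechanism PRESCA uses (described later in the paper); the wrapper class, the gym-style step()/reset() interface, and the penalty hyperparameter are all assumptions made for illustration.

```python
class AvoidConceptWrapper:
    """Environment wrapper that penalizes entering states where an undesired
    concept (as predicted by its learned grounding) holds.

    Assumptions: `env` follows a gym-like interface whose step() returns
    (observation, reward, done, info); `concept_classifier` maps an observation
    to the probability that the concept is true. All names are illustrative.
    """

    def __init__(self, env, concept_classifier, penalty=1.0, threshold=0.5):
        self.env = env
        self.concept_classifier = concept_classifier
        self.penalty = penalty
        self.threshold = threshold

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.concept_classifier(obs) > self.threshold:
            reward -= self.penalty  # discourage states where the concept is true
        return obs, reward, done, info
```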
2. Illustrative example
We will use a version of the 2-D Minecraft environment (Andreas et al., 2017) (figure 2(a)) to illustrate the ideas of the
paper and evaluate our proposed technique. In this domain,
the goal is to drop a ladder at the target. The set of actions
the agent can take includes turning left or right, moving
forward, picking up an object, a crafting action, a no-op action,
and an action that drops the ladder when the agent is at the
target. To accomplish the goal, the agent needs to first obtain
a ladder. There are two ways to achieve this. One way is for
the agent to pick up a plank and then use the crafting station
to craft a ladder. Another way is to first move into the stor-
age area (green region in figure 2(a)), then pick up a broken
ladder and then use the crafting station to repair the ladder.
Once the agent has the ladder, it can move to the target
and drop it. We show the two possible plans in figure 2(a).
Figure 2. (a) Instance of the Minecraft environment with two possible plans marked with arrows. The user prefers that the agent avoid going into the storage area (indicated by the red arrows). (b) The causal model of the Minecraft environment.

There is also a human observer who wants the agent to solve the task in the way they prefer. The observer wants the agent to avoid going inside the storage area. Since one way of solving the task involves moving into the storage area, the human must
communicate this preference to the agent. In this scenario,
the agent is operating on some state encoding that the
human cannot interpret. Thus, they cannot specify their
preference directly in terms of state features. However, if
the human is capable of communicating with the AI system
in symbolic terms, they could simply ask the system to
avoid any state where the fact in storage area might be true.
In this case, in storage area is a propositional fact that is
true in every state where the agent is inside the storage area.
To support such a symbolic specification, the AI system
should be able to correctly ground and thereby interpret
the potential concepts that the user wants to use. In this
case, PRESCA learns a classifier that can predict whether
in a given state the agent is in the storage area. Learning
this classifier requires the user to provide positive and
negative examples of the concept in storage area. In this
work, we make this data collection process more feedback
efficient. To do so, we leverage the precedence relation
that in storage area has with other concepts that are already
known. Figure 2(b) illustrates the causal relationship
between various concepts in the domain. For example, the
concept in storage area precedes has broken ladder. This
causal link implies that the agent must be inside the storage
area to collect the broken ladder. Note that there can also be
concepts with multiple possible causes like the has ladder
concept. Now, if any of the descendents of the concept
in storage area is already known, then our proposed
method tries to use this causal knowledge to gather likely
positive examples of the concept in storage area.
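To make this causal structure concrete, the partial causal model of figure 2(b) can be thought of as a plain directed graph over concept names, from which the descendants of a target concept are easy to compute. The sketch below is illustrative rather than PRESCA's implementation; the exact edge set is an assumption reconstructed from the running example, and the concept names are placeholders.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical edge set inferred from the running example: an edge u -> v
# means that concept u must become true before concept v can become true.
CAUSAL_EDGES: Dict[str, List[str]] = {
    "in_storage_area": ["has_broken_ladder"],
    "has_broken_ladder": ["has_ladder"],   # repair at the crafting station
    "has_plank": ["has_ladder"],           # craft a new ladder
    "has_ladder": ["ladder_dropped_at_target"],
}

def descendants(graph: Dict[str, List[str]], concept: str) -> Set[str]:
    """All concepts causally downstream of `concept` (BFS over the edge set)."""
    seen: Set[str] = set()
    queue = deque(graph.get(concept, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen

# If `has_broken_ladder` is already grounded, it is a descendant of the target
# concept `in_storage_area`, so it can guide the search for positive examples.
print(descendants(CAUSAL_EDGES, "in_storage_area"))
# -> {'has_broken_ladder', 'has_ladder', 'ladder_dropped_at_target'}
```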
3. Problem setting and assumptions
We consider an RL problem in which an agent interacts with an unknown environment (Sutton & Barto, 2018). The environment is modeled as a Markov Decision Process (MDP). An MDP $M$ can be formally defined as a tuple $M = \langle S, A, T, R, \gamma, S_o \rangle$, where: $S$ is the set of states in the environment, $A$ is the set of actions that the agent can take, $T$ is the transition function, where $T(s, a, s')$ gives the probability that the agent will be in state $s'$ after taking action $a$ in state $s$, $R$ is the reward function, where $R(s, a, s')$ gives the reward obtained for the transition $\langle s, a, s' \rangle$, $\gamma$ is the discounting factor, and $S_o$ is the set of all possible initial states. A policy $\pi(a \mid s)$ gives the probability that the agent will take action $a \in A$ while in the state $s \in S$. The value of a state $s$ given a policy $\pi$, $V^{\pi}(s)$, is the expected cumulative discounted future reward that the agent obtains when following $\pi$ from the state $s$. For an MDP $M$, the optimal policy is the policy that maximizes the value for every state. Our setting additionally considers the agent to be goal-directed. This means that there is a set of goal states $G$ and the agent tries to reach one of them. We assume that the reward function $R$ is set up in a way that any optimal policy must reach one of the goal states with probability 1.
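For readers who prefer code to notation, a minimal sketch of the MDP tuple and a Monte Carlo estimate of the value of a state under a fixed policy is given below. It is not taken from the paper: the data structures are illustrative assumptions, and the policy is simplified to a deterministic mapping from states to actions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Set, Tuple

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # T[(s, a)] is a list of (next_state, probability) pairs
    T: Dict[Tuple[State, Action], List[Tuple[State, float]]]
    # R(s, a, s') -> reward obtained for the transition
    R: Callable[[State, Action, State], float]
    gamma: float
    initial_states: List[State]
    goal_states: Set[State]

def estimate_value(mdp: MDP, policy: Callable[[State], Action],
                   s: State, episodes: int = 100, horizon: int = 50) -> float:
    """Monte Carlo estimate of the value of s: average discounted return of rollouts."""
    total = 0.0
    for _ in range(episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            if state in mdp.goal_states:
                break
            a = policy(state)
            next_states, probs = zip(*mdp.T[(state, a)])
            s_next = random.choices(next_states, weights=probs, k=1)[0]
            ret += discount * mdp.R(state, a, s_next)
            discount *= mdp.gamma
            state = s_next
        total += ret
    return total / episodes
```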
In this work, we are interested in a human-AI interaction
setting, where the agent will interact with multiple users
over its lifetime. In each interaction, there will be a human-
in-the-loop, who wants the agent to achieve the goal subject
to their preferences. Thus, we develop the system, PRESCA,
that would allow any user to communicate their preference
to the agent. Now, the state representation used in the model $M$ may be inscrutable to a user, i.e., the user cannot directly use it to specify their preference (e.g., an image-based state representation). Therefore, we consider the presence of a symbolic interface, which is a set $F_S$ of propositional state variables, or concepts, that any user would understand. In any given state $s$, each concept $C_Z \in F_S$ may be true or false. In PRESCA, we currently support specifying preferences that involve the agent avoiding states where some specified concept is true. PRESCA can be easily extended to support more complex preferences (section 4.5).
PRESCA starts with some partial vocabulary, i.e., some concepts are already present in $F_S$ and accurate groundings for these initial concepts are available. This is not unlike other neuro-symbolic approaches that assume known concepts (Lyu et al., 2019; Illanes et al., 2020; Icarte et al., 2022). However, in a specific interaction with a user, if their preference cannot be specified in terms of any existing concept in $F_S$, then PRESCA supports learning the relevant novel concept, $C_T$. By learning a concept $C_T$, we mean learning some grounding in the state representation used by the agent. In our case, this grounding takes the form of a binary classifier that takes as input a state image $s$ and outputs true if the concept $C_T$ is true in $s$ and false otherwise. From now on we will use the notation $C_Z$ to refer to both the concept and its grounding. Now, to learn the classifier for the target concept $C_T$, the system must collect positive and negative examples of the concept from the user. For this, our system presents the user with state queries where they must choose whether the concept is present or absent in the state. Once the classifier $C_T$ has been learned, it is used to train an agent's policy that aligns with the user's preference. Also, the classifier $C_T$ is added to the set $F_S$ to support preference specifications by future users. We now precisely define the objective of the PRESCA system for a single interaction with a user when some new concept needs to be learned. Given the environment $M$, a user, a symbolic interface $F_S$, and a target concept $C_T$, the system must: (a) minimize the number of queries made to the user to learn $C_T$, and (b) use $C_T$ to train an agent's policy that achieves the goal while aligning with the user's preference.
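The grounding described above, a binary classifier over state images, could be realized as in the following sketch. The paper does not specify an architecture or framework; the small convolutional network, the 3x64x64 input size, and the use of PyTorch are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class ConceptClassifier(nn.Module):
    """Binary grounding for a target concept C_T: maps a state image to the
    probability that C_T is true in that state. Architecture and input size
    (3x64x64) are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 1),
        )

    def forward(self, state_image: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(state_image))  # P(C_T is true)

def train_from_labels(model, states, labels, epochs=20, lr=1e-3):
    """Fit the grounding on user-labelled (state image, concept present?) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        preds = model(states).squeeze(1)
        loss = loss_fn(preds, labels.float())
        loss.backward()
        opt.step()
    return model
```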
3.1. Causal model semantics
The PRESCA system uses the causal relationship between the target concept $C_T$ and some already known concept $C_K \in F_S$ to gather candidate states that most likely contain $C_T$. The causal relationship is simply a partial causal model of the domain. Intuitively, the causal model for a domain is a directed graph with nodes representing concepts and edges representing some causal association. Figure 2(b) shows an example causal model for the Minecraft domain. In the