
in symbolic terms, they could simply ask the system to
avoid any state where the fact in storage area might be true.
In this case, in storage area is a propositional fact that is
true in every state where the agent is inside the storage area.
To support such a symbolic specification, the AI system
should be able to correctly ground and thereby interpret
the potential concepts that the user wants to use. In this
case, PRESCA learns a classifier that can predict whether
in a given state the agent is in the storage area. Learning
this classifier requires the user to provide positive and
negative examples of the concept in storage area. In this
work, we make this data collection process more
feedback-efficient. To do so, we leverage the precedence relations
that in storage area has with other concepts that are already
known. Figure 2(b) illustrates the causal relationships
between various concepts in the domain. For example, the
concept in storage area precedes has broken ladder. This
causal link implies that the agent must be inside the storage
area to collect the broken ladder. Note that there can also be
concepts with multiple possible causes, such as the has ladder
concept. Now, if any of the descendants of the concept
in storage area is already known, then our proposed
method tries to use this causal knowledge to gather likely
positive examples of the concept in storage area.
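To make this intuition concrete, the following is a minimal sketch (illustrative only, not PRESCA's exact procedure) of how an already grounded descendant concept, such as has broken ladder, could be used to collect likely positive examples of its unknown cause. It assumes access to recorded (state, action, next state) trajectories; the function and variable names are assumptions made for readability.

```python
# Minimal sketch (illustrative names, not the paper's implementation): if a
# known descendant concept turns true on a transition, the preceding state is
# a likely positive example of its causal parent (e.g., the agent must be in
# the storage area before it can have the broken ladder).
def gather_likely_positives(trajectories, known_descendant):
    candidates = []
    for trajectory in trajectories:          # trajectory: list of (s, a, s') tuples
        for state, _action, next_state in trajectory:
            became_true = known_descendant(next_state) and not known_descendant(state)
            if became_true:
                candidates.append(state)     # likely satisfies the parent concept
    return candidates
```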
3. Problem Setting and Assumptions
We consider an RL problem in which an agent interacts with an unknown
environment (Sutton & Barto, 2018). The environment is modeled as a
Markov Decision Process (MDP). An MDP M can be formally defined as a
tuple M = ⟨S, A, T, R, γ, S₀⟩, where: S is the set of states in the
environment, A is the set of actions that the agent can take, T is the
transition function, where T(s, a, s′) gives the probability that the
agent will be in state s′ after taking action a in state s, R is the
reward function, where R(s, a, s′) gives the reward obtained for the
transition ⟨s, a, s′⟩, γ is the discount factor, and S₀ is the set of
all possible initial states. A policy π(a|s) gives the probability that
the agent takes action a ∈ A in state s ∈ S. The value of a state s
under a policy π, Vπ(s), is the expected cumulative discounted future
reward that the agent obtains when following π from state s. For an MDP
M, the optimal policy is the policy that maximizes the value of every
state. Our setting additionally considers the agent to be goal-directed.
This means that there is a set of goal states G and the agent tries to
reach one of them. We assume that the reward function R is set up such
that any optimal policy reaches one of the goal states with probability 1.
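As a reference for the notation above, here is a minimal sketch of the MDP tuple as a Python data structure; the field names are illustrative assumptions and not tied to any particular implementation in the paper.

```python
# Illustrative container for the MDP tuple M = <S, A, T, R, gamma, S_0>
# defined above, extended with the goal set G from the goal-directed setting.
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class MDP:
    states: Set[State]                                    # S
    actions: Set[Action]                                  # A
    transition: Callable[[State, Action, State], float]   # T(s, a, s')
    reward: Callable[[State, Action, State], float]       # R(s, a, s')
    gamma: float                                          # discount factor
    initial_states: Set[State]                            # S_0
    goal_states: Set[State]                               # G (goal-directed setting)
```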
In this work, we are interested in a human-AI interaction setting, where
the agent interacts with multiple users over its lifetime. In each
interaction, there is a human-in-the-loop who wants the agent to achieve
the goal subject to their preferences. Thus, we develop PRESCA, a system
that allows any user to communicate their preference to the agent. Now,
the state representation used in the model M may be inscrutable to a
user, i.e., the user cannot directly use it to specify their preference
(e.g., an image-based state representation). Therefore, we consider the
presence of a symbolic interface, which is a set FS of propositional
state variables or concepts that any user would understand. In any given
state s, each concept CZ ∈ FS may be true or false. In PRESCA, we
currently support specifying preferences that involve the agent avoiding
states where some specified concept is true. PRESCA can be easily
extended to support more complex preferences (Section 4.5).
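One simple way to realize such an avoidance preference, once the concept's grounding is available, is to penalize transitions into states where the concept holds. The sketch below uses a Gymnasium-style wrapper with an assumed penalty value; it illustrates the idea and is not necessarily how PRESCA trains the aligned policy.

```python
# Illustrative sketch: penalize the agent whenever the grounded concept
# classifier fires on the reached state. The wrapper design and penalty
# value are assumptions, not part of the paper's method.
import gymnasium as gym

class AvoidConceptWrapper(gym.Wrapper):
    def __init__(self, env, concept, penalty=-10.0):
        super().__init__(env)
        self.concept = concept      # grounded concept: state -> bool
        self.penalty = penalty

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.concept(obs):       # the state violates the user's preference
            reward += self.penalty
        return obs, reward, terminated, truncated, info
```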
PRESCA starts with a partial vocabulary, i.e., some concepts are already
present in FS and accurate groundings for these initial concepts are
available. This is not unlike other neuro-symbolic approaches that assume
known concepts (Lyu et al., 2019; Illanes et al., 2020; Icarte et al.,
2022). However, if in a specific interaction a user's preference cannot
be specified in terms of any existing concept in FS, then PRESCA supports
learning the relevant novel concept, CT. By learning a concept CT, we
mean learning some grounding for it in the state representation used by
the agent. In our case, this grounding takes the form of a binary
classifier that takes as input a state image s and outputs true if the
concept CT is true in s and false otherwise. From now on, we will use
the notation CZ to refer to both a concept and its grounding. Now, to
learn the classifier for the target concept CT, the system must collect
positive and negative examples of the concept from the user. For this,
our system presents the user with state queries in which they must
choose whether the concept is present or absent in the state. Once the
classifier CT has been learned, it is used to train an agent policy that
aligns with the user's preference. The classifier CT is also added to
the set FS to support preference specification by future users. We now
precisely define the objective of the PRESCA system for a single
interaction with a user in which some new concept needs to be learned.
Given the environment M, a user, a symbolic interface FS, and a target
concept CT, the system must: (a) minimize the number of queries made to
the user to learn CT, and (b) use CT to train an agent policy that
achieves the goal while aligning with the user's preference.
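For concreteness, a minimal sketch of what such a grounding and labelling loop could look like is given below; the network architecture and the ask_user oracle are illustrative assumptions, not PRESCA's actual design.

```python
# Minimal sketch: a binary classifier over state images as the grounding of a
# target concept, plus a labelling loop in which every user answer is one
# query. Architecture and ask_user are illustrative assumptions.
import torch.nn as nn

class ConceptClassifier(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),       # logit; > 0 means the concept is true
        )

    def forward(self, state_image):
        return self.net(state_image)

def label_candidates(candidate_states, ask_user):
    """Query the user on each candidate state; every answer costs one query."""
    positives, negatives = [], []
    for state in candidate_states:
        (positives if ask_user(state) else negatives).append(state)
    return positives, negatives
```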
3.1. Causal model semantics
The PRESCA system uses the causal relationship between the target concept
CT and some already known concept CK ∈ FS to gather candidate states that
most likely contain CT. The causal relationship is simply a partial causal
model of the domain. Intuitively, the causal model for a domain is a
directed graph with nodes representing concepts and edges representing
some causal association. Figure 2(b) shows an example causal model for the
Minecraft domain. In the