Advice Conformance Verification by Reinforcement Learning agents for
Human-in-the-Loop
Mudit Verma1, Ayush Kharkwal1, and Subbarao Kambhampati1
1 SCAI, Arizona State University, AZ 85281
Abstract: Human-in-the-loop (HiL) reinforcement learning is gaining traction in domains with large action and state spaces and sparse rewards, by allowing the agent to take advice from the HiL. Beyond advice accommodation, a sequential decision-making agent must be able to express the extent to which it was able to utilize the human advice. Subsequently, the agent should provide a means for the HiL to inspect the parts of the advice that it had to reject in favor of the overall environment objective. We introduce the problem of Advice-Conformance Verification, which requires reinforcement learning (RL) agents to provide assurances to the human in the loop regarding how much of their advice is being conformed to. We then propose a tree-based lingua franca to support this communication, called a Preference Tree. We study two cases of good and bad advice scenarios in MuJoCo's Humanoid environment. Through our experiments, we show that our method can provide an interpretable means of solving the Advice-Conformance Verification problem by conveying whether or not the agent is using the human's advice. Finally, we present a human-user study with 20 participants that validates our method.
I. INTRODUCTION
Deep Reinforcement Learning has struggled with sparse-reward environments, which has led to several Human-in-the-Loop (HiL) frameworks that have shown promising success. These works [1], [3]–[5], [7], [8], [13] utilize advice or preferences from humans as a form of guidance; however, a missing aspect of these works is that the RL agent cannot provide assurances to the human user regarding the extent to which their advice was accommodated. We term this the Advice-Conformance Verification problem, which requires an RL agent to provide assurances or explanations that convey whether the agent conforms to the human advice and how much of it was set aside in the larger interest of completing the task.
It is well known in the field of Human-aware AI that humans can form expectations of the agents they are interacting with via several means [14], [15], for example when they observe the agent's behavior. Similarly, we posit that when an agent requests human advice to achieve the task as determined by the environment rewards, the human in the loop may establish the belief that the agent's success on the task is the consequence of following their advice. We build on the observation of [5] that, as long as the agent attempts to optimize the underlying environment reward, it is still able to obtain a good policy even in the presence of bad advice (as shown in Fig. 3). However, such a belief may be ill-placed (in the event of either poor advice or misspecified environment rewards), and in many situations, for example where the safety of the human in the loop is of concern, such beliefs should be corrected. Advice-Conformance Verification captures this issue by requiring agents to allow the HiL to inspect whether the advice was utilized in the intended manner and, if possible, what parts of the given advice were rejected by the agent.
II. BACKGROUND
In Human-in-the-Loop reinforcement learning works such as [1], [4], [5], [13], the learning paradigm involves the agent acting in an environment E by sensing an observation o_t ∈ O at time t. As in traditional reinforcement learning, these methods model the environment as an MDP tuple (O, T, A, R), where O and A are the agent's observation and action spaces, T is the transition function governed by the environment dynamics, and R is the environment reward. Additionally, several works take human advice into account in different ways, for example action advice [6], policy advice [11], or reward advice [5], [13]. We are interested in leveraging works that perform reward shaping [9] as a means to accommodate human advice.
The agent's aim in this class of problems is to come up with a policy π_θ that achieves the maximum possible return computed over the rewards R. Note that the agent typically has at least the environment reward R and a shaped reward R̂ that it computes using the human advice (which itself could be represented in the form of a reward function, say F). The human advice, therefore, is meant to aid the agent in achieving the task specified by the rewards R. In this work, we take [5] as the backbone HiL RL algorithm and propose a solution to the Advice-Conformance Verification problem in this setup.
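To make this reward-shaping view concrete, the following minimal sketch shows an agent stepping MuJoCo's Humanoid while training on a shaped signal R̂ = R + β·F built from an advice-derived term F. This is an illustration only, not the algorithm of [5]: the Gymnasium environment id, the advice term, the observation index, and the mixing weight β are all assumptions.

```python
import gymnasium as gym  # assumption: Gymnasium with MuJoCo bindings is available

def advice_reward(observation):
    """Hypothetical advice-derived term F, e.g. encoding a human preference
    such as 'keep the torso high'. The observation index is illustrative."""
    torso_height = observation[0]
    return 1.0 if torso_height > 1.0 else 0.0

def shaped_reward(env_reward, observation, beta=0.5):
    """Shaped reward R_hat = R + beta * F that the agent optimizes."""
    return env_reward + beta * advice_reward(observation)

env = gym.make("Humanoid-v4")
obs, _ = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()           # stand-in for the policy pi_theta
    obs, env_r, terminated, truncated, _ = env.step(action)
    r_hat = shaped_reward(env_r, obs)            # training signal combining R and F
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```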
III. METHOD
Our solution to the Advice-Conformance Verification problem is to establish a lingua franca between the human user and the RL agent in the form of a Preference Tree, which is a directed acyclic graph computed from the given preferences. The Preference Tree computed using the Human in the Loop is termed the Human-Preference Tree. We present a method to extract a Preference Tree from the RL agent at any time during the agent's training regime, referred to as the Agent's Preference Tree. We use the pair of preference trees, the one extracted by the agent from the human and the other generated by the agent, as a means to convey how the advice has been utilized by the agent in the learning process. We expect a significant deviation
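As an illustration only (the paper does not give an implementation; the node fields, weights, and tolerance below are assumptions), a Preference Tree can be represented as a small DAG of preference nodes, and the Human-Preference Tree can be walked against the Agent's Preference Tree to surface advice that appears to have been dropped:

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceNode:
    """One piece of advice, e.g. 'torso upright' or 'low energy use'.
    `weight` is a hypothetical importance score; `children` refine a preference."""
    name: str
    weight: float = 1.0
    children: list = field(default_factory=list)

def conformance_report(human_node, agent_node, tol=0.25):
    """Walk the human's tree alongside the agent's tree and report preferences
    whose weight deviates by more than `tol` or that are missing entirely."""
    deviations = []
    agent_children = {c.name: c for c in agent_node.children} if agent_node else {}
    agent_w = agent_node.weight if agent_node else 0.0
    if abs(human_node.weight - agent_w) > tol:
        deviations.append((human_node.name, human_node.weight, agent_w))
    for child in human_node.children:
        deviations += conformance_report(child, agent_children.get(child.name), tol)
    return deviations

# Hypothetical trees: the human asked for upright posture and low energy use,
# but the agent's extracted tree retains only the posture preference.
human = PreferenceNode("root", 1.0, [PreferenceNode("upright", 0.8),
                                     PreferenceNode("low_energy", 0.6)])
agent = PreferenceNode("root", 1.0, [PreferenceNode("upright", 0.7)])
print(conformance_report(human, agent))
# [('low_energy', 0.6, 0.0)] -> this part of the advice appears to have been dropped
```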