Advice Conformance Verification by Reinforcement Learning agents for
Human-in-the-Loop
Mudit Verma1, Ayush Kharkwal1, and Subbarao Kambhampati1
1 SCAI, Arizona State University, AZ 85281
Abstract: Human-in-the-loop (HiL) reinforcement learning is gaining traction in domains with large action and state spaces and sparse rewards, by allowing the agent to take advice from the HiL. Beyond advice accommodation, a sequential decision-making agent must be able to express the extent to which it was able to utilize the human advice. Subsequently, the agent should provide a means for the HiL to inspect the parts of the advice that it had to reject in favor of the overall environment objective. We introduce the problem of Advice-Conformance Verification, which requires reinforcement learning (RL) agents to provide assurances to the human in the loop regarding how much of their advice is being conformed to. We then propose a tree-based lingua franca to support this communication, called a Preference Tree. We study two cases of good and bad advice scenarios in MuJoCo's Humanoid environment. Through our experiments, we show that our method can provide an interpretable means of solving the Advice-Conformance Verification problem by conveying whether or not the agent is using the human's advice. Finally, we present a human-user study with 20 participants that validates our method.
I. INTRODUCTION
Deep Reinforcement Learning has struggled with sparse-reward environments, which has led to several Human-in-the-Loop (HiL) frameworks that have shown promising success. These works [1], [3]–[5], [7], [8], [13] utilize advice or preferences from humans as a form of guidance; however, a missing aspect of these works is that the RL agent cannot provide assurances to the human user regarding the extent to which their advice was accommodated. We term this the Advice-Conformance Verification problem, which requires an RL agent to provide assurances or explanations that convey whether the agent conforms to the human advice and how much of it was set aside in the larger interest of completing the task.
It is well known in the field of Human-aware AI that humans can form expectations of the agents they are interacting with via several means [14], [15], for example when they observe the agent's behavior. Similarly, we posit that when an agent requests human advice to achieve the task as determined by the environment rewards, the human in the loop may establish the belief that the agent's success on the task is the consequence of following their advice. We build on the observation of [5] that, as long as the agent attempts to optimize the underlying environment reward, it is still able to obtain a good policy even in the presence of bad advice (as shown in Fig. 3). However, such a belief may be ill-placed (in the event of either poor advice or misspecified environment rewards), and in many situations, for example where the safety of the human in the loop is of concern, such beliefs should be corrected. Advice-Conformance Verification captures this issue by requiring agents to allow the HiL to inspect whether the advice was utilized in the intended manner and, if possible, what parts of the given advice were rejected by the agent.
II. BACKGROUND
In Human-in-the-Loop reinforcement learning works such as [1], [4], [5], [13], the learning paradigm involves the agent acting in an environment E by sensing an observation o_t ∈ O at time t. As in traditional reinforcement learning, these methods model the environment as an MDP tuple (O, T, A, R), where O and A are the agent's observation and action spaces, T is the transition function governed by the environment dynamics, and R is the environment reward. Additionally, several works take human advice into account in different ways, for example action advice [6], policy advice [11], or reward advice [5], [13]. We are interested in leveraging works that perform reward shaping [9] as a means to accommodate human advice.
The agent's aim in this class of problems is to come up with a policy π_θ that achieves the maximum possible return computed over the rewards R. Note that the agent typically has at least the environment reward R and a shaped reward R̂ that it computes using the human advice (which itself could be represented in the form of a reward function, say F). The human advice, therefore, is meant to aid the agent in achieving the task specified by the rewards R. In this work, we take [5] as the backbone HiL RL algorithm and propose a solution to the Advice-Conformance Verification problem in this setup.
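To make this reward-shaping view concrete, the following minimal sketch shows an agent stepping MuJoCo's Humanoid while training on a shaped signal R̂ = R + β·F built from an advice-derived term F. This is an illustration only, not the algorithm of [5]: the Gymnasium environment id, the advice term, the observation index, and the mixing weight β are all assumptions.

```python
import gymnasium as gym  # assumption: Gymnasium with MuJoCo bindings is available

def advice_reward(observation):
    """Hypothetical advice-derived term F, e.g. encoding a human preference
    such as 'keep the torso high'. The observation index is illustrative."""
    torso_height = observation[0]
    return 1.0 if torso_height > 1.0 else 0.0

def shaped_reward(env_reward, observation, beta=0.5):
    """Shaped reward R_hat = R + beta * F that the agent optimizes."""
    return env_reward + beta * advice_reward(observation)

env = gym.make("Humanoid-v4")
obs, _ = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()           # stand-in for the policy pi_theta
    obs, env_r, terminated, truncated, _ = env.step(action)
    r_hat = shaped_reward(env_r, obs)            # training signal combining R and F
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```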
III. METHOD
Our solution to the Advice-Conformance Verification problem is to establish a lingua franca between the human user and the RL agent in the form of a Preference Tree, which is a directed acyclic graph computed from the given preferences. The Preference Tree computed using the Human in the Loop is termed the Human-Preference Tree. We present a method to extract a Preference Tree from the RL agent at any time during the agent's training regime, referred to as the Agent's Preference Tree. We use the pair of preference trees, the one extracted by the agent from the human and the other generated by the agent, as a means to convey how the advice has been utilized by the agent in the learning process. We expect a significant deviation
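As an illustration only (the paper does not give an implementation; the node fields, weights, and tolerance below are assumptions), a Preference Tree can be represented as a small DAG of preference nodes, and the Human-Preference Tree can be walked against the Agent's Preference Tree to surface advice that appears to have been dropped:

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceNode:
    """One piece of advice, e.g. 'torso upright' or 'low energy use'.
    `weight` is a hypothetical importance score; `children` refine a preference."""
    name: str
    weight: float = 1.0
    children: list = field(default_factory=list)

def conformance_report(human_node, agent_node, tol=0.25):
    """Walk the human's tree alongside the agent's tree and report preferences
    whose weight deviates by more than `tol` or that are missing entirely."""
    deviations = []
    agent_children = {c.name: c for c in agent_node.children} if agent_node else {}
    agent_w = agent_node.weight if agent_node else 0.0
    if abs(human_node.weight - agent_w) > tol:
        deviations.append((human_node.name, human_node.weight, agent_w))
    for child in human_node.children:
        deviations += conformance_report(child, agent_children.get(child.name), tol)
    return deviations

# Hypothetical trees: the human asked for upright posture and low energy use,
# but the agent's extracted tree retains only the posture preference.
human = PreferenceNode("root", 1.0, [PreferenceNode("upright", 0.8),
                                     PreferenceNode("low_energy", 0.6)])
agent = PreferenceNode("root", 1.0, [PreferenceNode("upright", 0.7)])
print(conformance_report(human, agent))
# [('low_energy', 0.6, 0.0)] -> this part of the advice appears to have been dropped
```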