
II. RELATED WORK
1) Ad-Hoc Teaming: Ad-hoc teaming in Human-Robot Interaction (HRI) requires robot agents to adapt to unseen partners [7, 8], who may differ in knowledge, skill, and behavior. Prior work [7] proposes a general-purpose algorithm that reuses knowledge learned from previous teammates or experts to quickly adapt to new teammates. The approach takes two forms: (1) model-based, which builds a model of previous teammates' behaviors to predict and plan online, and (2) policy-based, which learns policies for previous teammates and selects an appropriate policy online. Another important challenge in ad-hoc teaming is modeling uncertainty over partner characteristics [9, 10]. In the Overcooked environment, [4] showed that incorporating human models learned from data improves agent performance compared to agents trained only with copies of themselves. Instead of training agents to partner with a general human proxy model as in [4], we train a library of strategy-specific agent policies that represent different coordination behavior patterns. Distinguishing strategies yields a policy library that captures differences in team coordination patterns that would otherwise wash out in a single general model.
2) Multi-agent Reinforcement Learning: In cooperative
multi-agent settings, self-play (SP) trains a team of agents
that work well together. A collaborative agent that excels
with the partners with which it was trained may not gen-
eralize well to new partners at test time, especially when
the new partners differ significantly from the pool used for
training [11]. Other-play (OP) [12] addresses this problem, demonstrating improved zero-shot coordination and human-AI performance on the Hanabi game [13]. A self-play training paradigm that assembles a suite of untrained, partially trained, and fully trained partners by extracting agent models at different training checkpoints has been shown to produce agents that are robust to this suite of partners [14]. Prior work [15] models opponents in deep multi-agent reinforcement learning by training neural models on the hidden-state observations of opponents. A Mixture-of-Experts architecture maintains a distribution over different opponent strategies, allowing the model to integrate different strategy patterns.
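As a rough illustration of the Mixture-of-Experts idea (a sketch for intuition only, not the architecture of [15]; all dimensions and parameters below are placeholders), each expert predicts the opponent's next-action distribution and a gating network over the observation weights the experts:

import numpy as np

rng = np.random.default_rng(0)
obs_dim, num_actions, num_experts = 16, 6, 3

# Randomly initialized parameters stand in for learned weights.
expert_weights = rng.normal(size=(num_experts, obs_dim, num_actions))
gate_weights = rng.normal(size=(obs_dim, num_experts))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predict_opponent_action(obs):
    # P(strategy | obs): gating distribution over experts (opponent strategies).
    gate = softmax(obs @ gate_weights)
    # P(action | obs, strategy): each expert's predicted action distribution.
    per_expert = softmax(obs @ expert_weights, axis=-1)
    # Mixture: marginalize over strategies to obtain P(action | obs).
    return gate @ per_expert

obs = rng.normal(size=obs_dim)
print(predict_opponent_action(obs))  # distribution over num_actions opponent actions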
3) Adaptation in Human-Robot Interaction: Past research
has studied how robots can adapt and learn from human
partners. Key to robot-to-human adaptation is understand-
ing people’s behavior through observation. Markov Deci-
sion Processes (MDPs) are a common framework for goal
recognition [16]. By learning a model of human intent and
preferences [17], robots can reason over different types of
human partners [18, 19]. In a similar vein to our work, [20] applied a best-response approach, selecting from a library of response policies the one that best matches a particular player type. Building an understanding of the human partner
requires multi-faceted models of humans that capture nu-
anced differences. Our work on adaptation focuses primarily
on adapting robot behavior to the task approach (strategy) of
a human partner. Our adaptation approach is similar to [21], where human demonstrations are clustered into dominant types and a reward function is learned for each type, with Bayesian inference then used to adapt to new users.

Fig. 2: The Overcooked experimental layouts (Cramped Room, Forced Coordination, Coordination Ring, Counter Circuit, Asymmetric Advantages). Environments vary in the amount of constrained space, actions available to different player positions, and interdependence of player actions to achieve the objective.
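The type-based adaptation pattern above can be summarized with a minimal sketch (an illustration for intuition, not the implementation of [20] or [21]): maintain a belief over partner types, update it from observed human actions with Bayes' rule, and select the response policy for the most probable type. The likelihood models, action names, and policy labels below are hypothetical placeholders.

import numpy as np

def update_type_belief(belief, likelihoods, state, human_action):
    # One Bayesian update: posterior[k] is proportional to belief[k] * P(a | s, type k).
    posterior = np.array([belief[k] * likelihoods[k](state, human_action)
                          for k in range(len(belief))])
    posterior += 1e-12  # guard against an all-zero posterior
    return posterior / posterior.sum()

def select_response_policy(belief, policy_library):
    # Best response to the most probable partner type; a soft mixture is another option.
    return policy_library[int(np.argmax(belief))]

# Toy usage with two hypothetical partner types and placeholder policies.
likelihoods = [lambda s, a: 0.8 if a == "get_onion" else 0.2,   # type 0 favors onions
               lambda s, a: 0.8 if a == "get_plate" else 0.2]   # type 1 favors plates
policy_library = ["response_policy_0", "response_policy_1"]
belief = np.ones(2) / 2
belief = update_type_belief(belief, likelihoods, state=None, human_action="get_plate")
print(select_response_policy(belief, policy_library))  # -> response_policy_1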
III. PRELIMINARIES
1) Task Scenario: To study human-robot collaboration, we use the Overcooked environment [4], a collaborative cooking task. Dyads (consisting of robot agents or humans) collaborate in a constrained shared environment (Fig. 2). Their objective is to prepare an order (onion soup) and serve it as many times as possible in an allotted time.
2) Strategies: In the Overcooked task, agents must per-
form sequences of high-level tasks to serve orders. Examples
of high-level tasks include picking up onions and plates,
placing onions into pots, and serving soup. Each high-level
task requires a sequence of lower-level subtasks (i.e., motion primitives). Teams collaborate on shared tasks in different
ways. For example, in role specialization, players take sole
responsibility for particular tasks, whereas in complete-as-
needed approaches, each partner performs the next required
task. In addition to role-oriented strategies, collaborative
approaches also prescribe the order in which tasks are
performed. Teams that serve dishes while the next orders
are cooking employ more time-efficient strategies. We define
collaborative strategies as the sequence in which high-level
tasks are interleaved and distributed across teammates. Since the actions of all team members contribute to the task approach, strategy is computed at the team level.
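For concreteness, the following sketch illustrates this team-level notion of strategy (the task names and the (player, task) log format are illustrative assumptions, not the exact representation used in our system): a strategy is read off from the ordered sequence of high-level tasks and which teammate performed each, so role-specialized and complete-as-needed patterns show up in how tasks are distributed.

from collections import Counter

# (player, high-level task) pairs, in completion order, for one episode.
team_task_sequence = [
    (0, "pickup_onion"), (0, "place_onion_in_pot"),
    (0, "pickup_onion"), (0, "place_onion_in_pot"),
    (1, "pickup_plate"), (0, "pickup_onion"),
    (0, "place_onion_in_pot"), (1, "serve_soup"),
]

def task_distribution(sequence):
    # Count, per high-level task, how often each teammate performed it.
    counts = {}
    for player, task in sequence:
        counts.setdefault(task, Counter())[player] += 1
    return counts

# Near-exclusive assignments (as here) indicate role specialization, while
# evenly split counts suggest a complete-as-needed approach.
for task, by_player in task_distribution(team_task_sequence).items():
    print(task, dict(by_player))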
3) MDP Formulation: The task is modeled as a two-player Markov decision process (MDP) defined by the tuple $\langle \mathcal{S}, A = \{A_1, A_2\}, T, R \rangle$. $\mathcal{S}$ is the set of states. The action space of a game with two players is $A = A_1 \times A_2$. The set of actions available to each player $i$ is $A_i$. The transition function $T$ determines how the state changes based on a joint action by both players, $T : \mathcal{S} \times (A_1 \times A_2) \rightarrow \mathcal{S}$. $R : \mathcal{S} \rightarrow \mathbb{R}$ is the team reward function. $\pi_i$ represents agent $i$'s policy. $Z = \{z_1, \ldots, z_K\}$ represents the set of possible team collaborative strategies. We further denote a policy that corresponds to strategy $z_k$ as $\pi_k$.
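A lightweight encoding of this formulation is sketched below for illustration (the type names are placeholders; in the experiments the components are instantiated by the Overcooked environment):

from dataclasses import dataclass
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable
JointAction = Tuple[Action, Action]                       # A = A1 x A2

@dataclass
class TwoPlayerMDP:
    states: List[State]                                   # S
    actions: Tuple[List[Action], List[Action]]            # (A1, A2)
    transition: Callable[[State, JointAction], State]     # T: S x (A1 x A2) -> S
    reward: Callable[[State], float]                      # R: S -> real-valued team reward

# An agent policy maps states to actions; a library {pi_1, ..., pi_K} would hold
# one such policy per collaborative strategy z_k in Z.
Policy = Callable[[State], Action]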
IV. APPROACH
We introduce MESH (Matching Emergent Strategies to
Humans) as an approach for coordination of collaborative