
In addition, after jointly grounding the text manual and the language-based goal (goal instruction) to environment entities, each entity is associated with explicit dynamic rules and an explicit relationship to the goal state. Thus, the language-grounded entities can also be leveraged to form better strategies.
We present two goal-based multi-agent environments based on two previous single-agent settings, i.e., MESSENGER (Hanjie et al., 2021) and RTFM (Zhong et al., 2019), which require generalization to new dynamics (i.e., how entities behave), new entity references, and partial observability. Agents are given a document that specifies the environment dynamics and a language-based goal. Note that one goal may contain multiple subgoals, so a single agent may struggle to complete it, or be unable to do so at all. In particular, after identifying the relevant information in the language descriptions, agents need to decompose the overall goal into subgoals and figure out the optimal subgoal division strategy. Note that we focus on interactive environments that are easily converted to symbolic representations, rather than raw visual observations, for efficiency, interpretability, and an emphasis on abstraction over perception.
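To make this setup concrete, the following Python snippet is a purely illustrative sketch of what a symbolic observation, text manual, goal, and entity-level subgoal division might look like; the entity names, fields, and format are our own assumptions, not the actual MESSENGER or RTFM data layout.

# Illustrative symbolic observation and goal; not the benchmarks' real format.
observation = {
    "entities": {            # entity id -> (row, col) on the grid
        "mage":   (1, 4),
        "goblin": (3, 0),
    },
    "manual": "The goblin chases the nearest agent. The mage holds the message.",
    "goal": "Deliver the message to the mage and avoid the goblin.",
}

# Each agent decomposes the shared goal into entity-level subgoals,
# committing to some entities itself and leaving others to its teammates.
subgoal_division = {
    "self":   ["reach mage"],        # entities this agent handles
    "others": ["distract goblin"],   # entities it expects teammates to handle
}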
In this paper, we propose a novel framework for language grounding in multi-agent reinforcement learning (MARL), entity divider (EnDi), to enable agents to independently learn subgoal division strategies at the entity level. Specifically, each EnDi agent first generates a language-based representation of the environment and decomposes the goal instruction into two subgoals: self and others. Note that the subgoals can be described at the entity level since the language descriptions specify the explicit relationships between the goal and all entities. Then, the EnDi agent acts in the environment based on the entities associated with its self subgoal. To account for the influence of others, the EnDi agent has two policy heads: one interacts with the environment, and the other performs opponent modeling. The EnDi agent is trained end-to-end, using reinforcement learning and supervised learning for the two policy heads, respectively. The gradient signal of the supervised opponent-modeling loss is used to regularize the subgoal division for others.
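As a rough illustration of this two-headed design, the following PyTorch-style sketch pairs a shared language-grounded encoder with one head trained by reinforcement learning and one head trained by supervised opponent modeling; the encoder, dimensions, loss form, and weighting are assumptions for illustration, not the paper's exact architecture.

# Minimal sketch of a two-headed agent; details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadAgentSketch(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Shared representation of the grounded observation, manual, and goal.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Head 1: the agent's own policy, trained with reinforcement learning.
        self.self_head = nn.Linear(hidden, n_actions)
        # Head 2: opponent model predicting other agents' actions,
        # trained with supervised learning on their observed behavior.
        self.other_head = nn.Linear(hidden, n_actions)

    def forward(self, grounded_obs: torch.Tensor):
        h = self.encoder(grounded_obs)
        return self.self_head(h), self.other_head(h)

def joint_loss(agent, grounded_obs, action, advantage, other_action, beta=0.5):
    self_logits, other_logits = agent(grounded_obs)
    # Policy-gradient loss for the agent's own (self-subgoal) policy.
    log_prob = F.log_softmax(self_logits, dim=-1)
    log_prob = log_prob.gather(-1, action.unsqueeze(-1)).squeeze(-1)
    rl_loss = -(log_prob * advantage).mean()
    # Supervised opponent-modeling loss; its gradient flows back through the
    # shared representation and thereby regularizes the subgoal division.
    sl_loss = F.cross_entropy(other_logits, other_action)
    return rl_loss + beta * sl_loss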
Our framework is the first attempt to address the challenge of grounding language for generalization to unseen dynamics in multi-agent settings. EnDi can be instantiated on top of many existing language grounding modules and is currently built and evaluated in the two multi-agent environments mentioned above. Empirically, we demonstrate that EnDi outperforms existing language-based methods in all tasks by a large margin. Importantly, EnDi also exhibits the best generalization to unseen games, i.e., zero-shot transfer. Through ablation studies, we verify the effectiveness of each component and show that EnDi indeed obtains coordinated subgoal division strategies via opponent modeling. We also argue that many language grounding problems can be addressed at the entity level.
2 Related Work
Language-grounded policy learning. Language grounding refers to learning the meaning of natural language units, e.g., utterances, phrases, or words, by leveraging the non-linguistic context. In many previous works (Wang et al., 2019; Blukis et al., 2019; Janner et al., 2018; Küttler et al., 2020; Tellex et al., 2020; Branavan et al., 2012), the text conveys a goal or instruction to the agent, and the agent produces behaviors in response after grounding the language. This setup thus encourages a strong connection between the given instruction and the policy.
More recently, many works have explored generalization from different perspectives. Hill et al. (2020a; 2019; 2020b) investigated generalization to novel entity combinations, from synthetic template commands to natural instructions given by humans, and to varying numbers of objects. Choi et al. (2021) proposed a language-guided policy learning algorithm that enables learning new tasks quickly from language corrections. In addition, Co-Reyes et al. (2018) proposed guiding policies with language to generalize to new tasks via meta-learning. Huang et al. (2022) utilized the generalization ability of large language models to obtain zero-shot planners.
However, all these works may not generalize to a new distribution of dynamics or tasks since they
encourage a strong connection between the given instruction and the policy, not the dynamics of the
environment.
Language grounding to dynamics of environments. A different line of research has focused
on utilizing manuals as auxiliary information to aid generalization. These text manuals provide