AIIDE Workshop on Experimental AI in Games (EXAG) 2022
Exploring Adaptive MCTS with TD Learning in miniXCOM
Kimiya Saadat and Richard Zhao
Department of Computer Science, University of Calgary, Calgary, AB, Canada
{ kimiya.saadat, richard.zhao1 }@ucalgary.ca
Abstract
In recent years, Monte Carlo tree search (MCTS) has achieved widespread adoption within the game community. Its use in conjunction with deep reinforcement learning has produced success stories in many applications. While these approaches have been implemented in various games, from simple board games to more complicated video games such as StarCraft, the use of deep neural networks requires a substantial training period. In this work, we explore on-line adaptivity in MCTS without requiring pre-training. We present MCTS-TD, an adaptive MCTS algorithm improved with temporal difference learning. We demonstrate our new approach on the game miniXCOM, a simplified version of XCOM, a popular commercial franchise consisting of several turn-based tactical games, and show how adaptivity in MCTS-TD allows for improved performance against opponents.
Introduction
Games are suitable platforms for evaluating artificial intelligence (AI) algorithms. Games often provide environments with unambiguous rules and without external interference. They can capture the essence of real-world scenarios while maintaining a well-defined environment. AI agents that have performed well in games have been adapted to work in non-game applications (Livingston and Risse 2019). However, game playing as an AI problem can be extremely challenging due to the complexity of game worlds. This is an interesting research area that has attracted many researchers' attention.
In recent years, Monte Carlo tree search (MCTS) has been applied with great success in many different game applications. Its use in conjunction with deep reinforcement learning has produced success stories in many applications (Kartal et al. 2019). However, the use of deep neural networks generally requires substantial pre-training. While pre-training is possible for many applications, it is not always possible for an AI agent to have access to an environment for pre-training. Examining the ability of a game-playing AI agent to adapt to a game and its opponents on-line, during gameplay and without any prior knowledge, is an interesting area of research.
In this work, we present MCTS-TD, an adaptive MCTS algorithm improved with temporal difference learning. Our proposed algorithm combines both the on-line nature of MCTS and the adaptive advantages of reinforcement learning, taking the best of both worlds.
Our research question is: can MCTS be guided by reinforcement learning so that it effectively adapts to its opponents' strategies in an on-line fashion? We answer this research question by conducting experiments with MCTS-TD, comparing it to the original MCTS, another approach called SARSA-UCT, and a rule-based approach on the game miniXCOM, a turn-based, grid-based tactical shooting game.
This paper makes the following contributions:
1. MCTS-TD, an improved MCTS algorithm that utilizes reinforcement learning to obtain estimated utility values of game states, and uses the utility values to guide the search in MCTS.
2. An empirical evaluation of MCTS-TD in a turn-based game to demonstrate its on-line adaptivity in different scenarios.
Related Works
Researchers have deployed a variety of AI techniques for games. Search (Pereira et al. 2021), planning (Sauma-Chacón et al. 2020), and learning (Sieusahai et al. 2021) are all popular approaches being utilized to this day. MCTS (Coulom 2006) is a search algorithm that has become the focus of much research in gaming AI. In this section, we first explore the origin of MCTS and its variants used in creating game-playing AI. Next, reinforcement learning and temporal difference learning are introduced, alongside recent use cases. Finally, we look at notable intersections between MCTS and reinforcement learning.
Monte Carlo Tree Search
The MCTS algorithm has been shown to be widely effective in creating gaming AI agents. The basis of this algorithm is building an asymmetric tree and searching the state space for the optimal solution while simulating the game and transferring the outcome of each episode to the nodes involved in that episode.
Coulom (2006) combined Monte Carlo evaluation with tree search to create an agent that was able to play the game of Go. This resulted in the creation of the Monte Carlo tree search algorithm. Chaslot et al. (2008) proposed the use of MCTS in gaming applications. A variation of the original MCTS is the Upper Confidence bounds applied to Trees (UCT) algorithm, which uses UCB1 selection as the policy for selecting the next node; this algorithm is the original UCT. There are also variations of UCT, such as standard UCT. Standard UCT adds only one new state to memory per iteration, while original UCT stores all visited states when memory is available; standard UCT also does not discount rewards, in contrast to original UCT. Moreover, standard UCT treats intermediate rewards and final rewards the same, using the sum of all rewards in the backpropagation step and updating all states with the same value, whereas original UCT updates each state with the reward and return calculated for that state (Vodopivec et al. 2017).
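As a concrete illustration of the ideas above, the following is a minimal sketch of a single UCT iteration in the style of standard UCT. It is not the implementation used in this paper: the Node class and the game interface (legal_actions, next_state, is_terminal, outcome) are hypothetical placeholders, and the exploration constant is an arbitrary choice.

    import math
    import random

    class Node:
        def __init__(self, state, action=None, parent=None):
            self.state = state
            self.action = action      # action that led from the parent to this node
            self.parent = parent
            self.children = []
            self.visits = 0
            self.total_reward = 0.0   # undiscounted sum of episode outcomes

    def ucb1(node, exploration=1.41):
        # UCB1 selection policy: average reward plus an exploration bonus.
        if node.visits == 0:
            return float("inf")
        exploit = node.total_reward / node.visits
        explore = exploration * math.sqrt(math.log(node.parent.visits) / node.visits)
        return exploit + explore

    def fully_expanded(node, game):
        return len(node.children) == len(game.legal_actions(node.state))

    def uct_iteration(root, game):
        # Selection: descend with UCB1 while nodes are fully expanded.
        node = root
        while not game.is_terminal(node.state) and node.children and fully_expanded(node, game):
            node = max(node.children, key=ucb1)
        # Expansion: add one child for an untried action (one new node per iteration).
        if not game.is_terminal(node.state):
            tried = {child.action for child in node.children}
            action = random.choice([a for a in game.legal_actions(node.state) if a not in tried])
            child = Node(game.next_state(node.state, action), action, parent=node)
            node.children.append(child)
            node = child
        # Simulation: random playout from the new node to a terminal state.
        state = node.state
        while not game.is_terminal(state):
            state = game.next_state(state, random.choice(game.legal_actions(state)))
        outcome = game.outcome(state)
        # Backpropagation: standard UCT backs up the same undiscounted outcome to every
        # node on the visited path; original UCT would instead compute a per-state
        # (possibly discounted) return.
        while node is not None:
            node.visits += 1
            node.total_reward += outcome
            node = node.parent

Repeating this iteration under a time or iteration budget and then choosing the most-visited child of the root yields the move to play.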
The MCTS algorithm and its variations have been employed in many games, including Chess, Poker, and commercial video games such as StarCraft (Świechowski et al. 2022). One of the advantages of MCTS is that it does not need domain-specific knowledge to perform. More recently, MCTS has been successfully deployed in the imperfect information card game Cribbage (Kelly and Churchill 2017), the board game Terraforming Mars (Gaina et al. 2021), and level generation problems (Bhaumik et al. 2020), among many other applications.
Reinforcement Learning
In reinforcement learning, an agent learns to decide what action to take at each step while interacting with the environment and receiving rewards. The goal of reinforcement learning algorithms is to maximize the cumulative reward signal (Sutton and Barto 2018). Two important functions that are frequently used in reinforcement learning algorithms are the value function and the policy function. The value function of a state can be defined as the expected cumulative reward if the agent starts from that state. A policy function is a mapping from a state to an action. Reinforcement learning algorithms can be categorized based on certain characteristics. One categorization is based on the presence of a model of the environment: model-based methods use a model of the environment, while model-free methods do not. In model-based approaches, a model is explicitly defined with the transition probability distribution and the reward function; such a model can predict the next state and the next reward from the current state and action. In model-free approaches, state values are learned directly, without underlying assumptions about a model.
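To make these terms concrete, a tabular value function, a policy, and a one-step model can be written as below. The two states and their rewards are invented purely for illustration and are unrelated to the experiments in this paper.

    # Value function: expected cumulative reward when starting from a state.
    value = {"in_cover": 0.8, "open_ground": 0.2}

    # Policy: a mapping from state to action.
    policy = {"in_cover": "shoot", "open_ground": "move_to_cover"}

    # Model of the environment (used by model-based methods): predicts the next
    # state and reward from the current state and action. Model-free methods
    # learn values like the table above directly, without such a function.
    def model(state, action):
        if state == "open_ground" and action == "move_to_cover":
            return "in_cover", 0.0   # (next state, reward)
        if state == "in_cover" and action == "shoot":
            return "in_cover", 1.0
        return state, 0.0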
At the center of model-free algorithms is temporal difference (TD) learning, which uses ideas from both Monte Carlo methods and dynamic programming (Sutton and Barto 2018). Temporal difference learning uses a concept called "bootstrapping", in which estimates are updated based on other estimates, so it does not need to wait until the end of a game to get an outcome from the environment (Sutton and Barto 2018). One of the notable successes of temporal difference learning is TD-Gammon, which was able to play backgammon at the level of the world championship (Tesauro 1995; Galway et al. 2008). In TD-Gammon, a neural network acts as an evaluation function for valid moves and is trained by the temporal difference approach.
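The bootstrapping idea can be made concrete with the tabular TD(0) update for state values: the estimate of the current state is moved toward the immediate reward plus the discounted estimate of the next state, so learning happens after every transition rather than only at the end of a game. The step size and discount factor below are arbitrary illustrative values.

    from collections import defaultdict

    values = defaultdict(float)    # V(s), zero for states not yet seen
    alpha, gamma = 0.1, 0.95       # step size and discount factor (illustrative)

    def td0_update(state, reward, next_state, terminal):
        # Bootstrapping: the target uses the current estimate of the next state,
        # so there is no need to wait for the final outcome of the game.
        target = reward + (0.0 if terminal else gamma * values[next_state])
        values[state] += alpha * (target - values[state])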
Many of today's state-of-the-art reinforcement learning algorithms use temporal difference learning as part of their learning mechanism. Google's DeepMind built an agent that can play a set of Atari games. This deep reinforcement learning algorithm, named deep Q-learning, can learn policies from high-dimensional input states such as images. The Q-network in this algorithm uses temporal difference as part of its learning algorithm for updating the weights (Mnih et al. 2013). Researchers have since applied reinforcement learning algorithms to many gaming applications (You et al. 2020), procedural content generation tasks (Khalifa et al. 2020), and game design and automated testing challenges (Gutiérrez-Sánchez et al. 2021). While reinforcement learning approaches that utilize deep neural networks are powerful, their substantial training time is not suitable for the research question we want to explore, which is on-line adaptation.
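For reference, the temporal difference target behind such weight updates can be sketched in tabular Q-learning form. DQN replaces the table with a neural network (plus experience replay and a target network), so this is only a simplified, illustrative stand-in rather than the actual DQN update.

    from collections import defaultdict

    q = defaultdict(float)         # Q(s, a), a tabular stand-in for the Q-network
    alpha, gamma = 0.1, 0.99       # illustrative step size and discount factor

    def q_update(state, action, reward, next_state, next_actions, terminal):
        # TD target: reward plus the discounted value of the best next action.
        best_next = 0.0
        if not terminal and next_actions:
            best_next = max(q[(next_state, a)] for a in next_actions)
        target = reward + gamma * best_next
        # In DQN the difference (target - prediction) drives a gradient step on
        # the network weights; here it simply nudges the table entry.
        q[(state, action)] += alpha * (target - q[(state, action)])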
Intersection of MCTS and Reinforcement Learning
Researchers have examined the combined use of MCTS and reinforcement learning. One of the most prominent examples is the AlphaGo algorithm (Silver et al. 2016), which uses two neural networks for the policy and value functions, pre-trained with human player data as well as self-play. During on-line play, the neural networks are used to guide a UCT-like search algorithm. This algorithm was able to defeat the best human players in Go (Vodopivec et al. 2017). Ilhan and Etaner-Uyar (2017) proposed an approach that used MCTS rollouts as the agent's past experience and used this past experience in their reinforcement learning algorithm. They assumed that the forward model of the environment was accessible. To achieve our goal, which is rapid adaptation to different strategies used by opponents, we cannot assume a