Monte Carlo Tree Search
The MCTS algorithm has been shown to be widely effective in creating game-playing AI agents. At its core, the algorithm builds an asymmetric tree and searches the state space for the optimal solution by repeatedly simulating the game and backpropagating the outcome of each episode to the nodes involved in that episode.
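As a rough illustration of this loop, the following minimal sketch (not the implementation used in this work) shows one MCTS iteration with its four usual phases: selection, expansion, simulation, and backpropagation. It assumes a hypothetical Node interface with fields children, visits, value, and untried_actions (a list), plus expand() and rollout() helpers; these names are illustrative only.

import math
import random

def mcts_iteration(root, c=1.4):
    # Selection: descend the tree with UCB1 while the current node
    # is fully expanded and has children.
    node, path = root, [root]
    while not node.untried_actions and node.children:
        node = max(node.children,
                   key=lambda ch: ch.value / ch.visits
                   + c * math.sqrt(math.log(node.visits) / ch.visits))
        path.append(node)
    # Expansion: add one child for a randomly chosen untried action.
    if node.untried_actions:
        action = random.choice(node.untried_actions)
        node.untried_actions.remove(action)
        node = node.expand(action)  # hypothetical helper: creates the child node
        path.append(node)
    # Simulation: play the episode to the end with a default (e.g., random) policy.
    outcome = node.rollout()  # hypothetical helper: returns the episode outcome
    # Backpropagation: transfer the outcome to every node visited this episode.
    for n in path:
        n.visits += 1
        n.value += outcome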
Coulom (2006) combined Monte Carlo evaluation with tree search to create an agent able to play the game of Go, which led to the creation of the Monte Carlo tree search algorithm. Chaslot et al. (2008) proposed the use of MCTS in gaming applications. A well-known variation of the original MCTS is the Upper Confidence bounds applied to Trees (UCT) algorithm, which uses UCB1 selection as the policy for choosing the next node. Beyond this original UCT, there are further variations such as standard UCT. Standard UCT adds only one new state to memory per iteration, whereas original UCT stores all visited states when memory is available; standard UCT also does not discount rewards, in contrast to original UCT. Moreover, standard UCT treats intermediate and final rewards the same, backpropagating the sum of all rewards and updating every visited state with this same value, while original UCT updates each state with the reward and return calculated for that state (Vodopivec et al. 2017).
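For reference, the UCB1 selection policy used in UCT chooses, at a node visited N times, the child i that maximizes

\[
\mathrm{UCB1}(i) = \bar{X}_i + C\sqrt{\frac{\ln N}{n_i}},
\]

where \bar{X}_i is the average reward backed up through child i, n_i is the number of times child i has been visited, and C is an exploration constant (commonly set to \sqrt{2}).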
The MCTS algorithm and its variations have been em-
ployed in many games including Chess, Poker, and commer-
cial video games such as StarCraft (Świechowski et al.
2022). One of the advantages of MCTS is that it does not need domain-specific knowledge to perform well. More recently,
MCTS has been successfully deployed in the imperfect in-
formation card game Cribbage (Kelly and Churchill 2017),
the board game Terraforming Mars (Gaina et al. 2021), and
level generation problems (Bhaumik et al. 2020), among
many other applications.
Reinforcement Learning
In reinforcement learning, an agent learns which action to take at each step while interacting with the environment and receiving rewards. The goal of reinforcement
learning algorithms is to maximize the cumulative reward
signal (Sutton and Barto 2018). Two important functions
that are frequently used in reinforcement learning algo-
rithms are the value function and policy function. A value
function of a state can be defined as the expected cumulative
reward if the agent starts from that state. A policy function
is a mapping from states to actions. Reinforcement learning algorithms can be grouped along several dimensions. One categorization
is based on the presence of a model of the environment.
Model-based methods use a model of the environment, while model-free methods do not. In model-
based approaches, a model is explicitly defined with the
transition probability distribution and the reward function.
A model of the environment can predict the next state and
the next reward using the current state and action. In model-free approaches, state values are learned directly, without assuming an underlying model of the environment.
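For concreteness, the value function of a state s under a policy \pi described above is the expected discounted return when starting from s and following \pi, with discount factor \gamma (Sutton and Barto 2018):

\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s\right].
\]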
At the core of many model-free algorithms is temporal difference (TD) learning, which combines ideas from Monte Carlo methods and dynamic programming (Sutton and Barto 2018). Temporal difference learning relies on a concept called "bootstrapping": it updates estimates based on other estimates, and therefore does not need to wait until the end of a game to obtain an outcome from the environment (Sutton and Barto 2018). One of the notable successes of
temporal difference learning is TD-Gammon, which was able to play backgammon at world championship level (Tesauro 1995, Galway et al. 2008). In TD-Gammon,
a neural network acts as an evaluation function for valid
moves and is trained by the temporal difference approach.
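As a concrete illustration of bootstrapping, the tabular TD(0) update moves the value estimate of the current state toward a target formed from the next reward and the current estimate of the next state's value, using step size \alpha and discount factor \gamma:

\[
V(s_t) \leftarrow V(s_t) + \alpha\bigl[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\bigr].
\]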
Many of today's state-of-the-art reinforcement learning algorithms use temporal difference learning as part of their learning mechanism. Google's DeepMind built an agent that can play a set of Atari games; this deep reinforcement learning algorithm, named deep Q-learning, can learn policies from high-dimensional input states such as images. The Q-network in this algorithm is trained with temporal difference updates to its weights (Mnih et al. 2013). Researchers have since applied reinforcement learning algorithms to many gaming applications (You et al. 2020), procedural content generation tasks (Khalifa et al. 2020), and game design and automated test challenges (Gutiérrez-Sánchez et al. 2021). While reinforcement learning approaches that utilize deep neural networks are powerful, their substantial training time makes them unsuitable for the research question we want to explore, namely on-line adaptation.
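For reference, the Q-network described above is trained by minimizing a temporal difference loss of roughly the following form, where \theta^{-} denotes the (fixed or lagged) parameters used to compute the bootstrapped target; the exact target formulation varies across DQN variants:

\[
L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^{2}\right].
\]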
Intersection of MCTS and Reinforcement Learning
Researchers have examined the combined use of MCTS and
reinforcement learning. One of the most prominent examples of combining MCTS and reinforcement learning is the AlphaGo (Silver et al. 2016) algorithm, which uses two neural networks for the policy and value functions, pre-trained on human player data as well as self-play. During on-line play, these networks are combined with a UCT-like tree search algorithm. AlphaGo was able to defeat the best human players in Go (Vodopivec et al. 2017). Ilhan and
Etaner-Uyar (2017) proposed an approach that uses MCTS rollouts as the agent's past experience and incorporates this experience into their reinforcement learning algorithm. They assumed that a forward model of the environment was accessible. To achieve our goal of rapid adaptation to different opponent strategies, we cannot assume a