Monte Carlo Tree Search
The MCTS algorithm has been shown to be widely effective in creating game-playing AI agents. At its core, the algorithm builds an asymmetric tree and searches the state space for the optimal solution by repeatedly simulating the game and backpropagating the outcome of each episode to the nodes involved in that episode.
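As a rough illustration of this loop, the following minimal sketch (not the implementation used in this work) shows one MCTS iteration with its four usual phases: selection, expansion, simulation, and backpropagation. It assumes a hypothetical Node interface with fields children, visits, value, and untried_actions (a list), plus expand() and rollout() helpers; these names are illustrative only.

import math
import random

def mcts_iteration(root, c=1.4):
    # Selection: descend the tree with UCB1 while the current node
    # is fully expanded and has children.
    node, path = root, [root]
    while not node.untried_actions and node.children:
        node = max(node.children,
                   key=lambda ch: ch.value / ch.visits
                   + c * math.sqrt(math.log(node.visits) / ch.visits))
        path.append(node)
    # Expansion: add one child for a randomly chosen untried action.
    if node.untried_actions:
        action = random.choice(node.untried_actions)
        node.untried_actions.remove(action)
        node = node.expand(action)  # hypothetical helper: creates the child node
        path.append(node)
    # Simulation: play the episode to the end with a default (e.g., random) policy.
    outcome = node.rollout()  # hypothetical helper: returns the episode outcome
    # Backpropagation: transfer the outcome to every node visited this episode.
    for n in path:
        n.visits += 1
        n.value += outcome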
Coulom (2006) combined Monte Carlo evaluation with tree search to create an agent able to play the game of Go, which led to the creation of the Monte Carlo tree search algorithm. Chaslot et al. (2008) proposed the use of MCTS in gaming applications. A well-known variation of the original MCTS is the Upper Confidence bounds applied to Trees (UCT) algorithm, which uses UCB1 selection as the policy for choosing the next node. Beyond this original UCT, there are further variations such as standard UCT. Standard UCT adds only one new state to memory per iteration, whereas original UCT stores all visited states when memory is available; standard UCT also does not discount rewards, in contrast to original UCT. Moreover, standard UCT treats intermediate and final rewards the same, backpropagating the sum of all rewards and updating every visited state with this same value, while original UCT updates each state with the reward and return calculated for that state (Vodopivec et al. 2017).
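For reference, the UCB1 selection policy used in UCT chooses, at a node visited N times, the child i that maximizes

\[
\mathrm{UCB1}(i) = \bar{X}_i + C\sqrt{\frac{\ln N}{n_i}},
\]

where \bar{X}_i is the average reward backed up through child i, n_i is the number of times child i has been visited, and C is an exploration constant (commonly set to \sqrt{2}).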
The MCTS algorithm and its variations have been em-
ployed in many games including Chess, Poker, and commer-
cial video games such as StarCraft (Świechowski et al.
2022). One of the advantages of MCTS is that it does not need domain-specific knowledge to perform well. More recently,
MCTS has been successfully deployed in the imperfect in-
formation card game Cribbage (Kelly and Churchill 2017),
the board game Terraforming Mars (Gaina et al. 2021), and
level generation problems (Bhaumik et al. 2020), among
many other applications.
Reinforcement Learning
In reinforcement learning, an agent learns which action to take at each step while interacting with the environment and receiving rewards. The goal of reinforcement
learning algorithms is to maximize the cumulative reward
signal (Sutton and Barto 2018). Two important functions
that are frequently used in reinforcement learning algo-
rithms are the value function and policy function. A value
function of a state can be defined as the expected cumulative
reward if the agent starts from that state. A policy function
is a mapping from states to actions. Reinforcement learning algorithms can be grouped along several dimensions. One categorization
is based on the presence of a model of the environment.
Model-based methods use a model of the environment, while model-free methods do not. In model-
based approaches, a model is explicitly defined with the
transition probability distribution and the reward function.
A model of the environment can predict the next state and
the next reward using the current state and action. In model-free approaches, state values are learned directly, without assuming an underlying model of the environment.
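For concreteness, the value function of a state s under a policy \pi described above is the expected discounted return when starting from s and following \pi, with discount factor \gamma (Sutton and Barto 2018):

\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s\right].
\]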
At the core of many model-free algorithms is temporal difference (TD) learning, which combines ideas from Monte Carlo methods and dynamic programming (Sutton and Barto 2018). Temporal difference learning relies on a concept called "bootstrapping": it updates estimates based on other estimates, and therefore does not need to wait until the end of a game to obtain an outcome from the environment (Sutton and Barto 2018). One of the notable successes of
temporal difference learning is TD-Gammon, which was able to play backgammon at world championship level (Tesauro 1995, Galway et al. 2008). In TD-Gammon,
a neural network acts as an evaluation function for valid
moves and is trained by the temporal difference approach.
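As a concrete illustration of bootstrapping, the tabular TD(0) update moves the value estimate of the current state toward a target formed from the next reward and the current estimate of the next state's value, using step size \alpha and discount factor \gamma:

\[
V(s_t) \leftarrow V(s_t) + \alpha\bigl[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\bigr].
\]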
Many of today's state-of-the-art reinforcement learning algorithms use temporal difference learning as part of their learning mechanism. Google's DeepMind built an agent that can play a set of Atari games; this deep reinforcement learning algorithm, named deep Q-learning, can learn policies from high-dimensional input states such as images. The Q-network in this algorithm is trained with temporal difference updates to its weights (Mnih et al. 2013). Researchers have since applied reinforcement learning algorithms to many gaming applications (You et al. 2020), procedural content generation tasks (Khalifa et al. 2020), and game design and automated test challenges (Gutiérrez-Sánchez et al. 2021). While reinforcement learning approaches that utilize deep neural networks are powerful, their substantial training time makes them unsuitable for the research question we want to explore, namely on-line adaptation.
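For reference, the Q-network described above is trained by minimizing a temporal difference loss of roughly the following form, where \theta^{-} denotes the (fixed or lagged) parameters used to compute the bootstrapped target; the exact target formulation varies across DQN variants:

\[
L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^{2}\right].
\]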
Intersection of MCTS and Reinforcement Learning
Researchers have examined the combined use of MCTS and
reinforcement learning. One of the most prominent examples of combining MCTS and reinforcement learning is the AlphaGo (Silver et al. 2016) algorithm, which uses two neural networks for the policy and value functions, pre-trained on human player data as well as self-play. During on-line play, these networks are combined with a UCT-like tree search algorithm. AlphaGo was able to defeat the best human players in Go (Vodopivec et al. 2017). Ilhan and
Etaner-Uyar (2017) proposed an approach that uses MCTS rollouts as the agent's past experience and incorporates this experience into their reinforcement learning algorithm. They assumed that a forward model of the environment was accessible. To achieve our goal of rapid adaptation to different opponent strategies, we cannot assume a