Recent advances in natural language processing (NLP) show that scaling up lan-
guage models is beneficial in many tasks (Vaswani et al.
2017; Devlin et al. 2018; Rae et al. 2021; Chowdhery et al.
2022; Thoppilan et al. 2022), such as few-shot and zero-shot
learning (Brown et al. 2020; Kojima et al. 2022). Having
been trained on very large datasets, LLMs have the poten-
tial to capture many details about topics in their training set,
including video games. Figure 1 shows an example of suc-
cessful bug detection by a language model.
We are the first to empirically evaluate the capability of
LLMs as zero-shot video game bug detectors. Our main con-
tributions are as follows:
1. We present the GameBugDescriptions dataset, the
first dataset of videos of game bugs with step-by-step textual descriptions for bug detection purposes. This
dataset can serve as an out-of-distribution (OOD) chal-
lenge for LLMs.
2. We are the first to show that large language models have
promising capabilities to detect video game bugs.
3. We extensively evaluate the performance of two fami-
lies of large language models on the bug detection and
bug type classification tasks: InstructGPT (Ouyang et al.
2022) and OPT (Zhang et al. 2022).
4. We analyze the robustness of language models to differ-
ent descriptions of the same event for these tasks.
Our study demonstrates the promising capabilities of
LLMs to play an important role in the automation of the
game testing process.
Background and Related Work
Our work bridges the language modeling, video game, and
software engineering research communities. In this section,
we provide a brief overview of the relevant literature across
these disciplines, in particular on large language models, prompt engineering, and automated game testing.
Large Language Models and Prompt Engineering
The training objective of a language model is to learn a probability distribution over a text corpus. Such a simple training objective, combined with sufficient model scaling, can yield large language models that succeed even on tasks for which they were not explicitly trained (Kaplan et al. 2020; Brown et al. 2020; Chowdhery et al. 2022; Thoppilan et al. 2022).
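As an illustration, the standard autoregressive formulation used by GPT-style models trains the parameters $\theta$ to maximize the log-likelihood of each token given its preceding context,
\[
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right),
\]
where $x = (x_1, \dots, x_T)$ is a token sequence drawn from the training corpus; other objectives (e.g., masked language modeling) are variations on the same idea.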
Prompting or prompt engineering (Liu et al. 2021) is an
effective technique wherein we condition a language model
on a set of manually handcrafted (Schick and Schütze 2020;
Kojima et al. 2022) or automated (Gao, Fisch, and Chen
2020) templates to solve new tasks. That is, new tasks can
be solved by giving natural language instructions to a pre-
trained model without any further training, e.g., by provid-
ing sample reasoning steps to the model (Wei et al. 2022) in
a few-shot setting. Moreover, Kojima et al. (2022) showed that a prompting technique as simple as adding “Let’s think step by step” to the beginning of the answer can trigger reasoning in language models, leading to substantial accuracy improvements on multiple benchmarks in the zero-shot setting. Using graphical models, Dohan et al. (2022) introduced a general formulation for prompted models, enabling probabilistic programming with LLMs.
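To make this concrete, the sketch below shows how such a zero-shot chain-of-thought prompt can be assembled; `query_llm` is a hypothetical placeholder for whatever instruction-tuned LLM endpoint is available (e.g., InstructGPT or OPT), not an API of a specific library.

```python
# Minimal sketch of zero-shot chain-of-thought prompting (Kojima et al. 2022).
# NOTE: query_llm is a hypothetical helper; wire it to any instruction-tuned
# LLM that maps a prompt string to a text completion.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("Connect this to an LLM API of your choice.")


def zero_shot_cot(question: str) -> str:
    """Answer a question by first eliciting step-by-step reasoning."""
    # Stage 1: seed the answer with the cue that triggers reasoning.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = query_llm(reasoning_prompt)

    # Stage 2: ask the model to distill its reasoning into a final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return query_llm(answer_prompt)
```

The two-stage structure mirrors Kojima et al. (2022): the first call elicits free-form reasoning, and the second call asks the model to condense that reasoning into a final answer.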
Several successful applications of LLMs include program
synthesis (Jain et al. 2022), code generation (Chen et al.
2021), and chatbots (Thoppilan et al. 2022). However, we are
the first to apply LLMs to detect bugs in video games.
Automated Game Testing
As shown by prior work, automated game testing is chal-
lenging because game-specific knowledge and common
sense reasoning are required to detect and report bugs (Pas-
carella et al. 2018; Politowski, Petrillo, and Guéhéneuc
2021). The majority of prior work on automated game test-
ing focuses on methods to automatically play games, such
as heuristic search strategies (Keehl and Smith 2019). Auto-
mated play techniques using reinforcement learning or evo-
lutionary strategies (Zheng et al. 2019; Vinyals et al. 2019;
Berner et al. 2019; Justesen et al. 2019) allow the testing
of video games from different perspectives, such as playa-
bility, game balance, and even predicting user churn rate
(Roohi et al. 2020, 2021). However, these methods are of-
ten designed to maximize a certain reward function, which can cause the agent to progress through the game in unintended ways and even break the game’s rules or physics engine (Baker et al. 2020; Clark and Amodei 2019). More importantly, these methods lack common sense reasoning.
Other prior work has leveraged computer vision and NLP
techniques for automated video game testing. Several stud-
ies have proposed approaches for graphical bug detection us-
ing deep learning (Ling, Tollmar, and Gisslén 2020; Taesiri,
Habibi, and Fazli 2020) or digital image processing (Mack-
lon et al. 2022). However, these approaches do not incorporate common sense reasoning; for example, Macklon et al.
(2022) rely on graphical assets of the game as a test oracle.
Several other studies have proposed approaches to re-
trieve moments from gameplay videos based on text
queries (Zhang and Smith 2019; Taesiri, Macklon, and Beze-
mer 2022). However, to detect bugs with these approaches, the bug instance must be known in advance, and therefore
these gameplay event retrieval approaches do not allow for
automated detection of (previously undiscovered) bugs. Our
approach does not have this requirement and can therefore
be used to identify previously undiscovered bugs.
Finally, prior work has proposed NLP-based approaches
to automatically improve test case descriptions for manual
playtesting of games (Viggiato et al. 2022a,b), but we are
the first to leverage LLMs for bug detection in video games.
Bug Detection with Large Language Models
To automatically identify buggy events in a video game, we
propose using LLMs to reason about sequences of textual
descriptions of game events. We formulate the problem as a
question-answering (Q&A) task (Srivastava et al. 2022) for
LLMs. Here, we explain how we convert textual descriptions
of a sequence of game events into a multiple-choice question
and use a language model to identify the buggy event. In ad-
dition, we discuss how LLMs can assist us to classify the bug