Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
Mohammad Reza Taesiri Finlay Macklon Yihe Wang Hengshuo Shen
Cor-Paul Bezemer
University of Alberta
{taesiri,macklon,yihe2,hengshuo,bezemer}@ualberta.ca
arXiv:2210.02506v1 [cs.CL] 5 Oct 2022
Abstract
Video game testing requires game-specific knowledge as well
as common sense reasoning about the events in the game.
While AI-driven agents can satisfy the first requirement, it
is not yet possible to meet the second requirement automati-
cally. Therefore, video game testing often still relies on man-
ual testing, and human testers are required to play the game
thoroughly to detect bugs. As a result, it is challenging to fully
automate game testing. In this study, we explore the possibil-
ity of leveraging the zero-shot capabilities of large language
models for video game bug detection. By formulating the bug
detection problem as a question-answering task, we show that
large language models can identify which event is buggy in
a sequence of textual descriptions of events from a game.
To this end, we introduce the GameBugDescriptions
benchmark dataset, which consists of 167 buggy gameplay
videos and a total of 334 question-answer pairs across 8
games. We extensively evaluate the performance of six mod-
els across the OPT and InstructGPT large language model
families on our benchmark dataset. Our results demonstrate
the promise of employing language models to detect video
game bugs. With the proper prompting technique, we could
achieve an accuracy of 70.66%, and on some video games,
up to 78.94%. Our code, evaluation data and the benchmark
can be found on https://asgaardlab.github.io/LLMxBugs
Introduction
Similar to other software products, a video game must be
thoroughly tested to assure its quality. Game testing is an
umbrella term for many types of tests that cover different
aspects of the game. For example, a rendering test aims to
verify the visual quality of the output, whereas a gameplay
test assesses whether the game is engaging enough. While
it is possible to automate some game testing elements, e.g.,
by using advanced vision models to detect graphical issues
automatically (Taesiri, Macklon, and Bezemer 2022), most
game testing aspects still require a human tester (Pascarella
et al. 2018). Two of the main challenges that prevent the
automation of game testing are the difficulty of automating
(1) knowledge about the game context and (2) common
sense reasoning (Politowski, Petrillo, and Guéhéneuc 2021).
Q: In the Grand Theft Auto V video game, the following sequence of
events happened:
(a) A person is parachuting in the air.
(b) A plane approaches the parachuter.
(c) The plane hits the cord and loses its right wing.
(d) The plane falls from the sky.
Which event is a bug?
A: The plane hitting the cord and losing its right wing is a bug.
Among (a) through (d), the answer is (c).

[Figure 1 also contains four video frames, panels (a)-(d), not reproduced here.]

Figure 1: An example of using a large language model to
detect a video game bug by classifying a sequence of events
in the Grand Theft Auto V video game (source: https://redd.it/2s5xon),
in which a collision between a plane and parachute cords
leads to the plane losing its right wing. The highlighted text
shows the response of the davinci model from the InstructGPT family.

Many video games rely on a physics engine that defines
the rules of the world in which the game is situated
(Millington 2007). For some games, there are sharp contrasts be-
tween the game world and the natural laws of the real world.
These differences make it hard to reason about events in
video games without knowing the game context. For exam-
ple, is it a bug that the player survives after falling from a
very high height? Answering such a question is impossible
without having knowledge about the target video game.
In this study, we propose using the game context knowl-
edge and common sense reasoning capabilities of large lan-
guage models (LLMs) to identify buggy events in video
games and classify their bug type. Recent revolutions in nat-
ural language processing (NLP) show that scaling up lan-
guage models is beneficial in many tasks (Vaswani et al.
2017; Devlin et al. 2018; Rae et al. 2021; Chowdhery et al.
2022; Thoppilan et al. 2022), such as few-shot and zero-shot
learning (Brown et al. 2020; Kojima et al. 2022). Having
been trained on very large datasets, LLMs have the poten-
tial to capture many details about topics in their training set,
including video games. Figure 1 shows an example of suc-
cessful bug detection by a language model.
We are the first to empirically evaluate the capability of
LLMs as zero-shot video game bug detectors. Our main con-
tributions are as follows:
1. We present the GameBugDescriptions dataset, the
first dataset of videos of game bugs with a step-by-
step textual description for bug detection purposes. This
dataset can serve as an out-of-distribution (OOD) chal-
lenge for LLMs.
2. We are the first to show that large language models have
promising capabilities to detect video game bugs.
3. We extensively evaluate the performance of two fami-
lies of large language models on the bug detection and
bug type classification tasks: InstructGPT (Ouyang et al.
2022) and OPT (Zhang et al. 2022).
4. We analyze the robustness of language models to differ-
ent descriptions of the same event for these tasks.
Our study demonstrates the promising capabilities of
LLMs to play an important role in the automation of the
game testing process.
Background and Related Work
Our work bridges the language modeling, video game, and
software engineering research communities. In this section,
we provide a brief overview of the relevant literature across
these disciplines, in particular, on large language models and
prompt engineering, and automated game testing.
Large Language Models and Prompt Engineering
The training objective in a language model is to learn a prob-
ability distribution over some text corpus. Such a simple
training objective combined with sufficient model scaling
can yield large language models that are successful even for
tasks for which the model was not explicitly trained (Ka-
plan et al. 2020; Brown et al. 2020; Chowdhery et al. 2022;
Thoppilan et al. 2022).
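For an autoregressive model, this objective is typically written as maximizing the likelihood of each token given its preceding context. The formula below is the standard next-token objective, included only for illustration; the notation is ours rather than taken from the cited works.

```latex
% Standard autoregressive language modeling objective:
% minimize the negative log-likelihood of each token given its prefix.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```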
Prompting or prompt engineering (Liu et al. 2021) is an
effective technique wherein we condition a language model
on a set of manually handcrafted (Schick and Schütze 2020;
Kojima et al. 2022) or automated (Gao, Fisch, and Chen
2020) templates to solve new tasks. That is, new tasks can
be solved by giving natural language instructions to a pre-
trained model without any further training, e.g., by provid-
ing sample reasoning steps to the model (Wei et al. 2022) in
a few-shot setting. Moreover, Kojima et al. (2022) showed
that even with a prompting technique as simple as adding
“Let’s think step by step” to the beginning of the answer, it is
possible to trigger reasoning in language models, which
leads to notable accuracy improvements on multiple bench-
marks in the zero-shot setting. Using graphical models, Dohan
et al. (2022) introduced a general formulation for prompted
models, enabling probabilistic programming with LLMs.
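To make the zero-shot prompting idea concrete, the short Python sketch below assembles such a prompt by seeding the answer with the reasoning trigger described by Kojima et al. (2022). The function name and the example question are illustrative and are not taken from the cited works.

```python
# Minimal sketch of zero-shot chain-of-thought prompting (Kojima et al. 2022):
# the question is wrapped in a Q/A template and the answer is seeded with a
# reasoning trigger, after which a language model would continue the text.

def zero_shot_cot_prompt(question: str,
                         trigger: str = "Let's think step by step.") -> str:
    """Return a prompt that nudges the model to reason before answering."""
    return f"Q: {question}\nA: {trigger}"


if __name__ == "__main__":
    question = ("In a racing game, a car drives through a solid wall "
                "without slowing down. Is this behavior a bug?")
    print(zero_shot_cot_prompt(question))
```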
Several successful applications of LLMs include program
synthesis (Jain et al. 2022), code generation (Chen et al.
2021), and chatbots (Thoppilan et al. 2022). However, we are
the first to apply LLMs to detect bugs in video games.
Automated Game Testing
As shown by prior work, automated game testing is chal-
lenging because game-specific knowledge and common
sense reasoning are required to detect and report bugs (Pas-
carella et al. 2018; Politowski, Petrillo, and Guéhéneuc
2021). The majority of prior work on automated game test-
ing focuses on methods to automatically play games, such
as heuristic search strategies (Keehl and Smith 2019). Auto-
mated play techniques using reinforcement learning or evo-
lutionary strategies (Zheng et al. 2019; Vinyals et al. 2019;
Berner et al. 2019; Justesen et al. 2019) allow the testing
of video games from different perspectives, such as playa-
bility, game balance, and even predicting user churn rate
(Roohi et al. 2020, 2021). However, these methods are of-
ten designed to maximize a certain reward function, which
might lead to progress in the game in an unintended manner
and even break the game’s rules or physics engine (Baker
et al. 2020; Clark and Amodei 2019). More importantly,
these methods do not have common sense reasoning.
Other prior work has leveraged computer vision and NLP
techniques for automated video game testing. Several stud-
ies have proposed approaches for graphical bug detection us-
ing deep learning (Ling, Tollmar, and Gisslén 2020; Taesiri,
Habibi, and Fazli 2020) or digital image processing (Mack-
lon et al. 2022). However, these approaches do not re-
quire common sense reasoning. For example, Macklon et al.
(2022) rely on graphical assets of the game as a test oracle.
Several other studies have proposed approaches to re-
trieve moments from gameplay videos based on text
queries (Zhang and Smith 2019; Taesiri, Macklon, and Beze-
mer 2022). However, to detect bugs with these approaches
the bug instance must be known in advance, and therefore
these gameplay event retrieval approaches do not allow for
automated detection of (previously undiscovered) bugs. Our
approach does not have this requirement and can therefore
be used to identify previously undiscovered bugs.
Finally, prior work has proposed NLP-based approaches
to automatically improve test case descriptions for manual
playtesting of games (Viggiato et al. 2022a,b), but we are
the first to leverage LLMs for bug detection in video games.
Bug Detection with Large Language Models
To automatically identify buggy events in a video game, we
propose using LLMs to reason about sequences of textual
descriptions of game events. We formulate the problem as a
question-answering (Q&A) task (Srivastava et al. 2022) for
LLMs. Here, we explain how we convert textual descriptions
of a sequence of game events into a multiple-choice question
and use a language model to identify the buggy event. In ad-
dition, we discuss how LLMs can assist us in classifying the bug type.
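As a concrete illustration of this formulation, the sketch below turns the event descriptions from Figure 1 into a multiple-choice prompt and sends it to an InstructGPT-style model. It assumes the legacy openai Python package (Completion endpoint); the model name, decoding parameters, and helper function are illustrative and do not reproduce the authors' exact pipeline.

```python
# Sketch of the Q&A formulation: turn an ordered list of textual event
# descriptions into a multiple-choice question and ask an LLM which event
# is buggy. Assumes the legacy `openai` package (pip install "openai<1.0")
# and an API key configured via the OPENAI_API_KEY environment variable.
import openai


def build_bug_question(game: str, events: list[str]) -> str:
    """Format the event sequence as a multiple-choice bug-detection question."""
    lines = [f"Q: In the {game} video game, the following sequence of events happened:"]
    for label, event in zip("abcdefghij", events):
        lines.append(f"({label}) {event}")
    lines.append("Which event is a bug?")
    lines.append("A:")  # the model continues the answer from here
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_bug_question(
        "Grand Theft Auto V",
        [
            "A person is parachuting in the air.",
            "A plane approaches the parachuter.",
            "The plane hits the cord and loses its right wing.",
            "The plane falls from the sky.",
        ],
    )
    # Model name and parameters are placeholders, not the paper's exact setup.
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=128,
        temperature=0,
    )
    print(response["choices"][0]["text"].strip())
```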