Recent advances in natural language processing (NLP) show that scaling up lan-
guage models is beneficial in many tasks (Vaswani et al.
2017; Devlin et al. 2018; Rae et al. 2021; Chowdhery et al.
2022; Thoppilan et al. 2022), such as few-shot and zero-shot
learning (Brown et al. 2020; Kojima et al. 2022). Having
been trained on very large datasets, LLMs have the poten-
tial to capture many details about topics in their training set,
including video games. Figure 1 shows an example of suc-
cessful bug detection by a language model.
We are the first to empirically evaluate the capability of
LLMs as zero-shot video game bug detectors. Our main con-
tributions are as follows:
1. We present the GameBugDescriptions dataset, the
first dataset of videos of game bugs with step-by-step textual descriptions for bug detection purposes. This
dataset can serve as an out-of-distribution (OOD) chal-
lenge for LLMs.
2. We are the first to show that large language models have
promising capabilities to detect video game bugs.
3. We extensively evaluate the performance of two fami-
lies of large language models on the bug detection and
bug type classification tasks: InstructGPT (Ouyang et al.
2022) and OPT (Zhang et al. 2022).
4. We analyze the robustness of language models to differ-
ent descriptions of the same event for these tasks.
Our study demonstrates the promising capabilities of
LLMs to play an important role in the automation of the
game testing process.
Background and Related Work
Our work bridges the language modeling, video game, and
software engineering research communities. In this section,
we provide a brief overview of the relevant literature across
these disciplines, in particular on large language models, prompt engineering, and automated game testing.
Large Language Models and Prompt Engineering
The training objective of a language model is to learn a probability distribution over a text corpus. Such a simple training objective, combined with sufficient model scaling, can yield large language models that succeed even on tasks for which they were not explicitly trained (Kaplan et al. 2020; Brown et al. 2020; Chowdhery et al. 2022; Thoppilan et al. 2022).
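As an illustration, the standard autoregressive formulation used by GPT-style models trains the parameters $\theta$ to maximize the log-likelihood of each token given its preceding context,
\[
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right),
\]
where $x = (x_1, \dots, x_T)$ is a token sequence drawn from the training corpus; other objectives (e.g., masked language modeling) are variations on the same idea.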
Prompting or prompt engineering (Liu et al. 2021) is an
effective technique wherein we condition a language model
on a set of manually handcrafted (Schick and Schütze 2020;
Kojima et al. 2022) or automated (Gao, Fisch, and Chen
2020) templates to solve new tasks. That is, new tasks can
be solved by giving natural language instructions to a pre-
trained model without any further training, e.g., by provid-
ing sample reasoning steps to the model (Wei et al. 2022) in
a few-shot setting. Moreover, Kojima et al. (2022) showed that a prompting technique as simple as adding “Let’s think step by step” to the beginning of the answer can trigger reasoning in language models, leading to substantial accuracy improvements on multiple benchmarks in the zero-shot setting. Using graphical models, Dohan et al. (2022) introduced a general formulation for prompted models, enabling probabilistic programming with LLMs.
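To make this concrete, the sketch below shows how such a zero-shot chain-of-thought prompt can be assembled; `query_llm` is a hypothetical placeholder for whatever instruction-tuned LLM endpoint is available (e.g., InstructGPT or OPT), not an API of a specific library.

```python
# Minimal sketch of zero-shot chain-of-thought prompting (Kojima et al. 2022).
# NOTE: query_llm is a hypothetical helper; wire it to any instruction-tuned
# LLM that maps a prompt string to a text completion.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("Connect this to an LLM API of your choice.")


def zero_shot_cot(question: str) -> str:
    """Answer a question by first eliciting step-by-step reasoning."""
    # Stage 1: seed the answer with the cue that triggers reasoning.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = query_llm(reasoning_prompt)

    # Stage 2: ask the model to distill its reasoning into a final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return query_llm(answer_prompt)
```

The two-stage structure mirrors Kojima et al. (2022): the first call elicits free-form reasoning, and the second call asks the model to condense that reasoning into a final answer.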
Several successful applications of LLMs include program
synthesis (Jain et al. 2022), code generation (Chen et al.
2021), and chatbots (Thoppilan et al. 2022). However, we are
the first to apply LLMs to detect bugs in video games.
Automated Game Testing
As shown by prior work, automated game testing is chal-
lenging because game-specific knowledge and common
sense reasoning are required to detect and report bugs (Pas-
carella et al. 2018; Politowski, Petrillo, and Guéhéneuc
2021). The majority of prior work on automated game test-
ing focuses on methods to automatically play games, such
as heuristic search strategies (Keehl and Smith 2019). Auto-
mated play techniques using reinforcement learning or evo-
lutionary strategies (Zheng et al. 2019; Vinyals et al. 2019;
Berner et al. 2019; Justesen et al. 2019) allow the testing
of video games from different perspectives, such as playa-
bility, game balance, and even predicting user churn rate
(Roohi et al. 2020, 2021). However, these methods are of-
ten designed to maximize a certain reward function, which can cause the agent to progress through the game in unintended ways and even break the game’s rules or physics engine (Baker et al. 2020; Clark and Amodei 2019). More importantly, these methods lack common sense reasoning.
Other prior work has leveraged computer vision and NLP
techniques for automated video game testing. Several stud-
ies have proposed approaches for graphical bug detection us-
ing deep learning (Ling, Tollmar, and Gisslén 2020; Taesiri,
Habibi, and Fazli 2020) or digital image processing (Mack-
lon et al. 2022). However, these approaches do not incorporate common sense reasoning; for example, Macklon et al.
(2022) rely on graphical assets of the game as a test oracle.
Several other studies have proposed approaches to re-
trieve moments from gameplay videos based on text
queries (Zhang and Smith 2019; Taesiri, Macklon, and Beze-
mer 2022). However, to detect bugs with these approaches, the bug instance must be known in advance, and therefore
these gameplay event retrieval approaches do not allow for
automated detection of (previously undiscovered) bugs. Our
approach does not have this requirement and can therefore
be used to identify previously undiscovered bugs.
Finally, prior work has proposed NLP-based approaches
to automatically improve test case descriptions for manual
playtesting of games (Viggiato et al. 2022a,b), but we are
the first to leverage LLMs for bug detection in video games.
Bug Detection with Large Language Models
To automatically identify buggy events in a video game, we
propose using LLMs to reason about sequences of textual
descriptions of game events. We formulate the problem as a
question-answering (Q&A) task (Srivastava et al. 2022) for
LLMs. Here, we explain how we convert textual descriptions
of a sequence of game events into a multiple-choice question
and use a language model to identify the buggy event. In ad-
dition, we discuss how LLMs can assist us to classify the bug