naturally occurring chat messages between students
and TAs. The question type and the code snippet
relevant to answering the question are also col-
lected. The final CS1QA dataset consists of ques-
tion, question type, answer, and code annotated
with relevant lines. The data is collected mostly
in Korean and then machine-translated into English
and quality-checked, so that it can be readily
applied to models pretrained in English. Figure 1
shows an example of our data. We also include
two semesters' worth
of TA-student chat log data consisting of 17,698
chat sessions and the corresponding code.
We design three tasks for the CS1QA dataset.
The type classification task asks the model to
predict the question type. The code line selection
task asks the model to select lines of code that
are relevant to answering the given question. The
answer retrieval task finds a similar question that
has already been answered and uses its answer as
the answer to the given question. The
outputs for these tasks can help the students debug
their code and the TAs spend less time and effort
when answering the students’ questions.
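As a deliberately simplistic illustration of the code line selection task (a toy heuristic with invented names, not the paper's approach — the actual baselines are pretrained transformers), one could score each code line by its token overlap with the question:

```python
def select_lines(question, code_lines, top_k=2):
    """Toy heuristic: rank code lines by word overlap with the question.

    Illustrative only -- the paper's baselines are pretrained
    transformers (RoBERTa, CodeBERT), not this overlap score.
    """
    q_tokens = set(question.lower().split())
    scored = []
    for idx, line in enumerate(code_lines):
        tokens = set(line.lower().replace("(", " ").replace(")", " ").split())
        scored.append((len(q_tokens & tokens), idx))
    # Highest overlap first; ties broken by line order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in scored[:top_k] if score > 0]

question = "Why does my average function divide by zero?"
code = [
    "def average(nums):",
    "    total = sum(nums)",
    "    return total / len(nums)",
    "print(average([]))",
]
print(select_lines(question, code))  # indices of candidate lines
```

Even on this toy example the lexical heuristic misses the actual division line, which hints at why the task demands genuine code comprehension rather than surface matching.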
Finally, we implement and test baseline mod-
els, RoBERTa (Liu et al., 2019), CodeBERT (Feng
et al., 2020) and XLM-RoBERTa (Conneau et al.,
2020), on the type classification and code line selec-
tion tasks. The finetuned models achieve accuracies
up to 76.65% for the type classification task. The
relatively low F1 score of 57.57% for the line
selection task suggests that the task is challenging for
current language models. We use DPR (Karpukhin
et al., 2020) to retrieve the most similar question
and its answer. Comparing the retrieved answer
with the gold label answer yields a BLEU-1 score
of 13.07, which indicates that answer retrieval
performs poorly on the CS1QA dataset. We also
qualitatively evaluate the models' behavior with
different inputs for the first two tasks.
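For reference, BLEU-1 (used above to compare the retrieved answer against the gold answer) is clipped unigram precision scaled by a brevity penalty. A minimal sketch of the standard single-reference definition, not code from the paper's evaluation:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """BLEU-1: clipped unigram precision times a brevity penalty.

    Standard single-reference definition; shown for clarity, not taken
    from the paper's evaluation code.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate unigram's count at its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("use a for loop here", "you should use a while loop"), 3))
```

A BLEU-1 of 13.07 (on a 0-100 scale) thus means the retrieved answers share relatively few unigrams with the gold answers.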
Our contributions are as follows:
• We present CS1QA, a dataset containing
9,237 question-answer-code triples from a
programming course, annotated with question
types and relevant code lines. The dataset
also includes student-TA chat logs from a
live classroom.
• We introduce three tasks, question type
classification, code line selection and answer
retrieval, that require models to comprehend
the text and provide useful output for TAs
and students when answering questions.
• We present the results of baseline models
on the tasks. The models find the tasks in
CS1QA challenging, leaving much room for
improvement in performance.
2 Related Work
Code-based Datasets
Recently, research dealing
with large amounts of source code data has
gained attention. Often, the source code data
is collected ad hoc for the purpose of the
research (Allamanis et al., 2018; Brockschmidt
et al., 2018; Clement et al., 2020). Several
datasets have been
released to aid research in source code comprehen-
sion, and avoid repeated crawling and processing
of source code data. These datasets serve as bench-
marks for different tasks that test the ability to un-
derstand code. Such datasets include: ETH Py150
corpus (Raychev et al., 2016), CodeNN (Iyer et al.,
2016), CodeSearchNet (Husain et al., 2020) and
CodeQA (Liu and Wan, 2021). We compare these
datasets with CS1QA in Table 1.
In an educational setting, students’ code presents
different characteristics from code in these datasets:
1) students’ code is often incomplete, 2) there
are many errors in the code, 3) students’ code
is generally longer than code used in existing
datasets, and 4) questions and answers from stu-
dents and TAs provide important additional in-
formation. In CS1QA, we present a dataset more
suited for the programming education context.
Source Code Comprehension
In the domain of
machine learning and software engineering, under-
standing and representing source code using neu-
ral networks has become an important approach.
Different approaches make use of different charac-
teristics present in programming languages. One
such characteristic is the rich syntactic informa-
tion found in the source code’s abstract syntax tree
(AST). Code2seq (Alon et al.,2018) passes paths
in the AST through an encoder-decoder network
to represent code. The graph structure of AST has
been exploited in other research for source code
representation on downstream tasks such as vari-
able misuse detection, code generation, natural lan-
guage code search and program repair (Allamanis
et al., 2018; Brockschmidt et al., 2018; Guo et al.,
2021; Yasunaga and Liang, 2020). Source code text
itself is used in models such as CodeBERT (Feng
et al., 2020), CuBERT (Kanade et al., 2020) and
DeepFix (Gupta et al.,2017) for use in tasks such