CS1QA: A Dataset for Assisting Code-based Question Answering in an
Introductory Programming Course
Changyoon Lee, Yeon Seonwoo, Alice Oh
School of Computing, KAIST
changyoon.lee@kaist.ac.kr, yeon.seonwoo@kaist.ac.kr
alice.oh@kaist.edu
Abstract

We introduce CS1QA, a dataset for code-based question answering in the programming education domain. CS1QA consists of 9,237 question-answer pairs gathered from chat logs in an introductory programming class using Python, and 17,698 unannotated chat sessions with code.¹ Each question is accompanied by the student’s code and the portion of the code relevant to answering the question. We carefully design the annotation process to construct CS1QA and analyze the collected dataset in detail. The tasks for CS1QA are to predict the question type, to select the relevant code snippet given the question and the code, and to retrieve an answer from the annotated corpus. Results for experiments on several baseline models are reported and thoroughly analyzed. The tasks for CS1QA challenge models to understand both the code and natural language. This unique dataset can be used as a benchmark for source code comprehension and question answering in the educational setting.
1 Introduction

Question answering (QA) studies systems that understand questions and the relevant context to provide answers. Question forms include single-document QA (Rajpurkar et al., 2016), multi-hop QA (Yang et al., 2018), conversational QA (Reddy et al., 2019), and open-domain QA (Kwiatkowski et al., 2019). Questions about specific domains are asked in NewsQA (Trischler et al., 2016) and TechQA (Castelli et al., 2020), and images are provided with the question in visual QA (Antol et al., 2015). Another interesting field of QA asks questions about source code (Liu and Wan, 2021).

A useful application of QA is in the educational domain. Asking questions and getting answers is an essential and efficient means of learning. In this paper, we focus on QA for programming education, where both the input modes and the domain pose interesting challenges. Answering these questions requires reading and understanding both source code and natural language questions. In addition, students’ questions are often complex, demanding a thorough understanding of the context, such as the intention and the educational goal, to answer them.

¹ The code and the data used in this paper can be found at https://github.com/cyoon47/CS1QA.

Figure 1: An example of our data tuple. Each data tuple consists of {question, answer, question type, code, relevant code lines}. We annotate the type of each question and the code lines (orange) relevant to the question.
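To make the data format in Figure 1 concrete, the sketch below shows what one CS1QA-style record could look like when loaded in Python; the field names, example values, and the snippet-extraction helper are illustrative assumptions, not the dataset’s actual serialization.

```python
# A hypothetical CS1QA-style record; field names and values are illustrative
# assumptions, not the dataset's actual serialization format.
example_record = {
    "question": "Does task 1 require printing each word on its own line?",
    "question_type": "Task",  # the "Task" type asks about task requirements (Section 3.2)
    "answer": "Yes, print one word per line instead of joining them with spaces.",
    "code": 'words = input().split()\nprint(" ".join(words))\n',
    "relevant_lines": [2],    # 1-indexed lines of `code` relevant to the question
}

# The relevant snippet can be recovered from the line annotations.
code_lines = example_record["code"].splitlines()
snippet = "\n".join(code_lines[i - 1] for i in example_record["relevant_lines"])
print(snippet)  # -> print(" ".join(words))
```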
Recently, models that understand programming languages (PL) have been studied and show promising results in diverse code comprehension tasks (Alon et al., 2018; Feng et al., 2020; Guo et al., 2021). However, these models have limitations in supporting question answering. They are not trained on datasets containing questions about the code and are not designed for QA tasks. Also, many assume fully functional code as input, while students’ code contains diverse syntax and logical errors and is often incomplete.
To address this issue, we introduce CS1QA, a new dataset with tasks for code-based question answering in programming education. Questions and answers about programming are collected from the
naturally occurring chat messages between students and TAs. The question type and the code snippet relevant to answering the question are also collected. The final CS1QA dataset consists of the question, question type, answer, and code annotated with relevant lines. The data is collected mostly in Korean and then machine-translated into English and quality-checked, for easy application to models pretrained in English. Figure 1 shows an example of our data. We also include two semesters’ worth of TA-student chat log data, consisting of 17,698 chat sessions and the corresponding code.
We design three tasks for the CS1QA dataset. The type classification task asks the model to predict the question type. The code line selection task asks the model to select lines of code that are relevant to answering the given question. The answer retrieval task finds a similar question that has already been answered and uses its answer as the answer to the given question. The outputs for these tasks can help the students debug their code and help the TAs spend less time and effort when answering the students’ questions.
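The three tasks can be summarized by the following input/output signatures; this is only a schematic Python sketch (the type and function names are ours, not part of the released code).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    question: str
    code: str         # the student's code at the time the question was asked
    answer: str = ""  # gold answer, available for training and for the retrieval corpus

# Task 1: question type classification -- map (question, code) to one type label.
def classify_type(example: Example) -> str: ...

# Task 2: code line selection -- return the code lines relevant to answering the question.
def select_lines(example: Example) -> List[int]: ...

# Task 3: answer retrieval -- find the most similar already-answered question in the
# annotated corpus and reuse its answer for the new question.
def retrieve_answer(example: Example, corpus: List[Example]) -> str: ...
```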
Finally, we implement and test baseline models, RoBERTa (Liu et al., 2019), CodeBERT (Feng et al., 2020) and XLM-RoBERTa (Conneau et al., 2020), on the type classification and code line selection tasks. The fine-tuned models achieve accuracies of up to 76.65% for the type classification task. The relatively low F1 score of 57.57% for the line selection task suggests that the task is challenging for current language models. We use DPR (Karpukhin et al., 2020) to retrieve the most similar question and its answer. We compare the retrieved answer with the gold-label answer and obtain a BLEU-1 score of 13.07, which shows the weak performance of answer retrieval on the CS1QA dataset. We also qualitatively evaluate the models’ behavior with different inputs for the first two tasks.
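As a rough illustration of how such a classification baseline can be set up, the snippet below loads a CodeBERT encoder for question-type prediction with the HuggingFace transformers library; the checkpoint name is the public one, but the label count, the question/code pairing, and the absence of a fine-tuning loop are simplifications, not a verbatim description of our training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public CodeBERT checkpoint; the classification head added here is randomly
# initialized, so meaningful predictions require fine-tuning on CS1QA first.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=10,  # illustrative label count, not the exact CS1QA type inventory
)

question = "Why does my loop never terminate?"
code = "while True:\n    x = input()"

# Question and code are encoded as a sentence pair and truncated to the model limit.
inputs = tokenizer(question, code, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_type_id = int(logits.argmax(dim=-1))
print(predicted_type_id)
```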
Our contributions are as follows:

• We present CS1QA, a dataset containing 9,237 question-answer-code triples from a programming course, annotated with question types and relevant code lines. The dataset’s contribution includes student-TA chat logs in a live classroom.

• We introduce three tasks, question type classification, code line selection and answer retrieval, that require models to comprehend the text and provide useful output for TAs and students when answering questions.

• We present the results of baseline models on the tasks. Models find the tasks in CS1QA challenging, and have much room for improvement in performance.
2 Related Work

Code-based Datasets
Recently, research dealing with large amounts of source code data has gained attention. Often, the source code data is collected ad hoc for the purpose of the research (Allamanis et al., 2018; Brockschmidt et al., 2018; Clement et al., 2020). Several datasets have been released to aid research in source code comprehension and to avoid repeated crawling and processing of source code data. These datasets serve as benchmarks for different tasks that test the ability to understand code. Such datasets include the ETH Py150 corpus (Raychev et al., 2016), CodeNN (Iyer et al., 2016), CodeSearchNet (Husain et al., 2020) and CodeQA (Liu and Wan, 2021). We compare these datasets with CS1QA in Table 1.

In an educational setting, students’ code presents different characteristics from the code in these datasets: 1) students’ code is often incomplete, 2) there are many errors in the code, 3) students’ code is generally longer than the code used in existing datasets, and 4) questions and answers from students and TAs provide important additional information. In CS1QA, we present a dataset more suited to the programming education context.
Source Code Comprehension
In the domain of machine learning and software engineering, understanding and representing source code using neural networks has become an important approach. Different approaches make use of different characteristics present in programming languages. One such characteristic is the rich syntactic information found in the source code’s abstract syntax tree (AST). Code2seq (Alon et al., 2018) passes paths in the AST through an encoder-decoder network to represent code. The graph structure of the AST has been exploited in other research for source code representation on downstream tasks such as variable misuse detection, code generation, natural language code search and program repair (Allamanis et al., 2018; Brockschmidt et al., 2018; Guo et al., 2021; Yasunaga and Liang, 2020). Source code text itself is used in models such as CodeBERT (Feng et al., 2020), CuBERT (Kanade et al., 2020) and DeepFix (Gupta et al., 2017) for use in tasks such as natural language code search, finding function-docstring mismatch and program repair.
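For readers unfamiliar with this syntactic view of code, the toy example below uses Python’s standard ast module to expose the tree that path-based encoders such as code2seq traverse; the flat walk shown here is only an illustration, not the actual path-extraction algorithm of those models.

```python
import ast

source = """
def add(a, b):
    return a + b
"""

tree = ast.parse(source)

# Print each node type with its direct children: a crude stand-in for the
# root-to-leaf AST paths that encoders such as code2seq consume.
for node in ast.walk(tree):
    children = [type(child).__name__ for child in ast.iter_child_nodes(node)]
    print(type(node).__name__, "->", children)
```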
Dataset | Programming Language | Data Format | Dataset Size | Data Source
ETH Py150 | Python | Parsed AST | 7.4M files | GitHub
CodeNN | C#, SQL | Title, question, answer | 187,000 pairs | StackOverflow
CodeSearchNet | Go, Java, JavaScript, PHP, Python, Ruby | Comment, code | 2M pairs | GitHub
CodeQA | Java, Python | Question, answer, code | 190,000 pairs | GitHub
CS1QA | Python | Chat log, question, answer, type, code | 9,237 pairs | Real-world classroom

Table 1: Comparison between different code-based datasets and CS1QA.
The tasks that these methods are trained on target expert software engineers and programmers, who can gain significant benefit from support by the model. On the other hand, students learning programming have different objectives and require support from models that fits those objectives. Rather than simply getting an answer quickly, students seek to understand and learn; they ask many questions while learning, and thus question answering for code is needed. CS1QA focuses on code-based question answering and can be used as training data and a benchmark for neural models in an educational setting. The CS1QA data can also be used for tasks other than QA, such as program repair and code search.
3 CS1QA Dataset

3.1 Data Source

The data for CS1QA is collected from an introductory programming course conducted online.² Students complete lab sessions consisting of several programming tasks, and students and TAs ask questions to each other using a synchronous chat feature. We make use of the chat logs as the source for the natural question and the corresponding answer. These chat logs are either in Korean or in English. The student’s code history is also stored for each programming task, for every keystroke the student makes. This allows us to extract the code status at the exact time the question is asked, which provides valuable context for the question. We take this code as the context for the given question. The thorough code history and the student-TA chat logs are a unique and important contribution of CS1QA. CS1QA also contributes data from multiple students working on the same set of problems.

² Elice: https://elice.io/
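Because a snapshot is stored after every keystroke, the code shown with a question can be recovered by taking the latest snapshot at or before the question’s timestamp. The sketch below assumes a hypothetical list of (timestamp, code) snapshots; the platform’s real storage format may differ.

```python
from bisect import bisect_right

# Hypothetical keystroke history: (unix_timestamp, full_code_text), sorted by time.
history = [
    (1000.0, "print('hi')"),
    (1007.5, "name = input()\nprint('hi', name)"),
    (1042.3, "name = input()\nprint('Hello,', name)"),
]

def code_at(question_time: float) -> str:
    """Return the stored code state at the moment the question was asked."""
    times = [t for t, _ in history]
    idx = bisect_right(times, question_time) - 1
    return history[idx][1] if idx >= 0 else ""

print(code_at(1010.0))  # -> the second snapshot
```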
3.2 Question Type Categorization
Answering different types of questions requires understanding different intentions and information: answering questions about errors requires identifying the erroneous code, while answering questions about algorithms requires understanding the overall program flow. As the different question types affect the answering approach and the location of code to look at, knowing them in advance can be beneficial in the QA and code selection tasks.
Allamanis and Sutton (2013) have categorized questions asking for help in coding on Stack Overflow into five types. We adapt these types to students’ questions. In addition, we define the “Task” type, which asks about the requirements of the task. TAs’ question types are derived from the official instructions given by the course instructors at the beginning of the semester. TAs were instructed to ask questions that gauge students’ understanding of their implementation, for example by asking the meaning of the code and the reasoning behind the implementation. TAs’ probing questions are categorized into five types: Comparison, Reasoning, Explanation, Meaning, and Guiding. Examples of the question types can be found in Table 2. We present the intentions of the question types in Table 3.
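For reference, the five TA-side labels named above can be written down as a simple enumeration; the student-side labels (the adapted Stack Overflow types plus Task) are listed in full in Table 2 and are omitted from this sketch.

```python
from enum import Enum

class TAQuestionType(Enum):
    """The five TA probing-question types described in Section 3.2."""
    COMPARISON = "Comparison"
    REASONING = "Reasoning"
    EXPLANATION = "Explanation"
    MEANING = "Meaning"
    GUIDING = "Guiding"
```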
3.3 Collecting Question-Answer Pairs with Question Types

We collected a total of 5,565 chat logs over the course of one semester, from 474 students and 47 TAs. After removing the logs where the TA did not participate in the chat, 4,883 chat logs remained. We employed crowdworkers with self-reported skill in Python of three or higher on a 5-point Likert scale to collect the questions. Each worker first selected the messages in the chat log corresponding to the question and the answer, then selected the question type. Workers were provided with descriptions of the question types, with examples, before working on the task. Workers were asked to divide the message into individual questions when there were multiple questions or answers in the message. They were instructed to only choose programming-related questions, for which the answer is obvious in the chat from the question alone. This ensures that