naturally occurring chat messages between students
and TAs. The question type and the code snippet
relevant to answering the question are also col-
lected. The final CS1QA dataset consists of ques-
tion, question type, answer, and code annotated
with relevant lines. The data is collected mostly
in Korean and then machine-translated into English
and quality-checked, so that it can be readily
applied to models pretrained in English. Figure 1
shows an example of our data. We also include
two semesters' worth
of TA-student chat log data consisting of 17,698
chat sessions and the corresponding code.
We design three tasks for the CS1QA dataset.
The type classification task asks the model to
predict the question type. The code line selection
task asks the model to select lines of code that
are relevant to answering the given question. The
answer retrieval task finds a similar question that
has already been answered and uses its answer as
the answer to the given question. The
outputs for these tasks can help the students debug
their code and the TAs spend less time and effort
when answering the students’ questions.
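As a deliberately simplistic illustration of the code line selection task (a toy heuristic with invented names, not the paper's approach — the actual baselines are pretrained transformers), one could score each code line by its token overlap with the question:

```python
def select_lines(question, code_lines, top_k=2):
    """Toy heuristic: rank code lines by word overlap with the question.

    Illustrative only -- the paper's baselines are pretrained
    transformers (RoBERTa, CodeBERT), not this overlap score.
    """
    q_tokens = set(question.lower().split())
    scored = []
    for idx, line in enumerate(code_lines):
        tokens = set(line.lower().replace("(", " ").replace(")", " ").split())
        scored.append((len(q_tokens & tokens), idx))
    # Highest overlap first; ties broken by line order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in scored[:top_k] if score > 0]

question = "Why does my average function divide by zero?"
code = [
    "def average(nums):",
    "    total = sum(nums)",
    "    return total / len(nums)",
    "print(average([]))",
]
print(select_lines(question, code))  # indices of candidate lines
```

Even on this toy example the lexical heuristic misses the actual division line, which hints at why the task demands genuine code comprehension rather than surface matching.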
Finally, we implement and test baseline mod-
els, RoBERTa (Liu et al., 2019), CodeBERT (Feng
et al., 2020) and XLM-RoBERTa (Conneau et al.,
2020), on the type classification and code line selec-
tion tasks. The finetuned models achieve accuracies
up to 76.65% for the type classification task. The
relatively low F1 score of 57.57% for the line
selection task suggests that the task is challenging for
current language models. We use DPR (Karpukhin
et al., 2020) to retrieve the most similar question
and its answer. Comparing the retrieved answer
with the gold label answer yields a BLEU-1 score
of 13.07, which indicates that answer retrieval
performs poorly on the CS1QA dataset. We also
qualitatively evaluate the models' behavior with
different inputs for the first two tasks.
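For reference, BLEU-1 (used above to compare the retrieved answer against the gold answer) is clipped unigram precision scaled by a brevity penalty. A minimal sketch of the standard single-reference definition, not code from the paper's evaluation:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """BLEU-1: clipped unigram precision times a brevity penalty.

    Standard single-reference definition; shown for clarity, not taken
    from the paper's evaluation code.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate unigram's count at its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("use a for loop here", "you should use a while loop"), 3))
```

A BLEU-1 of 13.07 (on a 0-100 scale) thus means the retrieved answers share relatively few unigrams with the gold answers.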
Our contributions are as follows:
• We present CS1QA, a dataset containing
9,237 question-answer-code triples from a
programming course, annotated with question
types and relevant code lines. The dataset
also includes student-TA chat logs from a
live classroom.
• We introduce three tasks, question type
classification, code line selection and answer
retrieval, that require models to comprehend
the text and provide useful output for TAs
and students when answering questions.
• We present the results of baseline models
on the tasks. The models find the tasks in
CS1QA challenging, leaving much room for
improvement in performance.
2 Related Work
Code-based Datasets
Recently, research dealing
with large amounts of source code data has
gained attention. Often, the source code data
is collected ad hoc for the purpose of the
research (Allamanis et al., 2018; Brockschmidt
et al., 2018; Clement et al., 2020). Several
datasets have been
released to aid research in source code comprehen-
sion, and avoid repeated crawling and processing
of source code data. These datasets serve as bench-
marks for different tasks that test the ability to un-
derstand code. Such datasets include: ETH Py150
corpus (Raychev et al., 2016), CodeNN (Iyer et al.,
2016), CodeSearchNet (Husain et al., 2020) and
CodeQA (Liu and Wan, 2021). We compare these
datasets with CS1QA in Table 1.
In an educational setting, students’ code presents
different characteristics from code in these datasets:
1) students’ code is often incomplete, 2) there
are many errors in the code, 3) students’ code
is generally longer than code used in existing
datasets, and 4) questions and answers from stu-
dents and TAs provide important additional in-
formation. In CS1QA, we present a dataset more
suited for the programming education context.
Source Code Comprehension
In the domain of
machine learning and software engineering, under-
standing and representing source code using neu-
ral networks has become an important approach.
Different approaches make use of different charac-
teristics present in programming languages. One
such characteristic is the rich syntactic informa-
tion found in the source code’s abstract syntax tree
(AST). Code2seq (Alon et al.,2018) passes paths
in the AST through an encoder-decoder network
to represent code. The graph structure of AST has
been exploited in other research for source code
representation on downstream tasks such as vari-
able misuse detection, code generation, natural lan-
guage code search and program repair (Allamanis
et al., 2018; Brockschmidt et al., 2018; Guo et al.,
2021; Yasunaga and Liang, 2020). Source code text
itself is used in models such as CodeBERT (Feng
et al., 2020), CuBERT (Kanade et al., 2020) and
DeepFix (Gupta et al.,2017) for use in tasks such