Multifaceted Hierarchical Report Identification for
Non-Functional Bugs in Deep Learning Frameworks
Guoming Long
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
g.long@lboro.ac.uk
Tao Chen∗
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
t.t.chen@lboro.ac.uk
Georgina Cosma
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
g.cosma@lboro.ac.uk
∗Corresponding author
Abstract—Non-functional bugs (e.g., performance- or
accuracy-related bugs) in Deep Learning (DL) frameworks can
lead to some of the most devastating consequences. Reporting
those bugs on a repository such as GitHub is a standard route
to fix them. Yet, given the growing number of new GitHub
reports for DL frameworks, it is intrinsically difficult for
developers to distinguish those that reveal non-functional bugs
from the others, and assign them to the right contributor
for investigation in a timely manner. In this paper, we propose
MHNurf — an end-to-end tool for automatically identifying
non-functional bug related reports in DL frameworks. The core
of MHNurf is a Multifaceted Hierarchical Attention Network
(MHAN) that tackles three unaddressed challenges: (1) learning
the semantic knowledge, but doing so by (2) considering the
hierarchy (e.g., words/tokens in sentences/statements) and
focusing on the important parts (i.e., words, tokens, sentences,
and statements) of a GitHub report, while (3) independently
extracting information from different types of features, i.e.,
content, comment, code, command, and label.
To evaluate MHNurf, we leverage 3,721 GitHub reports from
five DL frameworks for conducting experiments. The results show
that MHNurf works the best with a combination of content,
comment, and code, which considerably outperforms the classic
HAN where only the content is used. MHNurf also produces
significantly more accurate results than nine other state-of-the-
art classifiers with strong statistical significance, i.e., up to 71%
AUC improvement, and achieves the best Scott-Knott rank on four frameworks while ranking 2nd on the remaining one. To facilitate repro-
duction and promote future research, we have made our dataset,
code, and detailed supplementary results publicly available at:
https://github.com/ideas-labo/APSEC2022-MHNurf.
Index Terms—Bug Report Analysis, Deep Learning, Natural
Language Processing, Software Maintenance, Performance Bug
I. INTRODUCTION
Deep learning (DL), which is a class of machine intelligence algorithms that mimic the workings of the human brain in processing data [1], has been gaining momentum in both
academia and industry [2, 3, 4, 5, 6]. As such, several
well-known DL frameworks (e.g., TensorFlow, Keras, and PyTorch) have been created and maintained on GitHub, aiming to provide effective and readily available APIs for seamlessly adopting DL algorithms to real-world problems.
Despite the success of DL frameworks, they inevitably con-
tain bugs, which, if left unfixed, would propagate issues to any
applications that were built on top of them [7]. Among other
bugs, there exist non-functional bugs that have no explicit symptoms of exceptions (such as a Not-a-Number error or a program crash), i.e., they cannot be judged using a precise oracle. Common examples of non-functional bugs are performance- or accuracy-related bugs (which are the focus of this work): from the perspective of the DL frameworks, it is typically hard to determine, without thorough investigation, how “slow” or how “inaccurate” the results must be to count as a bug, and therefore such bugs are more challenging to analyze. However, those non-functional
bugs tend to cause some of the most devastating outcomes
and hence are of great concern [8, 9]. Indeed, according to
the U.S. National Transportation Safety Board (NTSB), a recent accident involving Uber’s self-driving car was caused by a non-functional bug in its DL framework, which classified the pedestrian as an unknown object and reacted slowly1.
To deal with bugs, it is standard Software Engineering practice for DL frameworks to allow users to submit a report on repositories like GitHub, which is then assigned to a contributor for formal investigation and an attempt to fix the bug, if any [10]. Identifying whether a report is related to a non-functional bug (among its functional counterparts) is a labor-intensive process. This is because, firstly, the number of new reports increases dramatically. For example, there are around 700 new GitHub reports for TensorFlow per month on average2, including bug-related ones and those for other purposes, such as feature requests and help seeking. Secondly, GitHub reports can be lengthy, e.g., there can be up to 332 sentences per
report on average [11]. Finally, given the vague nature of non-functional bugs, it is fundamentally difficult to understand whether the related reports really reflect bugs. The above means that, when
assigning or prioritizing the GitHub reports, it can take a long
time for developers to read and understand the bug reports,
hence delaying the potential fixes to the destructive non-
functional bugs, especially when some of the key messages
are deeply hidden inside.
In light of the above, the problem we focus on in this paper
is the following: given a GitHub report for a DL framework,
1https://tinyurl.com/ykufbpey.
2https://github.com/tensorflow/tensorflow/pulse/monthly