Multifaceted Hierarchical Report Identification for
Non-Functional Bugs in Deep Learning Frameworks
Guoming Long
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
g.long@lboro.ac.uk
Tao Chen
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
t.t.chen@lboro.ac.uk
Georgina Cosma
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
g.cosma@lboro.ac.uk
Abstract—Non-functional bugs (e.g., performance- or
accuracy-related bugs) in Deep Learning (DL) frameworks can
lead to some of the most devastating consequences. Reporting
those bugs on a repository such as GitHub is a standard route
to fix them. Yet, given the growing number of new GitHub
reports for DL frameworks, it is intrinsically difficult for
developers to distinguish those that reveal non-functional bugs
among the others, and assign them to the right contributor
for investigation in a timely manner. In this paper, we propose
MHNurf — an end-to-end tool for automatically identifying
non-functional bug related reports in DL frameworks. The core
of MHNurf is a Multifaceted Hierarchical Attention Network
(MHAN) that tackles three unaddressed challenges: (1) learning
the semantic knowledge, but doing so by (2) considering the
hierarchy (e.g., words/tokens in sentences/statements) and
focusing on the important parts (i.e., words, tokens, sentences,
and statements) of a GitHub report, while (3) independently
extracting information from different types of features, i.e.,
content, comment, code, command, and label.
To evaluate MHNurf, we leverage 3,721 GitHub reports from
five DL frameworks for conducting experiments. The results show
that MHNurf works the best with a combination of content,
comment, and code, which considerably outperforms the classic
HAN where only the content is used. MHNurf also produces
significantly more accurate results than nine other state-of-the-art
classifiers with strong statistical significance, i.e., up to 71%
AUC improvement, and it achieves the best Scott-Knott rank on four
frameworks while ranking 2nd on the remaining one. To facilitate
reproduction and promote future research, we have made our dataset,
code, and detailed supplementary results publicly available at:
https://github.com/ideas-labo/APSEC2022-MHNurf.
Index Terms—Bug Report Analysis, Deep Learning, Natural
Language Processing, Software Maintenance, Performance Bug
I. INTRODUCTION
Deep learning (DL), a kind of machine intelligence algorithm that
mimics the workings of the human brain in processing data [1], has
been gaining momentum in both academia and industry [2, 3, 4, 5, 6].
As such, several well-known DL frameworks (e.g., TensorFlow, Keras,
and PyTorch) were created and are maintained on GitHub, aiming to
provide effective and readily available APIs for seamlessly adopting
DL algorithms in real-world problems.
Despite the success of DL frameworks, they inevitably con-
tain bugs, which, if left unfixed, would propagate issues to any
applications that were built on top of them [7]. Among other
bugs, there exist non-functional bugs that have no explicit
symptoms of exceptions (such as a Not-a-Number error or the
program crashes), i.e., they cannot be judged by using a precise
oracle. For instance, common examples of non-functional bugs are
performance- or accuracy-related bugs (the focus of this work): from
the perspective of the DL framework, it is typically hard to judge
how "slow" or how "inaccurate" a result must be to count as a bug
without thorough investigation, and hence such bugs are more
challenging to analyze. However, those non-functional
bugs tend to cause some of the most devastating outcomes
and hence are of great concern [8, 9]. Indeed, according to
the U.S. National Transportation Safety Board (NTSB), the
recent accident of Uber’s self-driving car was caused by a
non-functional bug of their DL framework, which classified
the pedestrian as an unknown object with a slow reaction1.
To deal with bugs, it is a normal Software Engineering
practice for DL frameworks to allow users to submit a report
on repositories like GitHub, which would then be assigned to
a contributor for formal investigation with an attempt to fix the
bug, if any [10]. Identifying whether a report is non-functional
bug related (among its functional counterparts) is a labor-intensive
process. This is because, firstly, the number of new reports
increases dramatically: for example, there are around 700 new GitHub
reports per month for TensorFlow on average2,
including bugs related ones and those for other purposes, such
as feature requests and help seeking. Secondly, GitHub reports can
be lengthy, e.g., up to 332 sentences per report on average [11].
Finally, given the vague nature of non-functional bugs, it is
fundamentally difficult to understand whether the related reports
really reflect bugs. The above means that, when
assigning or prioritizing the GitHub reports, it can take a long
time for developers to read and understand the bug reports,
hence delaying the potential fixes to the destructive non-
functional bugs, especially when some of the key messages
are deeply hidden inside.
1https://tinyurl.com/ykufbpey
2https://github.com/tensorflow/tensorflow/pulse/monthly
arXiv:2210.01855v1 [cs.SE] 4 Oct 2022
In light of the above, the problem we focus on in this paper
is the following: given a GitHub report for a DL framework,
can we automatically learn and identify whether it is a non-
functional bug related report? Indeed, many existing classifiers
on bug report identification can be directly applied. For
example, those that identify a particular type of bug report [12]
(e.g., long-lived bugs); those that predict whether a bug report
is bug-related [13]; and those that classify reports based on
labels [14, 15, 16]. However, in addition to the fact that these
works do not target DL frameworks, they fail to handle some or all
of the following challenges, which are important for report
identification:
• Semantics matter: Depending on the context, the same words or
code tokens in a GitHub report can have different meanings. Existing
classifiers that use statistical learning algorithms [12] could fail
to handle this polysemy.
• Multiple types of features exist: While most existing classifiers
consider the content (title and description) of a GitHub report
[12, 13, 14, 15, 16], other types of features may also provide
useful information, such as the accumulated comments made by the
participants before a contributor is assigned to the report.
Further, the mix of code and natural language in a report can pose
an additional challenge.
• Not all parts are equally relevant: Given a lengthy GitHub report,
not all of the words and sentences are important for identifying
non-functional bug related reports. Yet, existing work has often
ignored this fact [14, 15, 16].
In this paper, we propose Multifaceted Hierarchical Non-functional
Bug Report Identification, dubbed MHNurf — an end-to-end tool for
automatically identifying non-functional bug related reports for DL
frameworks. Its core component is the Multifaceted Hierarchical
Attention Network (MHAN), newly proposed in this work, which extends
the Hierarchical Attention Network (HAN) [17].
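The hierarchical attention idea that HAN (and hence MHAN) builds on can be sketched in a few lines: word vectors are pooled into sentence vectors, and sentence vectors into a document vector, each time via attention scores against a context vector. The NumPy sketch below is illustrative only; it omits the GRU encoders and learned projections of the real architecture, and the context vectors here are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(states, context):
    """Score each state against a context vector, normalise the scores
    with softmax, and return the weighted sum plus the weights."""
    weights = softmax(states @ context)       # (n,)
    return weights @ states, weights          # (d,), (n,)

def han_document_vector(sentences, word_ctx, sent_ctx):
    """Two-level attention pooling: words -> sentence vectors ->
    document vector. `sentences` is a list of (n_words, d) matrices."""
    sent_vecs = np.stack([attention_pool(s, word_ctx)[0] for s in sentences])
    return attention_pool(sent_vecs, sent_ctx)

# Toy input: two "sentences" of word embeddings with dimension 4.
rng = np.random.default_rng(0)
sents = [rng.normal(size=(3, 4)), rng.normal(size=(5, 4))]
doc, w = han_document_vector(sents, rng.normal(size=4), rng.normal(size=4))
print(doc.shape)          # (4,)
print(round(w.sum(), 6))  # 1.0 -- sentence weights form a distribution
```

The attention weights expose which sentences dominated the document vector, which is what allows such a model to "focus on the important parts" of a report rather than treating all sentences equally.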
Contributions: To better identify non-functional bug re-
lated reports for DL frameworks, our contributions include:
• MHNurf learns the semantic knowledge of GitHub reports by
considering their hierarchy and discriminating important from
unimportant parts.
• The MHAN in MHNurf considers the multifacetedness of GitHub
reports, i.e., it learns up to five types of feature (content,
comment, code, command, and label) independently.
• Using a dataset of 3,721 GitHub reports from five DL frameworks
(i.e., TensorFlow, PyTorch, Keras, MXNet, and Caffe), the
experimental results confirm that the title and description, which
are fundamental parts of a GitHub report, need to be considered
together as part of the content feature.
• We also found that the combination of content, comment, and code
gives the best result in MHNurf. Notably, the multifacetedness in
MHNurf also considerably improves the prediction against the
vanilla HAN in MHNurf.
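One possible reading of "learning five types of feature independently" is that each facet yields its own document vector, and the facet vectors are then combined for the final decision. The sketch below shows a hypothetical combination step (concatenation plus a linear layer with a sigmoid); the paper's actual combination mechanism is not specified in this section and may differ.

```python
import numpy as np

def combine_facets(facet_vectors, weight, bias):
    """Concatenate independently learned facet vectors (e.g., content,
    comment, code) and apply a linear classifier with a sigmoid to get
    the probability that a report is non-functional bug related.
    All weights here are random stand-ins, not learned parameters."""
    x = np.concatenate(facet_vectors)
    return 1.0 / (1.0 + np.exp(-(weight @ x + bias)))

rng = np.random.default_rng(1)
content_v, comment_v, code_v = (rng.normal(size=8) for _ in range(3))
p = combine_facets([content_v, comment_v, code_v], rng.normal(size=24), 0.0)
print(0.0 < p < 1.0)  # True -- a valid probability
```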
Fig. 1: An example of a GitHub report from TensorFlow, with its
title, description, label, code, command, and comment highlighted.
• By comparing with nine state-of-the-art classifiers, we show that
MHNurf is significantly more accurate in general (up to 71% AUC
improvement). It is also ranked 1st for four out of the five DL
frameworks and 2nd for the remaining one, according to the
Scott-Knott test [18]. We also conduct a qualitative analysis of
why MHNurf can perform better.
II. PROBLEM CONTEXT AND CHALLENGES
A. Context
DL frameworks hosted on GitHub allow participants and users to
submit issue reports, whose purposes include, but are not limited
to, bugs, pull requests, feature requests, and others.
Among those, the GitHub reports that are bug-related are often
of high importance to the community, especially the non-
functional bug related reports. Here, we distinguish two types
of users/developers on GitHub:
• Contributors: who are assigned to a report so that the formal
investigation starts.
• Participants: who have not been assigned to a report, but are
free to make comments on it.
Most commonly, after assignments, those GitHub reports
need to be reviewed by the contributors who will pick the
most important ones to investigate and fix when necessary.
As shown in Figure 1, apart from the normal content (title and
description), a GitHub report will likely be commented on by
different participants before being assigned to a contributor.
Further, a label may also be added by an automatic bot or by a
contributor based on his/her first impression. A formal
investigation begins once the report is assigned to someone.
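As a toy illustration of separating the feature types visible in Figure 1, the sketch below splits a report body into prose content, fenced code blocks, and shell commands. The function and its rules are hypothetical and far simpler than whatever extraction MHNurf actually performs.

```python
import re

FENCE = "`" * 3  # a GitHub-Markdown code fence (triple backtick)

def split_facets(body):
    """Separate fenced code blocks from prose in a report body.
    (Hypothetical helper; MHNurf's real extraction rules may differ.)"""
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    code_blocks = re.findall(pattern, body, re.DOTALL)
    prose = re.sub(pattern, " ", body, flags=re.DOTALL)
    # Shell commands are often shown with a leading '$' inside fences.
    commands = [ln.strip() for block in code_blocks
                for ln in block.splitlines() if ln.strip().startswith("$")]
    return {"content": prose.strip(), "code": code_blocks,
            "command": commands}

report = ("Training is 10x slower after upgrading.\n\n"
          + FENCE + "python\nmodel.fit(x, y)\n" + FENCE + "\n\n"
          + FENCE + "\n$ pip install tensorflow==2.4\n" + FENCE + "\n")
facets = split_facets(report)
print(len(facets["code"]), len(facets["command"]))  # 2 1
```

Keeping these facets separate is what lets a model learn, say, natural-language semantics from the content and token-level patterns from the code, rather than mixing the two.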
Our work here is to automatically identify those GitHub reports
that are non-functional bug related for a DL framework. This does
not have to be done immediately when a report is submitted; it can
also be achieved within a short period after submission, as long as
it is before the assignment. This can then provide useful
information for the bot (or human) who assigns the GitHub reports
in terms of which contributor to assign and how to prioritize them,
hence saving more