Multifaceted Hierarchical Report Identification for
Non-Functional Bugs in Deep Learning Frameworks
Guoming Long
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
g.long@lboro.ac.uk
Tao Chen∗
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
t.t.chen@lboro.ac.uk
Georgina Cosma
Department of Computer Science
Loughborough University
Loughborough, United Kingdom
g.cosma@lboro.ac.uk
∗Corresponding author
Abstract—Non-functional bugs (e.g., performance- or
accuracy-related bugs) in Deep Learning (DL) frameworks can
lead to some of the most devastating consequences. Reporting
those bugs on a repository such as GitHub is a standard route
to fix them. Yet, given the growing number of new GitHub
reports for DL frameworks, it is intrinsically difficult for
developers to distinguish those that reveal non-functional bugs
from the others, and assign them to the right contributor
for investigation in a timely manner. In this paper, we propose
MHNurf — an end-to-end tool for automatically identifying
non-functional bug related reports in DL frameworks. The core
of MHNurf is a Multifaceted Hierarchical Attention Network
(MHAN) that tackles three unaddressed challenges: (1) learning
the semantic knowledge, but doing so by (2) considering the
hierarchy (e.g., words/tokens in sentences/statements) and
focusing on the important parts (i.e., words, tokens, sentences,
and statements) of a GitHub report, while (3) independently
extracting information from different types of features, i.e.,
content, comment, code, command, and label.
To evaluate MHNurf, we leverage 3,721 GitHub reports from
five DL frameworks for conducting experiments. The results show
that MHNurf works the best with a combination of content,
comment, and code, which considerably outperforms the classic
HAN where only the content is used. MHNurf also produces
significantly more accurate results than nine other state-of-the-
art classifiers with strong statistical significance, i.e., up to 71%
AUC improvement, and achieves the best Scott-Knott rank on four frameworks while ranking 2nd on the remaining one. To facilitate repro-
duction and promote future research, we have made our dataset,
code, and detailed supplementary results publicly available at:
https://github.com/ideas-labo/APSEC2022-MHNurf.
Index Terms—Bug Report Analysis, Deep Learning, Natural
Language Processing, Software Maintenance, Performance Bug
I. INTRODUCTION
Deep learning (DL), which is a class of machine intelligence algorithms that mimic the workings of the human brain in processing data [1], has been gaining momentum in both
academia and industry [2, 3, 4, 5, 6]. As such, several
well-known DL frameworks (e.g., TensorFlow, Keras, and PyTorch) have been created and maintained on GitHub, aiming to provide effective and readily available APIs for seamlessly adopting DL algorithms to real-world problems.
Despite the success of DL frameworks, they inevitably con-
tain bugs, which, if left unfixed, would propagate issues to any
applications that were built on top of them [7]. Among other
bugs, there exist non-functional bugs that have no explicit symptoms of exceptions (such as a Not-a-Number error or a program crash), i.e., they cannot be judged using a precise oracle. Common examples of non-functional bugs are performance- or accuracy-related bugs (which are the focus of this work): from the perspective of the DL frameworks, it is typically hard to determine, without thorough investigation, how “slow” or how “inaccurate” the results must be to count as a bug, and therefore such bugs are more challenging to analyze. However, those non-functional
bugs tend to cause some of the most devastating outcomes
and hence are of great concern [8, 9]. Indeed, according to
the U.S. National Transportation Safety Board (NTSB), a recent accident involving Uber’s self-driving car was caused by a non-functional bug in its DL framework, which classified the pedestrian as an unknown object and reacted slowly1.
To deal with bugs, it is standard Software Engineering practice for DL frameworks to allow users to submit a report on repositories like GitHub, which is then assigned to a contributor for formal investigation and an attempt to fix the bug, if any [10]. Identifying whether a report is related to a non-functional bug (among its functional counterparts) is a labor-intensive process. This is because, firstly, the number of new reports increases dramatically. For example, there are around 700 new GitHub reports for TensorFlow per month on average2, including bug-related ones and those for other purposes, such as feature requests and help seeking. Secondly, GitHub reports can be lengthy, e.g., there can be up to 332 sentences per
report on average [11]. Finally, given the vague nature of non-functional bugs, it is fundamentally difficult to understand whether the related reports really reflect bugs. The above means that, when
assigning or prioritizing the GitHub reports, it can take a long
time for developers to read and understand the bug reports,
hence delaying the potential fixes to the destructive non-
functional bugs, especially when some of the key messages
are deeply hidden inside.
In light of the above, the problem we focus on in this paper
is the following: given a GitHub report for a DL framework,
1https://tinyurl.com/ykufbpey.
2https://github.com/tensorflow/tensorflow/pulse/monthly