Unveiling Hidden DNN Defects with Decision-Based
Metamorphic Testing∗
Yuanyuan Yuan
The Hong Kong University of Science
and Technology
Hong Kong, China
yyuanaq@cse.ust.hk
Qi Pang
The Hong Kong University of Science
and Technology
Hong Kong, China
qpangaa@cse.ust.hk
Shuai Wang†
The Hong Kong University of Science
and Technology
Hong Kong, China
shuaiw@cse.ust.hk
Abstract
Contemporary DNN testing works are frequently conducted using metamorphic testing (MT). In general, de facto MT frameworks mutate DNN input images using semantics-preserving mutations and determine if DNNs can yield consistent predictions. Nevertheless, we find that DNNs may rely on erroneous decisions (certain components of the DNN inputs) to make predictions, which may still yield consistent outputs by chance. Such DNN defects would be neglected by existing MT frameworks. Erroneous decisions, however, would likely result in successive mis-predictions over diverse images that may exist in real-life scenarios.

This research aims to unveil the pervasiveness of hidden DNN defects caused by incorrect DNN decisions (but retaining consistent DNN predictions). To do so, we tailor and optimize modern eXplainable AI (XAI) techniques to identify visual concepts that represent regions in an input image upon which the DNN makes predictions. Then, we extend existing MT-based DNN testing frameworks to check the consistency of DNN decisions made over a test input and its mutated inputs. Our evaluation shows that existing MT frameworks are oblivious to a considerable number of DNN defects caused by erroneous decisions. We conduct human evaluations to justify the validity of our findings and to elucidate their characteristics. Through the lens of DNN decision-based metamorphic relations, we re-examine the effectiveness of metamorphic transformations proposed by existing MT frameworks. We summarize lessons from this study, which can provide insights and guidelines for future DNN testing.
CCS Concepts
• Software and its engineering → Software testing and debugging.
∗The extended version of the ASE 2022 paper [66].
†Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE '22, October 10–14, 2022, Rochester, MI, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9475-8/22/10...$15.00
https://doi.org/10.1145/3551349.3561157
Keywords
Deep learning testing
ACM Reference Format:
Yuanyuan Yuan, Qi Pang, and Shuai Wang. 2022. Unveiling Hidden DNN Defects with Decision-Based Metamorphic Testing. In 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22), October 10–14, 2022, Rochester, MI, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3551349.3561157
1 Introduction
Metamorphic testing (MT) [7] has achieved major success in comprehensively testing deep neural networks (DNNs) without manually annotating test inputs [36]. Given the inherent difficulty of defining explicit testing oracles for DNN models [68], a DNN is often tested using well-designed metamorphic relations (MRs): DNN inputs are mutated into new test cases in a semantics-preserving manner¹, and DNN predictions over an input and its mutated inputs are compared for consistency. DNN defects are characterized as violations of DNN prediction consistency. However, despite the major success of checking prediction consistency, we pose the following key question to motivate this research:

Is it always the case that a consistent prediction indicates no DNN defect?

¹In this paper, semantics-preserving denotes that the contents in inputs and mutated inputs are visually consistent, e.g., a cat is still a cat.
In this research, we refer to the DNN's focus on critical input components as its decisions. Accordingly, the DNN relies on such decisions to make predictions (i.e., its outputs), e.g., classifying an input image. Then, consider Fig. 1, in which we illustrate how a contemporary MT framework misses a DNN defect. The tested DNN predicts “hummingbird” for Fig. 1(a), and its utilized decisions in Fig. 1(a) are marked in Fig. 1(b), depicting the correct scope of a hummingbird's head and body. When Fig. 1(a) is rotated as in Fig. 1(c), the DNN still predicts “hummingbird.” Thus, existing MRs based on DNN output consistency would regard the DNN as “correct” for this case. Nevertheless, as in Fig. 1(d), the underlying DNN decision is specious, as it is based on a flower whose contour is similar to the contour of the flying hummingbird in Fig. 1(b).
Figure 1: The DNN makes inconsistent, erroneous decisions while happening to retain the same prediction. (a) hummingbird image; (b) decision based on the hummingbird (predict hummingbird based on a hummingbird); (c) rotated image; (d) incorrect decision (predict hummingbird based on a flower). We simplify the decision regions for readability.

Our preliminary study shows that existing MT-based DNN testing frameworks, when only checking the consistency of DNN predictions, may overlook DNN defects due to incorrect DNN decisions, i.e., relying on specious components in the DNN inputs for predictions. As revealed in this research, such incorrect decisions do not always result in inconsistent DNN outputs, especially when the DNN is trained on a dataset with limited labels (e.g., two-label classification), because the DNN prediction is forced to choose among pre-defined labels. Consider a DNN ϕ that is trained on evenly distributed data and performs a two-class (cat vs. dog) classification task. Assume that when random noise is applied to an image i, ϕ ignores the cat in i and randomly guesses a label; it still has a 50% chance of predicting the correct label. Despite the fact that the tested DNN is flawed, it is nonetheless considered “robust to noise” in many of these cases.
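The 50% figure is easy to reproduce. The following minimal sketch (our illustration; flawed_predict is a purely hypothetical stand-in for such a DNN) simulates the scenario and shows that a prediction-only MR passes about half of the noised inputs:

import random

random.seed(0)

def flawed_predict(image: dict, noisy: bool) -> str:
    # Hypothetical flawed two-class DNN: correct on clean inputs, but it
    # ignores the image content and guesses uniformly once noise is added.
    return random.choice(["cat", "dog"]) if noisy else image["label"]

trials, consistent = 10_000, 0
for _ in range(trials):
    image = {"label": "cat"}
    l1 = flawed_predict(image, noisy=False)   # source input
    l2 = flawed_predict(image, noisy=True)    # mutated (noised) input
    consistent += (l1 == l2)

# Prints roughly 50%: half of the mutated inputs keep the prediction by
# chance, so a prediction-only MR would deem the DNN "robust" on them.
print(f"prediction-consistent: {consistent / trials:.1%}")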
Moreover, we clarify that specious DNN decisions are hard to detect using only the DNN predictions and confidence scores. A well-trained DNN predicts a pre-defined label with a confidence score that is typically much higher than those of the other labels, even when given random noise as input. Thus, even if a DNN is making erroneous decisions, its outputs and accompanying confidence scores often lack an evident “pattern.” This difficulty is also highlighted in prior literature [22, 43].
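This behavior can be probed directly. The sketch below is our probe, not an experiment from the paper; exact scores vary by model. It feeds random noise to a pretrained torchvision classifier and reports the top-1 softmax scores, which are typically far above the 1/1000 uniform baseline for ImageNet:

import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
noise = torch.rand(8, 3, 224, 224)   # eight "empty" images of random pixels
with torch.no_grad():
    probs = torch.softmax(model(noise), dim=1)
# Neither the predicted label nor the confidence score exposes that the
# decision behind the prediction is specious.
print([f"{p:.3f}" for p in probs.max(dim=1).values.tolist()])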
Overall, this work deems the inconsistent DNN decisions exposed by MT specious and undesirable, as it is likely that a DNN, even if it happens to correctly label an image (Fig. 1(c)), will eventually mis-predict hummingbirds in the pervasive images existing in real-world scenarios. As a result, we argue that failing to account for DNN decision defects may jeopardize the reliability of present MT frameworks. We advocate that proper MRs should take DNN decisions into consideration, rather than merely checking DNN predictions.
This work advocates extending DNN prediction-based consistency checking, which is extensively used in current MT, with decision-based consistency checking. The enhancement is orthogonal to the particular metamorphic transformations (e.g., image pixel or affine transformations) implemented in existing MT-based DNN testing frameworks, and can be smoothly incorporated by them. Given a test image i, we extract the decision, denoting regions in i, to depict how the DNN makes predictions over i. Each region is referred to as a visual concept (e.g., a nose or a wheel), and DNN predictions can be formulated as a voting scheme among visual concepts [18, 34]. To obtain visual concepts, we first use eXplainable AI (XAI) techniques to identify pixels in i that positively contribute to the DNN prediction. Then, we tailor and optimize a set of image processing techniques to construct visual concepts from XAI-identified pixels. We carefully reduce the inherent inaccuracy of XAI techniques, and largely enhance the readability of the identified visual concepts.
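For concreteness, the following minimal Python sketch (ours, not part of the released artifact [1]) shows where decision-based checking slots into an MT loop; extract_decision and the plain set-equality comparison are placeholders for the XAI-based concept extraction and matching detailed in Sec. 4:

from typing import Any, Callable, Set

def decision_mr_check(
    predict: Callable[[Any], str],                # DNN: input -> label
    extract_decision: Callable[[Any], Set[str]],  # XAI: input -> concepts
    mutate: Callable[[Any], Any],                 # semantics-preserving MRt
    i1: Any,
) -> str:
    i2 = mutate(i1)
    l1, l2 = predict(i1), predict(i2)
    d1, d2 = extract_decision(i1), extract_decision(i2)
    if l1 != l2:
        return "defect: inconsistent predictions (existing MRr)"
    if d1 != d2:
        return "hidden defect: inconsistent decisions (our MRr)"
    return "consistent"

# Toy stand-ins mirroring Fig. 1: the label survives rotation, but the
# decision flips from the hummingbird to a flower.
predict = lambda x: "hummingbird"
concepts = lambda x: {"hummingbird"} if x == "original" else {"flower"}
print(decision_mr_check(predict, concepts, lambda x: "rotated", "original"))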
By extending existing MT frameworks to support decision-based consistency checking, we uncover many overlooked defects triggered by inputs that result in inconsistent decisions but identical predictions. Our findings are justified by large-scale and comprehensive (in total 10,000) human evaluations, where the participants are 15 Ph.D. students with research experience related to DNNs and 10 other Ph.D. and masters students of various backgrounds. Our study encompasses ten DNNs over three datasets of different scales and types (e.g., RGB and black-and-white images), all of which are popular in daily usage and have been extensively tested by previous DNN testing research. We summarize key lessons of this research, illustrating that existing MT, when only checking DNN prediction consistency, may over-estimate the reliability of DNNs. We also assess the strength of metamorphic transformations (e.g., pixel mutation vs. adversarial perturbation) proposed by existing work through the lens of our novel DNN decision view. Our findings and summarized lessons can provide insights for follow-up enhancement of DNN testing. In sum, we make the following contributions:
• We advocate that existing MT-based DNN testing should consider how a DNN makes decisions rather than merely checking predictions. Accordingly, we extend existing MRs by checking decision consistency to reveal DNN defects overlooked by existing works.
• Technically, we recast a DNN prediction as the outcome of a voting process among visual concepts in an input. We tailor and optimize image processing schemes to summarize visual concepts from image pixels positively contributing to DNN predictions.
• Our study and human evaluation illustrate that many defects have been overlooked when only checking DNN prediction consistency. Our findings provide guidelines for users to calibrate MT-based DNN testing results, and also highlight further improvements that can be made in DNN testing.
Artifact Availability. To support results verification and follow-up research comparison, we released code, data, and supplementary materials at https://github.com/Yuanyuan-Yuan/Decision-Oracle [1].
2 Preliminary and Motivation
2.1 Metamorphic Testing
DNNs are typically used to answer unknown questions, where they are anticipated to behave similarly to humans [68]. Given the diversity of possible inputs encountered in real-life scenarios, obtaining ground truth predictions in advance to assess DNN correctness is difficult, if not impossible. Furthermore, even human experts may disagree on the expected outputs of certain edge cases.
MT is extensively employed to test DNNs without the need for ground truth or explicitly defined testing oracles [7]. Overall, each MR in MT composes a metamorphic transformation MRt and a relation MRr: each MRt specifies a mutation scheme over a source input to generate a follow-up test input, and the associated MRr defines the relationship of expected outputs over the source and the mutated input [49]. For instance, to test sin(x), we can construct an MR such that its MRt mutates an input x into π − x, and the MRr checks the equality relation sin(x) = sin(π − x). In real-world usage, MRr usually denotes invariant program properties. MRr should always hold when arbitrarily mutating x using MRt, and a bug in sin(x) is detected whenever MRr is violated.
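As a runnable recap of this MR (our own example code, mirroring the sin(x) description above):

import math, random

def mr_t(x: float) -> float:
    return math.pi - x            # metamorphic transformation: x -> pi - x

def mr_r(y1: float, y2: float, eps: float = 1e-9) -> bool:
    return abs(y1 - y2) < eps     # relation: sin(x) == sin(pi - x)

random.seed(42)
for _ in range(1000):
    x = random.uniform(-100.0, 100.0)
    if not mr_r(math.sin(x), math.sin(mr_t(x))):
        print(f"MR violated at x={x}: potential bug in sin")
        break
else:
    print("no violation: MRr held for all 1000 mutated inputs")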
MT achieves major success in testing DNN models and infrastructures [10, 13, 14, 33, 37, 38, 42, 59, 62–65, 67, 69]. Given that DNN inputs are often images, MRs in this field are often constructed to perform lightweight, semantics-preserving (visually consistent) image mutations MRt from different angles (see Sec. 5 for a literature review of the MRt designed in previous works). MRr is defined in a simple and unified manner such that DNN predictions should be consistent over an input image and its follow-up image generated using MRt. Thus, violations of MRr, denoting inconsistent DNN predictions, are DNN defects.
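In code, such a prediction-based MRr check is only a few lines. The sketch below is ours, assuming a pretrained torchvision classifier and a rotation MRt; the random tensor is a placeholder for a real, properly normalized test image:

import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def predict(x: torch.Tensor) -> int:
    with torch.no_grad():
        return model(x).argmax(dim=1).item()

i1 = torch.rand(1, 3, 224, 224)       # placeholder source test image
i2 = TF.rotate(i1, angle=90.0)        # MRt: semantics-preserving rotation
# MRr: predictions over the source and follow-up inputs must agree.
print("defect detected" if predict(i1) != predict(i2) else "consistent")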
Table 1: Four MRr based on DNN decisions (D1, D2) and predictions (L1, L2) over an input and its mutated input.

              ①          ②          ③          ④
Decisions     D1 = D2    D1 ≠ D2    D1 ≠ D2    D1 = D2
Predictions   L1 = L2    L1 ≠ L2    L1 = L2    L1 ≠ L2
No defect?    ✓          ✗          ✗          NA
2.2 Forming MRr with DNN Decisions
Without knowledge of a DNN's decision procedure, we argue that relying merely on its output (which is how existing MRr are formed) may result in the omission of some defects. Given a pair of inputs i1 and i2 (i2 is mutated from i1 using an MRt), suppose the DNN yields prediction L1 based on decision D1, and yields L2 based on decision D2.² Then, we have four combinations of decisions/predictions, as in Table 1. ① denotes a correct prediction (from the perspective of MT), whereas ② represents that the DNN provides inconsistent predictions L1 ≠ L2. As introduced in Sec. 2.1, existing MT frameworks rely on ② to form MRr, and we clarify that ④ is not feasible: D1 = D2 with L1 ≠ L2 violates the nature of a DNN, since the prediction is derived from the decision, and identical decisions cannot yield different predictions.

²D is formed by identifying the DNN's decision over the input i; see Sec. 4 for details.
We explore a new focus to form MRr, as in ③, where DNNs make inconsistent decisions (D1 ≠ D2), but still happen to retain the same prediction (L1 = L2). We deem these hidden DNN defects that are incorrectly overlooked by existing works. Suppose a DNN ϕ answers whether hummingbirds appear in an image. ϕ is trained on a biased dataset where all hummingbirds hover in the air, and therefore, ϕ wrongly relies on “vertical objects” to recognize hummingbirds. The image in Fig. 1(a) is properly predicted by ϕ as “yes” due to the hovering hummingbird. After rotating this image by 90 degrees as in Fig. 1(c), we find that ϕ still responds “yes,” but makes its decision based on the vertically presented flower in Fig. 1(d), which shares a similar contour with most hummingbirds (e.g., by comparing with the contour in Fig. 1(b)). In fact, when we manually retain only the visual concept in Fig. 1(d) and erase the remaining components of the image, we confirm that ϕ still predicts the image as a “hummingbird.” Moreover, while ϕ is obviously susceptible to rotation, MRr based on ② cannot uncover the defect. Nevertheless, MRr based on ③ can unveil this hidden flaw.
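Table 1 and the discussion above reduce to a small lookup; the recap below (our own code) classifies an input pair by decision and prediction consistency, with the hummingbird case landing in ③:

VERDICT = {
    # (decisions equal?, predictions equal?) -> verdict, per Table 1
    (True, True):   "no defect (①)",
    (False, False): "defect: inconsistent predictions (②, existing MT)",
    (False, True):  "hidden defect: specious decision (③, this work)",
    (True, False):  "infeasible for a DNN (④)",
}

def classify(d1, d2, l1, l2) -> str:
    return VERDICT[(d1 == d2, l1 == l2)]

# Fig. 1: decisions differ (hummingbird vs. flower) yet the label persists.
print(classify({"hummingbird"}, {"flower"}, "hummingbird", "hummingbird"))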
Paper Structure. In the rest of this paper, we formulate D in Sec. 2.3 and present technical solutions to constitute D in Sec. 4. We review the literature of MT-based DNN testing and the MRt proposed therein in Sec. 5. Sec. 6 unveils the pervasiveness of hidden defects falling in ③ with empirical results.
Incompliance of Ground Truth. Following the notation above, let the ground truth prediction be L_G. It is widely seen that MT may result in false negatives due to L_G ≠ (L1 = L2); that is, a DNN makes consistent albeit incorrect predictions over i1 and i2. Similarly, let the ground truth decision be D_G; we clarify that false negatives may occur in case D_G ≠ (D1 = D2). This may be due to the incorrect (albeit consistent) decisions made by a DNN, or to analysis errors of our employed XAI algorithms. Overall, MT inherently omits considering D_G ≠ (D1 = D2); detecting such flaws likely requires human annotations, which is highly costly in real-world settings. On the other hand, as empirically assessed in Sec. 6.1, the D obtained in this work is accurate.
2.3 DNN Decision: A Pixel-Based View
We now introduce how a DNN makes decisions. Aligned with previous research, this paper primarily considers testing DNN image classifiers, and the following introduction accordingly uses image classification as an example. Many common DNN tasks are rooted in accurate image classification (see further discussion in Sec. 7). We first define Empty and Valid inputs below.
Denition 1
(
Empty
)
.
An input is empty if its components are mean-
ingless for humans, e.g., an image with random pixel values.
Denition 2
(
Valid
)
.
An input is valid if its components are mean-
ingful for humans, e.g., an image with human-recognizable objects.
Given an empty image ∅, a well-trained DNN ϕ will have to randomly predict a confidence score for each class, and the score for class l is ϕ(∅)_l. A valid input image i can be viewed as introducing the appearances of its components by changing pixel values over ∅, namely, setting i = ∅ + δ. Accordingly, the output confidence score for class l is transformed into ϕ(i)_l = ϕ(∅)_l + Δ_l given all these appearances in the input. The machine learning community generally views this procedure as a collaborative game among the pixels of i [2, 3, 9, 35, 41, 48, 53, 56]. The true contribution of each pixel can be computed via the Shapley value [51], a well-established solution in game theory. We present how to use the Shapley value to attribute Δ_l to δ below in Definition 3. We then discuss its approximation and present a cost analysis.
Denition 3
(
Attribution
)
.
Let each pixel change be
𝛿𝑝
and
Í𝛿𝑝=
𝛿
. Then, an
attribution
of
Δ𝑙
assigns a contribution score
𝑐𝑝
to each
𝛿𝑝, such that Í𝑐𝑝=Δ𝑙, where 𝑝represents one pixel.
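To ground Definition 3, the toy below (ours) computes exact Shapley values for a hypothetical three-pixel “model” and verifies the defining property Σ c_p = Δ_l; even this brute-force enumeration previews the exponential cost discussed next:

from itertools import combinations
from math import factorial

PIXELS = ["p1", "p2", "p3"]

def phi(on: set) -> float:
    # Hypothetical confidence for class l: p1 adds 0.5, p2 adds 0.3,
    # and p1 together with p3 adds another 0.2 (an interaction term).
    score = 0.0
    if "p1" in on: score += 0.5
    if "p2" in on: score += 0.3
    if "p1" in on and "p3" in on: score += 0.2
    return score

def shapley(p: str) -> float:
    others = [q for q in PIXELS if q != p]
    n, c_p = len(PIXELS), 0.0
    for k in range(len(others) + 1):       # enumerate all subsets: 2^|X| cost
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            c_p += weight * (phi(set(subset) | {p}) - phi(set(subset)))
    return c_p

scores = {p: shapley(p) for p in PIXELS}   # contributions c_p
delta_l = phi(set(PIXELS)) - phi(set())    # Delta_l
print(scores, sum(scores.values()), delta_l)  # sum of c_p equals Delta_l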
From Pixel-Wise Contributions to Decision D. A pixel p positively supports the DNN prediction for class l if its contribution c_p > 0. Therefore, collecting all pixels with positive contributions can help scope the decision D upon which the DNN ϕ relies when processing i and predicting l. Instead of using raw pixels, however, we abstract further and group the pixels with positive contributions into visual concepts (e.g., a nose or a wheel) in i, such that a DNN's predictions can be decomposed as a voting scheme among visual concepts. Each decision D comprises all of its visual concepts. We explain how visual concepts are generated from pixels in Sec. 2.4.
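One plausible realization of this grouping (our sketch; the paper's actual concept construction appears in Sec. 4) keeps the positively contributing pixels and treats their connected components as candidate visual concepts, each voting with its summed contribution:

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
attribution = rng.standard_normal((224, 224))  # stand-in per-pixel c_p map
positive = attribution > 0                     # pixels with c_p > 0
labels, n = ndimage.label(positive)            # group adjacent positive pixels
concepts = []
for cid in range(1, n + 1):
    mask = labels == cid
    if mask.sum() < 20:                        # drop tiny, noisy fragments
        continue
    concepts.append({"mask": mask, "vote": float(attribution[mask].sum())})
print(len(concepts), "candidate visual concepts found")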
Approximating Shapley Value in XAI. As aforementioned, each pixel in an image is considered a player in the collaborative game (i.e., making a prediction). Let all pixels in an image be X; calculating the exact Shapley value then requires considering every subset of X, which incurs a computational cost of 2^|X| and is infeasible in practice. Nevertheless, modern attribution-based XAI [35] has enabled practical approximation of the Shapley value. In this research, we use DeepLIFT [53], a popular XAI tool, to identify pixels p in an image that positively contribute to the decision of a DNN. Though recent works may identify more precise attributions than DeepLIFT, their computation is usually expensive [8, 35, 53].
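As an attribution sketch (the paper names DeepLIFT [53] but no specific library; Captum is our assumption here, and the tiny CNN is a stand-in for a real classifier), the positively contributing pixels of Definition 3 can be obtained as follows:

import torch
import torch.nn as nn
from captum.attr import DeepLift

model = nn.Sequential(                          # stand-in image classifier
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).eval()
image = torch.rand(1, 3, 32, 32)                # placeholder valid input i
target = model(image).argmax(dim=1).item()      # predicted class l
baseline = torch.zeros_like(image)              # the "empty" input
attr = DeepLift(model).attribute(image, baselines=baseline, target=target)
positive = attr.sum(dim=1) > 0                  # pixels with c_p > 0
print("positively contributing pixels:", int(positive.sum().item()))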