Unveiling Hidden DNN Defects with Decision-Based
Metamorphic Testing∗
Yuanyuan Yuan
The Hong Kong University of Science
and Technology
Hong Kong, China
yyuanaq@cse.ust.hk
Qi Pang
The Hong Kong University of Science
and Technology
Hong Kong, China
qpangaa@cse.ust.hk
Shuai Wang†
The Hong Kong University of Science
and Technology
Hong Kong, China
shuaiw@cse.ust.hk
Abstract
Contemporary DNN testing works are frequently conducted using
metamorphic testing (MT). In general, de facto MT frameworks
mutate DNN input images using semantics-preserving mutations
and determine if DNNs can yield consistent predictions. Nevertheless,
we find that DNNs may rely on erroneous decisions (certain
components on the DNN inputs) to make predictions, which may still
retain the outputs by chance. Such DNN defects would be neglected
by existing MT frameworks. Erroneous decisions, however, would
likely result in successive mis-predictions over diverse images that
may exist in real-life scenarios.
This research aims to unveil the pervasiveness of hidden DNN
defects caused by incorrect DNN decisions (but retaining consistent
DNN predictions). To do so, we tailor and optimize modern eXplainable
AI (XAI) techniques to identify visual concepts that represent
regions in an input image upon which the DNN makes predictions.
Then, we extend existing MT-based DNN testing frameworks to
check the consistency of DNN decisions made over a test input and
its mutated inputs. Our evaluation shows that existing MT frameworks
are oblivious to a considerable number of DNN defects caused
by erroneous decisions. We conduct human evaluations to justify
the validity of our findings and to elucidate their characteristics.
Through the lens of DNN decision-based metamorphic relations,
we re-examine the effectiveness of metamorphic transformations
proposed by existing MT frameworks. We summarize lessons from
this study, which can provide insights and guidelines for future
DNN testing.
CCS Concepts
• Software and its engineering → Software testing and debugging.
∗The extended version of the ASE 2022 paper [66].
†Corresponding Author
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
ASE ’22, October 10–14, 2022, Rochester, MI, USA
©2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9475-8/22/10. . . $15.00
https://doi.org/10.1145/3551349.3561157
Keywords
Deep learning testing
ACM Reference Format:
Yuanyuan Yuan, Qi Pang, and Shuai Wang. 2022. Unveiling Hidden DNN
Defects with Decision-Based Metamorphic Testing. In 37th IEEE/ACM
International Conference on Automated Software Engineering (ASE ’22),
October 10–14, 2022, Rochester, MI, USA. ACM, New York, NY, USA, 13 pages.
https://doi.org/10.1145/3551349.3561157
1 Introduction
Metamorphic testing (MT) [7] has achieved major success in
comprehensively testing deep neural networks (DNNs) without manually
annotating test inputs [36]. Given the inherent difficulty of defining
explicit testing oracles for DNN models [68], DNNs are often tested
using well-designed metamorphic relations (MRs): DNN inputs are
mutated into new test cases in a semantics-preserving manner¹,
and DNN predictions over an input and its mutated inputs are
compared for consistency. DNN defects are characterized as violations
of DNN prediction consistency. However, despite the major success
of checking prediction consistency, we pose the following key
question to motivate this research:
“Is it always the case that a consistent prediction indi-
cates no DNN defect?”
In this research, we refer to the DNN’s focus on critical input
components as its decisions. Accordingly, a DNN relies on such
decisions to make predictions (i.e., its outputs), e.g., classifying an
input image. Then, consider Fig. 1, in which we illustrate how a
contemporary MT framework misses a DNN defect. The tested
DNN predicts “hummingbird” for Fig. 1(a), and its utilized decisions
in Fig. 1(a) are marked in Fig. 1(b), depicting the correct scope
of a hummingbird’s head and body. When Fig. 1(a) is rotated as
in Fig. 1(c), the DNN still predicts “hummingbird.” Thus, existing
MRs based on DNN output consistency would regard the DNN as
“correct” for this case. Nevertheless, as in Fig. 1(d), the underlying
DNN decision is specious, as it is based on a flower whose contour
is similar to the contour of the flying hummingbird in Fig. 1(b).
Our preliminary study shows that existing MT-based DNN testing
frameworks, when only checking the consistency of DNN predictions,
may overlook DNN defects due to incorrect DNN decisions,
i.e., relying on specious components in the DNN inputs for
predictions. As revealed in this research, such incorrect decisions do
not always result in inconsistent DNN outputs, especially when
¹ In this paper, semantics-preserving denotes that the contents in inputs and mutated
inputs are visually consistent, e.g., a cat is still a cat.
arXiv:2210.04942v1 [cs.SE] 10 Oct 2022
[Figure 1 panels: (a) hummingbird image; (b) decision based on the
hummingbird, prediction “hummingbird”; (c) rotated image; (d) incorrect
decision based on a flower, prediction “hummingbird”.]
Figure 1: DNN is making inconsistent, erroneous decisions
while happening to retain the same prediction. We simplify
the decision regions for readability.
the DNN is trained on a dataset with limited labels (e.g., two-label
classification), because DNN prediction is forced to choose among
pre-defined labels. Consider a DNN $\phi$ that is trained on evenly
distributed data and is performing a two-class (cat vs. dog)
classification task. Assume that when random noise is applied to an image
$i$, $\phi$ ignores the cat in $i$ and randomly guesses a label. That is, it
still has a 50% chance of predicting the correct label. Despite the
fact that the tested DNN is flawed, it is nonetheless considered as
“robust to noise” in many of these cases.
Moreover, we clarify that specious DNN decisions are hard to
detect using only the DNN predictions and confidence scores. A
well-trained DNN predicts a pre-defined label with a confidence
score that is typically much higher than the other labels, even when
given random noise as inputs. Thus, even if a DNN is making
erroneous decisions, its outputs and accompanying confidence scores
often lack an evident “pattern.” This difficulty is also highlighted in
prior literature [22, 43].
Overall, this work deems inconsistent DNN decisions exposed
by MT to be specious and undesirable, as it is likely that a DNN, even
if it happens to correctly label an image (Fig. 1(c)), will eventually
mis-predict a hummingbird for pervasive images existing in
real-world scenarios. As a result, we argue that failing to account for
DNN decision defects may jeopardize the reliability of present MT
frameworks. We advocate that proper MRs should take DNN decisions
into consideration, rather than merely checking DNN predictions.
This work advocates extending DNN prediction-based consistency
checking, which is extensively used in current MT, with
decision-based consistency checking. The enhancement is orthogonal
to particular metamorphic transformations (e.g., image pixel
or affine transformations) implemented in existing MT-based DNN
testing frameworks, and can be smoothly incorporated by them.
Given a test image $i$, we extract the decision, denoting regions
in $i$, to depict how the DNN makes a prediction over $i$. Each region is
referred to as a visual concept (e.g., a nose or a wheel), and DNN
predictions can be formulated as a voting scheme among visual
concepts [18, 34]. To obtain visual concepts, we first use eXplainable AI
(XAI) techniques to identify pixels in $i$ that positively contribute to
the DNN prediction. Then, we tailor and optimize a set of image
processing techniques to construct visual concepts from XAI-identified
pixels. We carefully reduce the inherent inaccuracy of XAI techniques,
and largely enhance the readability of identified visual concepts.
By extending existing MT frameworks to support decision-based
consistency checking, we uncover many overlooked defects
triggered by inputs that result in inconsistent decisions but identical
predictions. Our findings are justified by large-scale and
comprehensive (in total 10,000) human evaluations, where the participants
are 15 Ph.D. students having research experience related to DNNs
and 10 other Ph.D. and master’s students of various backgrounds.
Our study encompasses ten DNNs over three datasets of different
scales and types (e.g., RGB and black-white images) which are all
popular in daily usage and have been extensively tested by previous
DNN testing research. We summarize key lessons of this research,
illustrating that existing MT, when only checking DNN prediction
consistency, may over-estimate the reliability of DNNs. We also
assess the strength of metamorphic transformations (e.g., pixel
mutation vs. adversarial perturbation) proposed by existing work
through the lens of our novel DNN decision view. Our findings and
summarized lessons can provide insights for follow-up enhancement
of DNN testing. In sum, we make the following contributions.
• We advocate that existing MT-based DNN testing should consider
  how a DNN makes decisions rather than merely checking
  predictions. Accordingly, we extend existing MRs by checking
  decision consistency to reveal DNN defects overlooked
  by existing works.
• Technically, we recast a DNN prediction as the outcome of a
  voting process among visual concepts in an input. We tailor
  and optimize image processing schemes to summarize visual
  concepts from image pixels positively contributing to DNN
  predictions.
• Our study and human evaluation illustrate that many defects
  have been overlooked when only checking DNN prediction
  consistency. Our findings provide guidelines for users to
  calibrate MT-based DNN testing results, and also highlight
  further improvements that can be made by DNN testing.
Artifact Availability. To support results verification and follow-up
research comparison, we released code, data, and supplementary
materials at https://github.com/Yuanyuan-Yuan/Decision-Oracle [1].
2 Preliminary and Motivation
2.1 Metamorphic Testing
DNNs are typically used to answer unknown questions where they
are anticipated to behave similarly to humans [68]. Given the diversity
of possible inputs encountered in real-life scenarios, obtaining
ground truth predictions in advance to assess DNN correctness is
difficult, if not impossible. Furthermore, even human experts may
disagree on expected outputs of certain edge cases.
MT is extensively employed to test DNNs without the need for
ground truth or explicitly defined testing oracles [7]. Overall, each
MR in MT comprises a metamorphic transformation $MR_t$ and a
relation $MR_r$: each $MR_t$ specifies a mutation scheme over a source
input to generate a follow-up test input, and the associated $MR_r$
defines the relationship of expected outputs over the source and the
mutated input [49]. For instance, to test $\sin(x)$, we can construct
an MR such that its $MR_t$ mutates an input $x$ into $\pi - x$, and the
$MR_r$ checks the equality relation $\sin(x) = \sin(\pi - x)$. In
real-world usage, $MR_r$ usually denotes invariant program properties.
$MR_r$ should always hold when arbitrarily mutating $x$ using $MR_t$,
and a bug in $\sin(x)$ is detected whenever $MR_r$ is violated.
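The sine example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`mr_t`, `mr_r`, `buggy_sin`, `find_violations`), the input range, and the tolerance are hypothetical choices and do not come from the paper.

```python
import math
import random


def mr_t(x: float) -> float:
    """Metamorphic transformation: map the source input x to pi - x."""
    return math.pi - x


def mr_r(f, x: float, tol: float = 1e-9) -> bool:
    """Metamorphic relation: f(x) must equal f(pi - x) up to a tolerance."""
    return math.isclose(f(x), f(mr_t(x)), rel_tol=0.0, abs_tol=tol)


def buggy_sin(x: float) -> float:
    """Hypothetical faulty implementation: a truncated Taylor series."""
    return x - x ** 3 / 6


def find_violations(f, trials: int = 1000, seed: int = 0) -> list:
    """Return the randomly drawn source inputs on which the relation fails."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-10.0, 10.0) for _ in range(trials)]
    return [x for x in inputs if not mr_r(f, x)]
```

Calling `find_violations(math.sin)` should report no violations, whereas `find_violations(buggy_sin)` reports many. Note that no ground-truth value of the sine function is needed at any point, which is what makes MT attractive when explicit oracles are unavailable.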
MT achieves major success in testing DNN models and
infrastructures [10, 13, 14, 33, 37, 38, 42, 59, 62–65, 67, 69]. Given DNN
inputs are often images, MRs in this field are often constructed to
perform lightweight, semantics-preserving (visually consistent) image
mutations $MR_t$ from different angles (see Sec. 5 for a literature
review of $MR_t$ designed in previous works). $MR_r$ is defined in a
simple and unified manner such that DNN predictions should be
consistent over an input image and its follow-up image generated
by using $MR_t$. Thus, violations of $MR_r$, denoting inconsistent DNN
predictions, are DNN defects.
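Table 1: Four $MR_r$ based on DNN decisions ($D_1$, $D_2$) and
predictions ($L_1$, $L_2$) over an input and its mutated input.

              ①                ②                   ③                   ④
Decisions     $D_1 = D_2$      $D_1 \neq D_2$      $D_1 \neq D_2$      $D_1 = D_2$
Predictions   $L_1 = L_2$      $L_1 \neq L_2$      $L_1 = L_2$         $L_1 \neq L_2$
Verdict       No defect?       ✗                   ✗                   NA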
2.2 Forming $MR_r$ with DNN Decisions
Without knowledge of a DNN’s decision procedure, we argue that
relying merely on its output (as how existing $MR_r$ is formed) may
result in the omission of some defects. Given a pair of inputs $i_1$
and $i_2$ ($i_2$ is mutated from $i_1$ using an $MR_t$), suppose the DNN
yields prediction $L_1$ based on decision $D_1$, and yields $L_2$ based
on decision $D_2$.² Then, we have four combinations of
decisions/predictions, as in Table 1. ① denotes a correct prediction
(from the perspective of MT), whereas ② represents that the DNN
provides inconsistent predictions $L_1 \neq L_2$. As introduced in
Sec. 2.1, existing MT frameworks rely on ② to form $MR_r$, and we
clarify that ④ is not feasible: $D_1 = D_2$; $L_1 \neq L_2$ violates the
nature of a DNN.
We explore a new focus to form $MR_r$, as in ③, where DNNs make
inconsistent decisions ($D_1 \neq D_2$), but still happen to retain the same
prediction ($L_1 = L_2$). We deem them as hidden DNN defects that
are incorrectly overlooked by existing works. Suppose a DNN $\phi$
answers if hummingbirds appear in an image. $\phi$ is trained on a biased
dataset where all hummingbirds hover in the air, and therefore, $\phi$
wrongly relies on “vertical objects” to recognize hummingbirds. The
image in Fig. 1(a) is properly predicted by $\phi$ as “yes” due to
the hovering hummingbird. After rotating this image by 90 degrees
as in Fig. 1(c), we find that $\phi$ still responds “yes,” but makes its
decision based on the vertically presented flower in Fig. 1(d), which
shares a similar contour to most hummingbirds (e.g., by comparing with
the contour in Fig. 1(b)). In fact, we manually retain only the visual
concept in Fig. 1(d), and erase the remaining components of the
image. We confirm that $\phi$ predicts the image as a “hummingbird.”
Moreover, while $\phi$ is obviously susceptible to rotation, $MR_r$ based
on ② cannot uncover the defect. Nevertheless, $MR_r$ based on ③ can
unveil this hidden flaw.
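The four decision/prediction combinations can be made concrete with a small sketch. It rests on assumptions that are not taken from the paper: decisions are represented as sets of pixel coordinates already mapped into a common frame (e.g., the rotation undone), and "equal" decisions are approximated by a hypothetical intersection-over-union threshold. The names `iou`, `classify_case`, and `iou_threshold` are illustrative only.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def iou(d1: Set[Pixel], d2: Set[Pixel]) -> float:
    """Intersection-over-union of two decision regions (sets of pixels)."""
    union = d1 | d2
    if not union:
        return 1.0  # two empty regions are treated as identical
    return len(d1 & d2) / len(union)


def classify_case(d1: Set[Pixel], l1: str, d2: Set[Pixel], l2: str,
                  iou_threshold: float = 0.5) -> int:
    """Return which of the four combinations (1 to 4) a test pair falls into.

    Assumption: decisions count as consistent when their IoU reaches
    iou_threshold; the threshold value is hypothetical.
    """
    same_decision = iou(d1, d2) >= iou_threshold
    same_label = l1 == l2
    if same_decision and same_label:
        return 1  # consistent decision and prediction
    if not same_decision and not same_label:
        return 2  # defect visible to prediction-based MR_r
    if not same_decision and same_label:
        return 3  # hidden defect: only decision-based MR_r flags it
    return 4      # same decision, different label: considered infeasible


def is_defect(case: int) -> bool:
    """Combinations 2 and 3 are reported as defects."""
    return case in (2, 3)
```

For the hummingbird scenario, a pair whose labels agree but whose decision regions barely overlap falls into combination 3, so a prediction-only check passes it while the decision-based check reports it.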
Paper Structure. In the rest of this paper, we formulate $D$ in
Sec. 2.3, and present technical solutions to constitute $D$ in Sec. 4.
We review the literature on MT-based DNN testing and the proposed
$MR_t$ in Sec. 5. Sec. 6 unveils the pervasiveness of hidden defects
falling in ③ with empirical results.
Incompliance of Ground Truth. Following the notation above,
let the ground truth prediction be $L_G$. It is widely seen that MT
may result in false negatives due to $L_G \neq (L_1 = L_2)$. That is, a
DNN makes consistent albeit incorrect predictions over $i_1$ and $i_2$.
Similarly, let the ground truth decision be $D_G$; we clarify that false
negatives may occur in case $D_G \neq (D_1 = D_2)$. This may be due
to the incorrect (albeit consistent) decisions made by a DNN, or
the analysis errors of our employed XAI algorithms. Overall, MT
inherently omits considering $D_G \neq (D_1 = D_2)$; detecting such
flaws likely requires human annotations, which is highly costly in
real-world settings. On the other hand, as empirically assessed in
Sec. 6.1, $D$ obtained in this work is accurate.
² $D$ is formed by identifying the DNN’s decision over the input $i$; see Sec. 4 for details.
2.3 DNN Decision: A Pixel-Based View
We now introduce how a DNN makes decisions. Aligned with previous
research, this paper primarily considers testing DNN image
classifiers, and our following introduction uses image classification
as an example accordingly. Many common DNN tasks root from an
accurate image classification (see further discussion in Sec. 7). We
first define the Empty and Valid inputs below.
Definition 1 (Empty). An input is empty if its components are
meaningless for humans, e.g., an image with random pixel values.
Definition 2 (Valid). An input is valid if its components are
meaningful for humans, e.g., an image with human-recognizable objects.
Given an empty image $\emptyset$, a well-trained DNN $\phi$ will have to
randomly predict a confidence score for each class, and the score for
class $l$ is $\phi(\emptyset)_l$. A valid input image $i$ can be viewed as
introducing the appearances of its components by changing pixel values
over $\emptyset$, namely, setting $i = \emptyset + \delta$. Accordingly, the
output confidence score for class $l$ is transformed into
$\phi(i)_l = \phi(\emptyset)_l + \Delta_l$ given all these
appearances in the input. The machine learning community generally
views this procedure as a collaborative game among pixels of $i$
[2, 3, 9, 35, 41, 48, 53, 56]. The true contribution of each pixel can be
computed via the Shapley value [51], a well-established solution
in game theory. We present how to use the Shapley value to attribute
$\Delta_l$ on $\delta$ below in Definition 3. We then discuss its
approximation and present cost analysis.
Definition 3 (Attribution). Let each pixel change be $\delta_p$ and
$\sum_p \delta_p = \delta$. Then, an attribution of $\Delta_l$ assigns a
contribution score $c_p$ to each $\delta_p$, such that
$\sum_p c_p = \Delta_l$, where $p$ represents one pixel.
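To make the attribution property tangible, the following sketch computes exact Shapley values for a three-pixel toy game and checks that the contribution scores sum to the change in confidence. Everything here is an assumption for illustration: `toy_score` is a hypothetical stand-in for the class-$l$ confidence as a function of which pixel changes are applied, not a real DNN, and exhaustive enumeration is only viable at this tiny scale.

```python
from itertools import combinations
from math import factorial


def shapley_values(score, n_pixels: int) -> list:
    """Exact Shapley value of each pixel for a set function score(subset)."""
    players = range(n_pixels)
    values = []
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            # Standard Shapley weight for coalitions of size k.
            weight = (factorial(k) * factorial(n_pixels - k - 1)
                      / factorial(n_pixels))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of pixel p when added to coalition s.
                total += weight * (score(s | {p}) - score(s))
        values.append(total)
    return values


def toy_score(present) -> float:
    """Hypothetical confidence for class l given the applied pixel changes."""
    s = 0.1                      # plays the role of phi(empty)_l
    if 0 in present:
        s += 0.5                 # pixel 0 helps on its own
    if 1 in present and 2 in present:
        s += 0.3                 # pixels 1 and 2 only help together
    if 2 in present:
        s -= 0.1                 # pixel 2 also carries a small penalty
    return s
```

With this toy game, `shapley_values(toy_score, 3)` yields approximately 0.5, 0.15, and 0.05 for the three pixels, and their sum matches `toy_score({0, 1, 2}) - toy_score(set())`, which corresponds to $\Delta_l$ in the definition above.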
From Pixel-Wise Contributions to Decision $D$. A pixel $p$
positively supports the DNN prediction for class $l$ if its contribution
$c_p > 0$. Therefore, collecting all pixels with positive contributions
helps scope the decision $D$ upon which DNN $\phi$ relies when
processing $i$ and predicting $l$. Instead of using pixels, however, we
abstract further to group pixels with positive contributions into
visual concepts (e.g., a nose or a wheel) in $i$, and a DNN’s
predictions can be decomposed as a voting scheme among visual concepts.
Each decision $D$ comprises all of its visual concepts. We explain
how visual concepts are generated among pixels in Sec. 2.4.
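As a rough illustration of the step from pixel scores to regions, the sketch below keeps the pixels with positive contribution and groups adjacent ones. The grouping rule (4-connected components) is a simplifying assumption used here as a stand-in; it is not the paper's procedure for building visual concepts, and the function names are hypothetical.

```python
from collections import deque


def positive_pixels(contrib):
    """Coordinates (row, col) of pixels whose contribution score is positive."""
    return {(r, c)
            for r, row in enumerate(contrib)
            for c, value in enumerate(row)
            if value > 0}


def group_connected(pixels):
    """Group pixels into 4-connected regions via breadth-first search.

    Assumption: connectivity is only a crude proxy for a visual concept.
    """
    remaining = set(pixels)
    groups = []
    while remaining:
        start = remaining.pop()
        group = {start}
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    group.add(nb)
                    queue.append(nb)
        groups.append(group)
    return groups
```

Given a small grid of contribution scores, `group_connected(positive_pixels(grid))` returns one pixel set per region; the union of these regions plays the role of the decision for that input in this simplified view.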
Approximating Shapley Value in XAI. As aforementioned, each
pixel in an image is considered as a player in the collaborative game
(i.e., making a prediction). Let all pixels in an image be $\mathcal{X}$; then
calculating the exact Shapley value requires considering all subsets
of $\mathcal{X}$, which results in a computational cost of $2^{|\mathcal{X}|}$
and is infeasible in practice. Nevertheless, modern attribution-based
XAI techniques [35] have enabled practical approximation of the Shapley
value. In this research, we use DeepLIFT [53], a popular XAI tool, to
identify pixels $p$ in an image that positively contribute to the
decision of a DNN. Though recent works may be able to identify more
precise attributions than DeepLIFT, their computation is usually
expensive [8, 35, 53].
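To give a feel for why approximation is needed, note that even a 28×28 image has 784 pixels, so exact computation would range over $2^{784}$ subsets. The sketch below uses permutation sampling, a generic Monte Carlo estimator of Shapley values; it is explicitly not DeepLIFT and is shown only to illustrate trading exactness for a bounded number of model evaluations. The function name, the `score` callback, and the default sample count are assumptions.

```python
import random


def sampled_shapley(score, n_pixels: int, n_permutations: int = 200,
                    seed: int = 0) -> list:
    """Estimate Shapley values by averaging marginal gains over random orders.

    score(present) is a hypothetical callback returning the class confidence
    when only the pixel changes in the set `present` are applied. The cost is
    n_permutations * (n_pixels + 1) score evaluations instead of 2**n_pixels.
    """
    rng = random.Random(seed)
    estimates = [0.0] * n_pixels
    order = list(range(n_pixels))
    for _ in range(n_permutations):
        rng.shuffle(order)
        present = set()
        previous = score(present)
        for p in order:
            present.add(p)
            current = score(present)
            estimates[p] += current - previous  # marginal gain of pixel p
            previous = current
    return [e / n_permutations for e in estimates]
```

Because the marginal gains along any single ordering telescope, the estimates always sum to the total change in confidence, so the attribution property is preserved even though individual scores are only approximate.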