
Unveiling Hidden DNN Defects with Decision-Based Metamorphic Testing ASE ’22, October 10–14, 2022, Rochester, MI, USA
perform lightweight, semantics-preserving (visually consistent) image mutations $MR_t$ from different angles (see Sec. 5 for a literature review of the $MR_t$ designed in previous works). $MR_r$ is defined in a simple and unified manner such that DNN predictions should be consistent over an input image and its follow-up image generated by using $MR_t$. Thus, violations of $MR_r$, denoting inconsistent DNN predictions, are DNN defects.
Table 1: Four $MR_r$ based on DNN decisions ($D_1$, $D_2$) and predictions ($L_1$, $L_2$) over an input and its mutated input.

                 ①               ②               ③               ④
Decisions        $D_1 = D_2$     $D_1 \ne D_2$   $D_1 \ne D_2$   $D_1 = D_2$
Predictions      $L_1 = L_2$     $L_1 \ne L_2$   $L_1 = L_2$     $L_1 \ne L_2$
Defect?          No              ✗               ✗               NA
2.2 Forming $MR_r$ with DNN Decisions
Without knowledge of a DNN's decision procedure, we argue that relying merely on its output (as existing $MR_r$ are formed) may result in the omission of some defects. Given a pair of inputs $i_1$ and $i_2$ ($i_2$ is mutated from $i_1$ using an $MR_t$), suppose the DNN yields prediction $L_1$ based on decision $D_1$, and prediction $L_2$ based on decision $D_2$.²
Then, we have four combinations of decisions/predictions, as shown in Table 1. ① denotes a correct prediction (from the perspective of MT), whereas ② represents that the DNN provides inconsistent predictions ($L_1 \ne L_2$). As introduced in Sec. 2.1, existing MT frameworks rely on ② to form $MR_r$, and we clarify that ④ is infeasible: $D_1 = D_2$ with $L_1 \ne L_2$ violates the deterministic nature of a DNN.
We explore a new focus to form $MR_r$, as in ③, where DNNs make inconsistent decisions ($D_1 \ne D_2$) but still happen to retain the same prediction ($L_1 = L_2$). We deem these hidden DNN defects that are incorrectly overlooked by existing works. Suppose a DNN $\phi$ answers whether hummingbirds appear in an image. $\phi$ is trained on a biased dataset where all hummingbirds hover in the air; therefore, $\phi$ wrongly relies on "vertical objects" to recognize hummingbirds. The image in Fig. 1(a) is properly predicted by $\phi$ as "yes" due to the hovering hummingbird. After rotating this image by 90 degrees as in Fig. 1(c), we find that $\phi$ still responds "yes," but makes its decision based on the vertically presented flower in Fig. 1(d), which shares a similar contour with most hummingbirds (e.g., by comparing with the contour in Fig. 1(b)). To confirm this, we manually retain only the visual concept in Fig. 1(d) and erase the remaining components of the image; $\phi$ still predicts the image as a "hummingbird." Moreover, while $\phi$ is obviously susceptible to rotation, $MR_r$ based on ② cannot uncover the defect. Nevertheless, $MR_r$ based on ③ can unveil this hidden flaw.
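To make the four cases of Table 1 concrete, the check can be sketched in a few lines of Python. This is only an illustration: it assumes a decision can be represented as a set of visual concepts the DNN relied on (the paper's actual construction of $D$ is given in Sec. 4), and the function name `classify_mr_outcome` is ours, not the paper's.

```python
def classify_mr_outcome(pred1, pred2, decision1, decision2):
    """Map (prediction, decision) pairs for an input and its mutant onto
    the four cases of Table 1. Decisions are modeled as sets of visual
    concepts, a simplifying assumption for illustration."""
    same_pred = pred1 == pred2
    same_dec = decision1 == decision2
    if same_dec and same_pred:
        return "no defect"      # case ①: consistent decision and label
    if not same_dec and not same_pred:
        return "defect"         # case ②: the classic MR_r violation
    if not same_dec and same_pred:
        return "hidden defect"  # case ③: the focus of this work
    return "infeasible"         # case ④: same decision, different label

# The hummingbird example: rotation flips the decisive concept from the
# bird to the flower, yet the predicted label stays "yes".
outcome = classify_mr_outcome("yes", "yes", {"hummingbird"}, {"flower"})
assert outcome == "hidden defect"
```

A tester iterating over mutated inputs would flag both "defect" and "hidden defect" outcomes, whereas output-only MT frameworks can report only the former.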
Paper Structure. In the rest of this paper, we formulate $D$ in Sec. 2.3 and present technical solutions to constitute $D$ in Sec. 4. We review the literature on MT-based DNN testing and the $MR_t$ proposed therein in Sec. 5. Sec. 6 unveils the pervasiveness of hidden defects falling in ③ with empirical results.
Incompliance of Ground Truth. Following the notation above, let the ground truth prediction be $L_G$. It is widely seen that MT may result in false negatives due to $L_G \ne (L_1 = L_2)$; that is, a DNN makes consistent albeit incorrect predictions over $i_1$ and $i_2$. Similarly, letting the ground truth decision be $D_G$, we clarify that false negatives may occur in case $D_G \ne (D_1 = D_2)$. This may be due
² $D$ is formed by identifying the DNN's decision over the input $i$; see Sec. 4 for details.
to the incorrect (albeit consistent) decisions made by a DNN, or to analysis errors of our employed XAI algorithms. Overall, MT inherently omits considering $D_G \ne (D_1 = D_2)$; detecting such flaws likely requires human annotations, which is highly costly in real-world settings. On the other hand, as empirically assessed in Sec. 6.1, the $D$ obtained in this work is accurate.
2.3 DNN Decision: A Pixel-Based View
We now introduce how a DNN makes decisions. Aligned with previous research, this paper primarily considers testing DNN image classifiers, and our following introduction accordingly uses image classification as an example. Many common DNN tasks are rooted in accurate image classification (see further discussion in Sec. 7). We first define the Empty and Valid inputs below.
Definition 1 (Empty). An input is empty if its components are meaningless for humans, e.g., an image with random pixel values.
Definition 2 (Valid). An input is valid if its components are meaningful for humans, e.g., an image with human-recognizable objects.
Given an empty image $\emptyset$, a well-trained DNN $\phi$ will have to randomly predict a confidence score for each class, and the score for class $l$ is $\phi(\emptyset)_l$. A valid input image $i$ can be viewed as introducing the appearances of its components by changing pixel values over $\emptyset$, namely, setting $i = \emptyset + \delta$. Accordingly, the output confidence score for class $l$ is transformed into $\phi(i)_l = \phi(\emptyset)_l + \Delta_l$ given all these appearances in the input. The machine learning community generally views this procedure as a collaborative game among the pixels of $i$ [2, 3, 9, 35, 41, 48, 53, 56]. The true contribution of each pixel can be computed via the Shapley value [51], a well-established solution concept in game theory. We present how to use the Shapley value to attribute $\Delta_l$ onto $\delta$ below in Definition 3. We then discuss its approximation and present a cost analysis.
Definition 3 (Attribution). Let each pixel change be $\delta_p$ and $\sum \delta_p = \delta$. Then, an attribution of $\Delta_l$ assigns a contribution score $c_p$ to each $\delta_p$, such that $\sum c_p = \Delta_l$, where $p$ represents one pixel.
From Pixel-Wise Contributions to Decision $D$. A pixel $p$ positively supports the DNN prediction for class $l$ if its contribution $c_p > 0$. Therefore, collecting all pixels with positive contributions helps scope the decision $D$ upon which DNN $\phi$ relies when processing $i$ and predicting $l$. Instead of using raw pixels, however, we abstract further to group pixels with positive contributions into visual concepts (e.g., a nose or a wheel) in $i$; a DNN's prediction can then be decomposed as a voting scheme among visual concepts. Each decision $D$ comprises all of its visual concepts. We explain how visual concepts are generated from pixels in Sec. 2.4.
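One simple way to realize such a grouping, assuming 4-connectivity is an adequate proxy for spatial coherence, is a flood fill over the positively contributing pixels. The paper's actual concept generation is described in Sec. 2.4; the sketch below is only an illustrative stand-in.

```python
from collections import deque

def positive_regions(contrib, thresh=0.0):
    """Group pixels with contribution > thresh into 4-connected regions,
    a crude stand-in for the visual concepts of Sec. 2.4."""
    h, w = len(contrib), len(contrib[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or contrib[r][c] <= thresh:
                continue
            # Breadth-first flood fill from an unvisited positive pixel.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and contrib[ny][nx] > thresh):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

# A 3x4 contribution map with two disjoint positive blobs,
# i.e., two candidate "visual concepts" voting for the prediction.
cmap = [
    [0.5, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.2],
]
assert len(positive_regions(cmap)) == 2
```

Under this view, a decision $D$ would be the set of regions returned, and comparing $D_1$ with $D_2$ amounts to comparing which regions survive the mutation.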
Approximating the Shapley Value in XAI. As aforementioned, each pixel in an image is considered a player in the collaborative game (i.e., making a prediction). Let all pixels in an image be $\mathcal{X}$; calculating the exact Shapley value then requires considering every subset of $\mathcal{X}$, which incurs a computational cost of $2^{|\mathcal{X}|}$ and is infeasible in practice. Nevertheless, modern attribution-based XAI methods [35] have enabled practical approximations of the Shapley value. In this research, we use DeepLIFT [53], a popular XAI tool, to identify the pixels $p$ in an image that positively contribute to the decision of a DNN. Though recent works may be able to identify more precise attributions than DeepLIFT, their computation is usually expensive [8, 35, 53].