Unveiling Hidden DNN Defects with Decision-Based
Metamorphic Testing∗
Yuanyuan Yuan
The Hong Kong University of Science
and Technology
Hong Kong, China
yyuanaq@cse.ust.hk
Qi Pang
The Hong Kong University of Science
and Technology
Hong Kong, China
qpangaa@cse.ust.hk
Shuai Wang†
The Hong Kong University of Science
and Technology
Hong Kong, China
shuaiw@cse.ust.hk
Abstract
Contemporary DNN testing works are frequently conducted using metamorphic testing (MT). In general, de facto MT frameworks mutate DNN input images using semantics-preserving mutations and determine if DNNs can yield consistent predictions. Nevertheless, we find that DNNs may rely on erroneous decisions (certain components of the DNN inputs) to make predictions, which may still yield consistent outputs by chance. Such DNN defects would be neglected by existing MT frameworks. Erroneous decisions, however, would likely result in successive mis-predictions over diverse images that may exist in real-life scenarios.

This research aims to unveil the pervasiveness of hidden DNN defects caused by incorrect DNN decisions (but retaining consistent DNN predictions). To do so, we tailor and optimize modern eXplainable AI (XAI) techniques to identify visual concepts that represent regions in an input image upon which the DNN makes predictions. Then, we extend existing MT-based DNN testing frameworks to check the consistency of DNN decisions made over a test input and its mutated inputs. Our evaluation shows that existing MT frameworks are oblivious to a considerable number of DNN defects caused by erroneous decisions. We conduct human evaluations to justify the validity of our findings and to elucidate their characteristics. Through the lens of DNN decision-based metamorphic relations, we re-examine the effectiveness of metamorphic transformations proposed by existing MT frameworks. We summarize lessons from this study, which can provide insights and guidelines for future DNN testing.
CCS Concepts
• Software and its engineering → Software testing and debugging.
∗The extended version of the ASE 2022 paper [66].
†Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE '22, October 10–14, 2022, Rochester, MI, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9475-8/22/10...$15.00
https://doi.org/10.1145/3551349.3561157
Keywords
Deep learning testing
ACM Reference Format:
Yuanyuan Yuan, Qi Pang, and Shuai Wang. 2022. Unveiling Hidden DNN Defects with Decision-Based Metamorphic Testing. In 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22), October 10–14, 2022, Rochester, MI, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3551349.3561157
1 Introduction
Metamorphic testing (MT) [7] has achieved major success in comprehensively testing deep neural networks (DNNs) without manually annotating test inputs [36]. Given the inherent difficulty of defining explicit testing oracles for DNN models [68], a DNN is often tested using well-designed metamorphic relations (MRs): DNN inputs are mutated into new test cases in a semantics-preserving manner¹, and DNN predictions over an input and its mutated inputs are compared for consistency. DNN defects are characterized as violations of DNN prediction consistency. However, despite the major success of checking prediction consistency, we pose the following key question to motivate this research:

Is it always the case that a consistent prediction indicates no DNN defect?

¹In this paper, semantics-preserving denotes that the contents in inputs and mutated inputs are visually consistent, e.g., a cat is still a cat.
In this research, we refer to the DNN's focus on critical input components as its decisions. Accordingly, the DNN relies on such decisions to make predictions (i.e., its outputs), e.g., classifying an input image. Then, consider Fig. 1, in which we illustrate how a contemporary MT framework misses a DNN defect. The tested DNN predicts “hummingbird” for Fig. 1(a), and its utilized decisions in Fig. 1(a) are marked in Fig. 1(b), depicting the correct scope of a hummingbird's head and body. When Fig. 1(a) is rotated as in Fig. 1(c), the DNN still predicts “hummingbird.” Thus, existing MRs based on DNN output consistency would regard the DNN as “correct” for this case. Nevertheless, as in Fig. 1(d), the underlying DNN decision is specious, as it is based on a flower whose contour is similar to the contour of the flying hummingbird in Fig. 1(b).
Figure 1: The DNN makes inconsistent, erroneous decisions while happening to retain the same prediction. (a) hummingbird image; (b) decision based on the hummingbird (predict hummingbird based on a hummingbird); (c) rotated image; (d) incorrect decision (predict hummingbird based on a flower). We simplify the decision regions for readability.

Our preliminary study shows that existing MT-based DNN testing frameworks, when only checking the consistency of DNN predictions, may overlook DNN defects due to incorrect DNN decisions, i.e., relying on specious components in the DNN inputs for predictions. As revealed in this research, such incorrect decisions do not always result in inconsistent DNN outputs, especially when the DNN is trained on a dataset with limited labels (e.g., two-label classification), because the DNN prediction is forced to choose among pre-defined labels. Consider a DNN ϕ that is trained on evenly distributed data and performs a two-class (cat vs. dog) classification task. Assume that when random noise is applied to an image i, ϕ ignores the cat in i and randomly guesses a label; it still has a 50% chance of predicting the correct label. Despite the fact that the tested DNN is flawed, it is nonetheless considered “robust to noise” in many of these cases.
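The 50% figure is easy to reproduce. The following minimal sketch (our illustration; flawed_predict is a purely hypothetical stand-in for such a DNN) simulates the scenario and shows that a prediction-only MR passes about half of the noised inputs:

import random

random.seed(0)

def flawed_predict(image: dict, noisy: bool) -> str:
    # Hypothetical flawed two-class DNN: correct on clean inputs, but it
    # ignores the image content and guesses uniformly once noise is added.
    return random.choice(["cat", "dog"]) if noisy else image["label"]

trials, consistent = 10_000, 0
for _ in range(trials):
    image = {"label": "cat"}
    l1 = flawed_predict(image, noisy=False)   # source input
    l2 = flawed_predict(image, noisy=True)    # mutated (noised) input
    consistent += (l1 == l2)

# Prints roughly 50%: half of the mutated inputs keep the prediction by
# chance, so a prediction-only MR would deem the DNN "robust" on them.
print(f"prediction-consistent: {consistent / trials:.1%}")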
Moreover, we clarify that specious DNN decisions are hard to detect using only the DNN predictions and confidence scores. A well-trained DNN predicts a pre-defined label with a confidence score that is typically much higher than those of the other labels, even when given random noise as input. Thus, even if a DNN is making erroneous decisions, its outputs and accompanying confidence scores often lack an evident “pattern.” This difficulty is also highlighted in prior literature [22, 43].
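This behavior can be probed directly. The sketch below is our probe, not an experiment from the paper; exact scores vary by model. It feeds random noise to a pretrained torchvision classifier and reports the top-1 softmax scores, which are typically far above the 1/1000 uniform baseline for ImageNet:

import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
noise = torch.rand(8, 3, 224, 224)   # eight "empty" images of random pixels
with torch.no_grad():
    probs = torch.softmax(model(noise), dim=1)
# Neither the predicted label nor the confidence score exposes that the
# decision behind the prediction is specious.
print([f"{p:.3f}" for p in probs.max(dim=1).values.tolist()])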
Overall, this work deems the inconsistent DNN decisions exposed by MT specious and undesirable, as it is likely that a DNN, even if it happens to correctly label an image (Fig. 1(c)), will eventually mis-predict hummingbirds in the pervasive images existing in real-world scenarios. As a result, we argue that failing to account for DNN decision defects may jeopardize the reliability of present MT frameworks. We advocate that proper MRs should take DNN decisions into consideration, rather than merely checking DNN predictions.
This work advocates extending DNN prediction-based consistency checking, which is extensively used in current MT, with decision-based consistency checking. The enhancement is orthogonal to the particular metamorphic transformations (e.g., image pixel or affine transformations) implemented in existing MT-based DNN testing frameworks, and can be smoothly incorporated by them. Given a test image i, we extract the decision, denoting regions in i, to depict how the DNN makes predictions over i. Each region is referred to as a visual concept (e.g., a nose or a wheel), and DNN predictions can be formulated as a voting scheme among visual concepts [18, 34]. To obtain visual concepts, we first use eXplainable AI (XAI) techniques to identify pixels in i that positively contribute to the DNN prediction. Then, we tailor and optimize a set of image processing techniques to construct visual concepts from XAI-identified pixels. We carefully reduce the inherent inaccuracy of XAI techniques, and largely enhance the readability of the identified visual concepts.
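For concreteness, the following minimal Python sketch (ours, not part of the released artifact [1]) shows where decision-based checking slots into an MT loop; extract_decision and the plain set-equality comparison are placeholders for the XAI-based concept extraction and matching detailed in Sec. 4:

from typing import Any, Callable, Set

def decision_mr_check(
    predict: Callable[[Any], str],                # DNN: input -> label
    extract_decision: Callable[[Any], Set[str]],  # XAI: input -> concepts
    mutate: Callable[[Any], Any],                 # semantics-preserving MRt
    i1: Any,
) -> str:
    i2 = mutate(i1)
    l1, l2 = predict(i1), predict(i2)
    d1, d2 = extract_decision(i1), extract_decision(i2)
    if l1 != l2:
        return "defect: inconsistent predictions (existing MRr)"
    if d1 != d2:
        return "hidden defect: inconsistent decisions (our MRr)"
    return "consistent"

# Toy stand-ins mirroring Fig. 1: the label survives rotation, but the
# decision flips from the hummingbird to a flower.
predict = lambda x: "hummingbird"
concepts = lambda x: {"hummingbird"} if x == "original" else {"flower"}
print(decision_mr_check(predict, concepts, lambda x: "rotated", "original"))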
By extending existing MT frameworks to support decision-based consistency checking, we uncover many overlooked defects triggered by inputs that result in inconsistent decisions but identical predictions. Our findings are justified by large-scale and comprehensive (in total 10,000) human evaluations, where the participants are 15 Ph.D. students with research experience related to DNNs and 10 other Ph.D. and masters students of various backgrounds. Our study encompasses ten DNNs over three datasets of different scales and types (e.g., RGB and black-and-white images), all of which are popular in daily usage and have been extensively tested by previous DNN testing research. We summarize key lessons of this research, illustrating that existing MT, when only checking DNN prediction consistency, may over-estimate the reliability of DNNs. We also assess the strength of metamorphic transformations (e.g., pixel mutation vs. adversarial perturbation) proposed by existing work through the lens of our novel DNN decision view. Our findings and summarized lessons can provide insights for follow-up enhancement of DNN testing. In sum, we make the following contributions:
• We advocate that existing MT-based DNN testing should consider how a DNN makes decisions rather than merely checking predictions. Accordingly, we extend existing MRs by checking decision consistency to reveal DNN defects overlooked by existing works.
• Technically, we recast a DNN prediction as the outcome of a voting process among visual concepts in an input. We tailor and optimize image processing schemes to summarize visual concepts from image pixels positively contributing to DNN predictions.
• Our study and human evaluation illustrate that many defects have been overlooked when only checking DNN prediction consistency. Our findings provide guidelines for users to calibrate MT-based DNN testing results, and also highlight further improvements that can be made in DNN testing.
Artifact Availability. To support results verification and follow-up research comparison, we released code, data, and supplementary materials at https://github.com/Yuanyuan-Yuan/Decision-Oracle [1].
2 Preliminary and Motivation
2.1 Metamorphic Testing
DNNs are typically used to answer unknown questions, where they are anticipated to behave similarly to humans [68]. Given the diversity of possible inputs encountered in real-life scenarios, obtaining ground truth predictions in advance to assess DNN correctness is difficult, if not impossible. Furthermore, even human experts may disagree on the expected outputs of certain edge cases.
MT is extensively employed to test DNNs without the need for ground truth or explicitly defined testing oracles [7]. Overall, each MR in MT composes a metamorphic transformation MRt and a relation MRr: each MRt specifies a mutation scheme over a source input to generate a follow-up test input, and the associated MRr defines the relationship of expected outputs over the source and the mutated input [49]. For instance, to test sin(x), we can construct an MR such that its MRt mutates an input x into π − x, and the MRr checks the equality relation sin(x) = sin(π − x). In real-world usage, MRr usually denotes invariant program properties. MRr should always hold when arbitrarily mutating x using MRt, and a bug in sin(x) is detected whenever MRr is violated.
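As a runnable recap of this MR (our own example code, mirroring the sin(x) description above):

import math, random

def mr_t(x: float) -> float:
    return math.pi - x            # metamorphic transformation: x -> pi - x

def mr_r(y1: float, y2: float, eps: float = 1e-9) -> bool:
    return abs(y1 - y2) < eps     # relation: sin(x) == sin(pi - x)

random.seed(42)
for _ in range(1000):
    x = random.uniform(-100.0, 100.0)
    if not mr_r(math.sin(x), math.sin(mr_t(x))):
        print(f"MR violated at x={x}: potential bug in sin")
        break
else:
    print("no violation: MRr held for all 1000 mutated inputs")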
MT achieves major success in testing DNN models and infrastructures [10, 13, 14, 33, 37, 38, 42, 59, 62–65, 67, 69]. Given that DNN inputs are often images, MRs in this field are often constructed to perform lightweight, semantics-preserving (visually consistent) image mutations MRt from different angles (see Sec. 5 for a literature review of the MRt designed in previous works). MRr is defined in a simple and unified manner such that DNN predictions should be consistent over an input image and its follow-up image generated using MRt. Thus, violations of MRr, denoting inconsistent DNN predictions, are DNN defects.
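In code, such a prediction-based MRr check is only a few lines. The sketch below is ours, assuming a pretrained torchvision classifier and a rotation MRt; the random tensor is a placeholder for a real, properly normalized test image:

import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def predict(x: torch.Tensor) -> int:
    with torch.no_grad():
        return model(x).argmax(dim=1).item()

i1 = torch.rand(1, 3, 224, 224)       # placeholder source test image
i2 = TF.rotate(i1, angle=90.0)        # MRt: semantics-preserving rotation
# MRr: predictions over the source and follow-up inputs must agree.
print("defect detected" if predict(i1) != predict(i2) else "consistent")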
Table 1: Four MRr based on DNN decisions (D1, D2) and predictions (L1, L2) over an input and its mutated input.

              ①          ②          ③          ④
Decisions     D1 = D2    D1 ≠ D2    D1 ≠ D2    D1 = D2
Predictions   L1 = L2    L1 ≠ L2    L1 = L2    L1 ≠ L2
No defect?    ✓          ✗          ✗          NA
2.2 Forming MRr with DNN Decisions
Without knowledge of a DNN's decision procedure, we argue that relying merely on its output (which is how existing MRr are formed) may result in the omission of some defects. Given a pair of inputs i1 and i2 (i2 is mutated from i1 using an MRt), suppose the DNN yields prediction L1 based on decision D1, and yields L2 based on decision D2.² Then, we have four combinations of decisions/predictions, as in Table 1. ① denotes a correct prediction (from the perspective of MT), whereas ② represents that the DNN provides inconsistent predictions L1 ≠ L2. As introduced in Sec. 2.1, existing MT frameworks rely on ② to form MRr, and we clarify that ④ is not feasible: D1 = D2 with L1 ≠ L2 violates the nature of a DNN, since the prediction is derived from the decision, and identical decisions cannot yield different predictions.

²D is formed by identifying the DNN's decision over the input i; see Sec. 4 for details.
We explore a new focus to form MRr, as in ③, where DNNs make inconsistent decisions (D1 ≠ D2), but still happen to retain the same prediction (L1 = L2). We deem these hidden DNN defects that are incorrectly overlooked by existing works. Suppose a DNN ϕ answers whether hummingbirds appear in an image. ϕ is trained on a biased dataset where all hummingbirds hover in the air, and therefore, ϕ wrongly relies on “vertical objects” to recognize hummingbirds. The image in Fig. 1(a) is properly predicted by ϕ as “yes” due to the hovering hummingbird. After rotating this image by 90 degrees as in Fig. 1(c), we find that ϕ still responds “yes,” but makes its decision based on the vertically presented flower in Fig. 1(d), which shares a similar contour with most hummingbirds (e.g., by comparing with the contour in Fig. 1(b)). In fact, when we manually retain only the visual concept in Fig. 1(d) and erase the remaining components of the image, we confirm that ϕ still predicts the image as a “hummingbird.” Moreover, while ϕ is obviously susceptible to rotation, MRr based on ② cannot uncover the defect. Nevertheless, MRr based on ③ can unveil this hidden flaw.
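Table 1 and the discussion above reduce to a small lookup; the recap below (our own code) classifies an input pair by decision and prediction consistency, with the hummingbird case landing in ③:

VERDICT = {
    # (decisions equal?, predictions equal?) -> verdict, per Table 1
    (True, True):   "no defect (①)",
    (False, False): "defect: inconsistent predictions (②, existing MT)",
    (False, True):  "hidden defect: specious decision (③, this work)",
    (True, False):  "infeasible for a DNN (④)",
}

def classify(d1, d2, l1, l2) -> str:
    return VERDICT[(d1 == d2, l1 == l2)]

# Fig. 1: decisions differ (hummingbird vs. flower) yet the label persists.
print(classify({"hummingbird"}, {"flower"}, "hummingbird", "hummingbird"))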
Paper Structure. In the rest of this paper, we formulate D in Sec. 2.3 and present technical solutions to constitute D in Sec. 4. We review the literature of MT-based DNN testing and the MRt proposed therein in Sec. 5. Sec. 6 unveils the pervasiveness of hidden defects falling in ③ with empirical results.
Incompliance of Ground Truth. Following the notation above, let the ground truth prediction be L_G. It is widely seen that MT may result in false negatives due to L_G ≠ (L1 = L2); that is, a DNN makes consistent albeit incorrect predictions over i1 and i2. Similarly, let the ground truth decision be D_G; we clarify that false negatives may occur in case D_G ≠ (D1 = D2). This may be due to the incorrect (albeit consistent) decisions made by a DNN, or to analysis errors of our employed XAI algorithms. Overall, MT inherently omits considering D_G ≠ (D1 = D2); detecting such flaws likely requires human annotations, which is highly costly in real-world settings. On the other hand, as empirically assessed in Sec. 6.1, the D obtained in this work is accurate.
2.3 DNN Decision: A Pixel-Based View
We now introduce how a DNN makes decisions. Aligned with previous research, this paper primarily considers testing DNN image classifiers, and the following introduction accordingly uses image classification as an example. Many common DNN tasks are rooted in accurate image classification (see further discussion in Sec. 7). We first define Empty and Valid inputs below.
Denition 1
(
Empty
)
.
An input is empty if its components are mean-
ingless for humans, e.g., an image with random pixel values.
Denition 2
(
Valid
)
.
An input is valid if its components are mean-
ingful for humans, e.g., an image with human-recognizable objects.
Given an empty image ∅, a well-trained DNN ϕ will have to randomly predict a confidence score for each class, and the score for class l is ϕ(∅)_l. A valid input image i can be viewed as introducing the appearances of its components by changing pixel values over ∅, namely, setting i = ∅ + δ. Accordingly, the output confidence score for class l is transformed into ϕ(i)_l = ϕ(∅)_l + Δ_l given all these appearances in the input. The machine learning community generally views this procedure as a collaborative game among the pixels of i [2, 3, 9, 35, 41, 48, 53, 56]. The true contribution of each pixel can be computed via the Shapley value [51], a well-established solution in game theory. We present how to use the Shapley value to attribute Δ_l to δ below in Definition 3. We then discuss its approximation and present a cost analysis.
Denition 3
(
Attribution
)
.
Let each pixel change be
𝛿𝑝
and
Í𝛿𝑝=
𝛿
. Then, an
attribution
of
Δ𝑙
assigns a contribution score
𝑐𝑝
to each
𝛿𝑝, such that Í𝑐𝑝=Δ𝑙, where 𝑝represents one pixel.
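To ground Definition 3, the toy below (ours) computes exact Shapley values for a hypothetical three-pixel “model” and verifies the defining property Σ c_p = Δ_l; even this brute-force enumeration previews the exponential cost discussed next:

from itertools import combinations
from math import factorial

PIXELS = ["p1", "p2", "p3"]

def phi(on: set) -> float:
    # Hypothetical confidence for class l: p1 adds 0.5, p2 adds 0.3,
    # and p1 together with p3 adds another 0.2 (an interaction term).
    score = 0.0
    if "p1" in on: score += 0.5
    if "p2" in on: score += 0.3
    if "p1" in on and "p3" in on: score += 0.2
    return score

def shapley(p: str) -> float:
    others = [q for q in PIXELS if q != p]
    n, c_p = len(PIXELS), 0.0
    for k in range(len(others) + 1):       # enumerate all subsets: 2^|X| cost
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            c_p += weight * (phi(set(subset) | {p}) - phi(set(subset)))
    return c_p

scores = {p: shapley(p) for p in PIXELS}   # contributions c_p
delta_l = phi(set(PIXELS)) - phi(set())    # Delta_l
print(scores, sum(scores.values()), delta_l)  # sum of c_p equals Delta_l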
From Pixel-Wise Contributions to Decision D. A pixel p positively supports the DNN prediction for class l if its contribution c_p > 0. Therefore, collecting all pixels with positive contributions can help scope the decision D upon which the DNN ϕ relies when processing i and predicting l. Instead of using raw pixels, however, we abstract further and group the pixels with positive contributions into visual concepts (e.g., a nose or a wheel) in i, such that a DNN's predictions can be decomposed as a voting scheme among visual concepts. Each decision D comprises all of its visual concepts. We explain how visual concepts are generated from pixels in Sec. 2.4.
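One plausible realization of this grouping (our sketch; the paper's actual concept construction appears in Sec. 4) keeps the positively contributing pixels and treats their connected components as candidate visual concepts, each voting with its summed contribution:

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
attribution = rng.standard_normal((224, 224))  # stand-in per-pixel c_p map
positive = attribution > 0                     # pixels with c_p > 0
labels, n = ndimage.label(positive)            # group adjacent positive pixels
concepts = []
for cid in range(1, n + 1):
    mask = labels == cid
    if mask.sum() < 20:                        # drop tiny, noisy fragments
        continue
    concepts.append({"mask": mask, "vote": float(attribution[mask].sum())})
print(len(concepts), "candidate visual concepts found")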
Approximating Shapley Value in XAI. As aforementioned, each pixel in an image is considered a player in the collaborative game (i.e., making a prediction). Let all pixels in an image be X; calculating the exact Shapley value then requires considering every subset of X, which incurs a computational cost of 2^|X| and is infeasible in practice. Nevertheless, modern attribution-based XAI [35] has enabled practical approximation of the Shapley value. In this research, we use DeepLIFT [53], a popular XAI tool, to identify pixels p in an image that positively contribute to the decision of a DNN. Though recent works may identify more precise attributions than DeepLIFT, their computation is usually expensive [8, 35, 53].
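As an attribution sketch (the paper names DeepLIFT [53] but no specific library; Captum is our assumption here, and the tiny CNN is a stand-in for a real classifier), the positively contributing pixels of Definition 3 can be obtained as follows:

import torch
import torch.nn as nn
from captum.attr import DeepLift

model = nn.Sequential(                          # stand-in image classifier
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).eval()
image = torch.rand(1, 3, 32, 32)                # placeholder valid input i
target = model(image).argmax(dim=1).item()      # predicted class l
baseline = torch.zeros_like(image)              # the "empty" input
attr = DeepLift(model).attribute(image, baselines=baseline, target=target)
positive = attr.sum(dim=1) > 0                  # pixels with c_p > 0
print("positively contributing pixels:", int(positive.sum().item()))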