
PUSH-PULL: CHARACTERIZING THE ADVERSARIAL ROBUSTNESS FOR AUDIO-VISUAL
ACTIVE SPEAKER DETECTION
Xuanjun Chen1∗, Haibin Wu23∗, Helen Meng34†, Hung-yi Lee2†, Jyh-Shing Roger Jang1†
1Department of Computer Science and Information Engineering, National Taiwan University
2Graduate Institute of Communication Engineering, National Taiwan University
3Centre for Perceptual and Interactive Intelligence, The Chinese University of Hong Kong
4Human-Computer Communications Laboratory, The Chinese University of Hong Kong
{r09922165, f07921092, hungyilee}@ntu.edu.tw, hmmeng@se.cuhk.edu.hk, jang@csie.ntu.edu.tw
ABSTRACT
Audio-visual active speaker detection (AVASD) is well-developed
and is now an indispensable front-end for several multi-modal
applications. However, to the best of our knowledge, the adversarial
robustness of AVASD models has not been investigated, let alone
effective defenses against such attacks. In this paper, we
are the first to reveal the vulnerability of AVASD models under
audio-only, visual-only, and audio-visual adversarial attacks through
extensive experiments. What’s more, we also propose a novel audio-
visual interaction loss (AVIL) for making attackers difficult to find
feasible adversarial examples under an allocated attack budget. The
loss aims at pushing the inter-class embeddings to be dispersed,
namely non-speech and speech clusters, sufficiently disentangled,
and pulling the intra-class embeddings as close as possible to keep
them compact. Experimental results show the AVIL outperforms the
adversarial training by 33.14 mAP (%) under multi-modal attacks.
Index Terms—Audio-visual active speaker detection, multi-
modal adversarial attack, adversarial robustness
1. INTRODUCTION
Active Speaker Detection (ASD) seeks to detect who is speaking
in a visual scene containing one or more speakers [1, 2]. Recently,
audio-visual ASD (AVASD), which integrates audio-visual informa-
tion by learning the relationship between speech and facial motion,
effectively improves the performance of ASD, and AVASD has become
an indispensable front-end for multi-modal applications. However,
to the best of our knowledge, whether AVASD models are robust
against adversarial attacks has not been investigated previously,
let alone effective defense methods against such multi-modal attacks.
Crafting indistinguishable adversarial noise, adding it to clean
samples to generate adversarial samples, and then using such samples
to manipulate AI models is called an adversarial attack [3].
Previous adversarial attacks usually focus on single-modal applications.
For visual-modal attacks, Szegedy et al. [3] first proposed attacking
state-of-the-art image classification models in 2013. For
the speech modality, models including automatic speaker verifica-
tion (ASV) systems [4–18], anti-spoofing models for ASV [19–23],
and automatic speech recognition models [24–30] are also vulnera-
ble to adversarial attacks. For audio-visual learning, Li et al. [31]
studied the audio-visual adversarial robustness of a general sound
event detection model, but only considered single- and multi-modal
attacks under a single attack method.
∗Equal contribution. †Equal correspondence.
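For illustration, the attack procedure described above can be sketched as a projected-gradient-descent (PGD) style loop that perturbs both modalities under an L∞ budget. The toy linear audio-visual scorer, feature shapes, and budget values below are hypothetical, not the actual AVASD models or attack settings evaluated in this paper.

```python
import numpy as np

def pgd_linear(w_a, w_v, x_a, x_v, y, eps=0.1, alpha=0.02, steps=10):
    """Toy multi-modal PGD sketch on a linear scorer.

    score = w_a . x_a + w_v . x_v is a hypothetical speech/non-speech logit;
    each step ascends the sign of the BCE-loss gradient for both modalities,
    then projects the perturbation back into the L-inf ball of radius eps.
    """
    a, v = x_a.astype(float).copy(), x_v.astype(float).copy()
    for _ in range(steps):
        score = w_a @ a + w_v @ v
        p = 1.0 / (1.0 + np.exp(-score))   # sigmoid probability of "speech"
        g = p - y                          # dLoss/dscore for binary cross-entropy
        # gradient-sign step per modality, then clip the total perturbation
        a = x_a + np.clip(a + alpha * np.sign(g * w_a) - x_a, -eps, eps)
        v = x_v + np.clip(v + alpha * np.sign(g * w_v) - x_v, -eps, eps)
    return a, v
```

A multi-modal attack perturbs both inputs jointly, while audio-only or visual-only attacks would simply freeze one of the two update steps.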
Given that AVASD is now ubiquitously deployed as a front-end for a
variety of multi-modal downstream models, adversarial noise may
manipulate the AVASD front-end into committing errors that accumulate
and propagate to the downstream applications. Hence it is a high
priority to mitigate the adversarial vulnerability of AVASD and ensure
robustness against such attacks. This paper investigates the
susceptibility of AVASD models to adversarial attacks and then
proposes a novel defense method to improve their robustness. Our
contributions are twofold: 1) To the best of our knowledge, this is
the first work to reveal
the vulnerability of AVASD models under three kinds of attacks, in-
cluding audio-only, visual-only, and audio-visual adversarial attacks
through extensive experiments. 2) We propose a novel audio-visual
interaction loss (AVIL), which pushes the inter-class embeddings,
namely the non-speech and speech clusters, to be sufficiently
dispersed and disentangled, while pulling the intra-class embeddings
as close together as possible. Expanding the inter-class dispersion
and enhancing the intra-class compactness make it difficult for
attackers to find feasible adversarial samples that cross the
decision boundary within the allocated attack budget. The
experimental results illustrate that the proposed audio-visual
interaction loss effectively strengthens the robustness of AVASD models.
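The push-pull intuition behind AVIL can be illustrated with a minimal centroid-based sketch: pull each embedding toward its class centroid (intra-class compactness) and push the two class centroids apart up to a margin (inter-class dispersion). The specific form below is an assumption for illustration, not the exact loss defined by this paper.

```python
import numpy as np

def push_pull_loss(emb, labels, margin=1.0):
    """Hypothetical push-pull objective sketch.

    emb    : (N, D) array of frame embeddings
    labels : (N,) array with 1 = speech, 0 = non-speech
    """
    c_speech = emb[labels == 1].mean(axis=0)
    c_nonspeech = emb[labels == 0].mean(axis=0)
    # "pull": mean squared distance of each embedding to its own class centroid
    own_centroid = np.where(labels[:, None] == 1, c_speech, c_nonspeech)
    pull = np.mean(np.sum((emb - own_centroid) ** 2, axis=1))
    # "push": hinge penalty when the two centroids are closer than the margin
    push = max(0.0, margin - np.linalg.norm(c_speech - c_nonspeech)) ** 2
    return pull + push
```

Under this sketch, well-separated and compact speech/non-speech clusters incur a low loss, so minimizing it leaves less room inside the attack budget for an adversarial sample to cross the decision boundary.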
2. BACKGROUND
2.1. Audio-Visual Active Speaker Detection
The ASD task has been studied using audio, video, or the fusion of
both. For audio, the voice activity detector [32, 33] is often used to
detect the presence of speech. However, in real-world scenarios, the
speech signal from the microphones is easily mixed with overlapping
speech and background noise, which will hinder the effectiveness of
voice activity detection. The visual part [34,35] mainly analyzes the
face and upper body of a person to determine whether the person is
speaking, but the performance is limited due to some non-speech ac-
tivities, e.g., licking lips, eating, and grinning. Audio-visual
processing combines the audio and visual streams [36, 37] and allows
learning cross-modal relationships between speech and facial motion.
With valuable support from sizeable datasets, e.g., the
AVA-ActiveSpeaker dataset, and the AVA Challenge series running since
2018, a variety of high-performance models for
arXiv:2210.00753v1 [cs.SD] 3 Oct 2022