PUSH-PULL: CHARACTERIZING THE ADVERSARIAL ROBUSTNESS FOR AUDIO-VISUAL
ACTIVE SPEAKER DETECTION
Xuanjun Chen1, Haibin Wu2,3, Helen Meng3,4†, Hung-yi Lee2†, Jyh-Shing Roger Jang1†
1Department of Computer Science and Information Engineering, National Taiwan University
2Graduate Institute of Communication Engineering, National Taiwan University
3Centre for Perceptual and Interactive Intelligence, The Chinese University of Hong Kong
4Human-Computer Communications Laboratory, The Chinese University of Hong Kong
{r09922165, f07921092, hungyilee}@ntu.edu.tw, hmmeng@se.cuhk.edu.hk, jang@csie.ntu.edu.tw
Equal contribution. †Equal correspondence.
ABSTRACT
Audio-visual active speaker detection (AVASD) is well developed and is now an indispensable front-end for several multi-modal applications. However, to the best of our knowledge, the adversarial robustness of AVASD models has not been investigated, let alone effective defenses against such attacks. In this paper, we are the first to reveal the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks through extensive experiments. Moreover, we propose a novel audio-visual interaction loss (AVIL) that makes it difficult for attackers to find feasible adversarial examples within an allocated attack budget. The loss pushes the inter-class embeddings, namely the non-speech and speech clusters, apart so that they are sufficiently disentangled, and pulls the intra-class embeddings as close together as possible to keep them compact. Experimental results show that AVIL outperforms adversarial training by 33.14% mAP under multi-modal attacks.
Index Terms— Audio-visual active speaker detection, multi-modal adversarial attack, adversarial robustness
1. INTRODUCTION
Active Speaker Detection (ASD) seeks to detect who is speaking
in a visual scene containing one or more speakers [1, 2]. Recently,
audio-visual ASD (AVASD), which integrates audio-visual information by learning the relationship between speech and facial motion, effectively improves the performance of ASD and has become increasingly indispensable as a front-end for multi-modal applications. However, to the best of our knowledge, whether AVASD models are robust against adversarial attacks has not been investigated previously, let alone effective defense methods against such multi-modal attacks.
Crafting imperceptible adversarial noise, adding it to clean samples to generate adversarial samples, and then using those samples to manipulate AI models is called an adversarial attack [3]. Previous adversarial attacks usually focus on single-modal applications. For visual-modal attacks, Szegedy et al. [3] first proposed attacking state-of-the-art image classification models in 2013. For the speech modality, models including automatic speaker verification (ASV) systems [4–18], anti-spoofing models for ASV [19–23], and automatic speech recognition models [24–30] are also vulnerable to adversarial attacks. For audio-visual learning, Li et al. [31] studied the audio-visual adversarial robustness of a general sound event detection model, but only considered single- and multi-modal attacks under a single attack method.
Given that AVASD is now ubiquitously deployed as a front-end for a variety of multi-modal downstream models, adversarial noise may manipulate the AVASD front-end into committing errors, which then accumulate and propagate to the downstream applications. Hence, it is a high priority to mitigate the adversarial vulnerability of AVASD models and ensure their robustness against such attacks. This paper investigates the susceptibility of AVASD models to adversarial attacks and then proposes a novel defense method to improve their robustness. Our contributions are twofold: 1) To the best of our knowledge, this is the first work to reveal the vulnerability of AVASD models under three kinds of attacks, namely audio-only, visual-only, and audio-visual adversarial attacks, through extensive experiments. 2) We propose a novel audio-visual interaction loss (AVIL), which pushes the inter-class embeddings, namely the non-speech and speech clusters, apart so that they are sufficiently disentangled, and pulls the intra-class embeddings as close together as possible. Expanding the inter-class dispersion and enhancing the intra-class compactness makes it difficult for attackers to find feasible adversarial samples that cross the decision boundary within the allocated attack budget. The experimental results illustrate that the proposed audio-visual interaction loss effectively strengthens the robustness of AVASD models; a rough sketch of the push-pull idea follows below.
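The exact AVIL formulation is given later in the paper; purely as an illustration of the push-pull idea, the sketch below implements a generic centroid-based dispersion-plus-compactness loss in PyTorch. The function name, the margin hyper-parameter, and the centroid formulation are our own assumptions, not the paper's definition.

import torch

def push_pull_loss(emb, labels, margin=10.0):
    # Illustrative push-pull loss, not the paper's exact AVIL.
    # emb:    (N, D) frame-level embeddings
    # labels: (N,)   binary labels, 1 = speaking, 0 = not speaking
    # Assumes both classes are present in the batch.
    speech = emb[labels == 1]
    nonspeech = emb[labels == 0]
    c_s = speech.mean(dim=0)       # speech centroid
    c_n = nonspeech.mean(dim=0)    # non-speech centroid

    # Pull: mean squared distance of each embedding to its own centroid
    # (intra-class compactness).
    pull = ((speech - c_s).pow(2).sum(dim=1).mean()
            + (nonspeech - c_n).pow(2).sum(dim=1).mean())

    # Push: hinge penalty when the two centroids are closer than margin
    # (inter-class dispersion).
    push = torch.relu(margin - (c_s - c_n).norm())
    return pull + push

The pull term rewards intra-class compactness, while the hinge push term enforces inter-class dispersion up to a margin, so an attacker needs a larger perturbation to move an embedding across the decision boundary.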
2. BACKGROUND
2.1. Audio-Visual Active Speaker Detection
The ASD task has been studied using audio, video, or the fusion of
both. For audio, the voice activity detector [32, 33] is often used to
detect the presence of speech. However, in real-world scenarios, the
speech signal from the microphones is easily mixed with overlapping speech and background noise, which hinders the effectiveness of voice activity detection. Visual approaches [34, 35] mainly analyze a person's face and upper body to determine whether the person is speaking, but their performance is limited by non-speech activities, e.g., licking lips, eating, and grinning. Audio-visual processing combines the audio and visual streams [36, 37] and allows the model to learn the cross-modal relationship between speech and facial motion. With valuable support from sizeable datasets, e.g., AVA-ActiveSpeaker, and the AVA Challenge series running since 2018, a variety of high-performance models for AVASD have emerged recently [38–48].
[Figure 1 appears here. Panel (a): the TalkNet framework, with an audio temporal encoder and a visual temporal encoder as the feature extraction front-end, followed by two cross-attention blocks, concatenation, and a self-attention block as the speaker detection back-end, producing per-frame label sequences such as 1 1 0 ... 0. Panel (b): the audio-visual attack framework applied to the AVASD model.]

Fig. 1. (a) The TalkNet framework. $x_a$ and $x_v$ are the audio and visual inputs, respectively; $\oplus$ denotes the concatenation procedure. $L_{CE_a}$, $L_{CE_v}$, and $L_{CE_{av}}$ are the cross-entropy losses for the audio-only, visual-only, and audio-visual prediction heads, respectively. (b) The audio-visual attack framework for AVASD. $x_a$ and $x_v$ are the audio and visual samples, respectively, and $y$ is the ground truth for the multi-sensory input $\{x_a, x_v\}$. $\delta_a$ and $\delta_v$ are the adversarial perturbations for $x_a$ and $x_v$, respectively, and $\tilde{y}$ is the prediction for the adversarial samples $\{\tilde{x}_a, \tilde{x}_v\}$. The adversarial attack aims at maximizing the difference between $y$ and $\tilde{y}$.
In real-world user authentication systems, AVASD can be used as a front-end task to support security verification for speaker verification [49]. For an AVASD system, there are four typical cases: speech without the target speaker, no audible speaker, a speaker without speech, and speech with the target speaker. Only speech with the target speaker is labeled as speaking. Attackers may use single-modal attack methods, or combine them, to make AVASD produce wrong predictions in the other three cases, which is dangerous. However, the adversarial robustness of AVASD models has not yet been investigated.
2.2. Adversarial Attacks
An adversarial attack manipulates a well-trained model into giving wrong predictions via an adversarial sample that is imperceptible to humans compared with the original (unmodified) counterpart. Mathematically, given a clean sample $x$ and the ground-truth label $y$, attack algorithms seek a sufficiently small perturbation $\delta$ such that $\tilde{x} = x + \delta$, where $\tilde{x}$ is the adversarial sample that can fool the model into producing the wrong prediction $\tilde{y}$. A suitable $\delta$ can be found by solving the following objective function:
$$\delta^{*} = \arg\max_{\delta} \, \mathcal{L}(\tilde{x}, y, \theta), \quad \text{s.t. } \|\delta\|_{p} \le \epsilon, \tag{1}$$
where $\mathcal{L}(\cdot)$ denotes the objective function pre-defined by the attackers, which is usually set to maximize the difference between $y$ and the model's final prediction given $\tilde{x}$; $\epsilon$ is the allowed perturbation budget; and $\|\cdot\|_{p}$ denotes the $p$-norm, with the feasible region usually taken to be an $\ell_{\infty}$-ball or $\ell_{2}$-ball centered at $x$. We evaluate the AVASD models' vulnerability with $\ell_{\infty}$-bounded adversarial noise, as it is widely used as a standard evaluation boundary for adversarial robustness [50]. To solve the optimization problem in Equation (1), we choose three widely used attack methods to evaluate the robustness of AVASD models; their details are summarized below.
Basic Iterative Method (BIM). BIM [51] iteratively updates the adversarial example as follows:
$$x_{m} = \mathrm{clip}_{\epsilon}\big(x_{m-1} + \alpha \cdot \mathrm{sign}\big(\nabla_{x_{m-1}} \mathcal{L}(x_{m-1}, y, \theta)\big)\big), \quad \text{for } m = 1, \dots, M, \tag{2}$$
where $x_{0}$ starts from the original sample $x$, $\alpha$ is the step size, $M$ is the number of iterations, and the $\mathrm{clip}_{\epsilon}(\cdot)$ function applies element-wise clipping so that $\|x_{m} - x\|_{\infty} \le \epsilon$, with $\epsilon \ge 0$. The perturbed example $x_{M}$ is the final adversarial example.
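As a concrete illustration, below is a minimal PyTorch sketch of BIM; the model and loss_fn callables and the hyper-parameter defaults are hypothetical placeholders, not the paper's implementation. For AVASD the same loop would perturb the audio and visual inputs jointly, and clipping to the valid input range is omitted here.

import torch

def bim_attack(model, loss_fn, x, y, eps=8/255, alpha=2/255, steps=10):
    # BIM: repeated signed-gradient ascent steps, each followed by
    # element-wise projection into the l-infinity eps-ball around x.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # ascend the loss
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project to eps-ball
    return x_adv.detach()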
Momentum-based Iterative Method (MIM). MIM [52] is an improved version of BIM that introduces a momentum term into the iterative updates, preventing the attack from getting stuck in poor local optima and thus improving the attack performance over BIM.
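Under the same hypothetical names as the BIM sketch above, the MIM update differs only in the direction used for the signed step, accumulating an l1-normalized gradient with a decay factor mu:

import torch

def mim_attack(model, loss_fn, x, y, eps=8/255, alpha=2/255, steps=10, mu=1.0):
    # BIM with a momentum buffer over l1-normalized gradients.
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)  # momentum buffer
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        g = mu * g + grad / grad.abs().sum().clamp_min(1e-12)
        x_adv = x_adv.detach() + alpha * g.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project to eps-ball
    return x_adv.detach()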
Projected Gradient Descent (PGD). PGD [50] is also a variant of BIM. PGD randomly initializes the adversarial noise $\delta$ $\gamma$ times and conducts BIM-style attacks to generate $\gamma$ candidates of adversarial noise. Finally, the candidate with the best attack performance is chosen to produce the final adversarial sample.
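A corresponding sketch of PGD with random restarts, again with assumed names and hyper-parameters rather than the paper's actual configuration:

import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, alpha=2/255,
               steps=10, restarts=5):
    # PGD: several random starts inside the eps-ball, BIM-style
    # iterations, keep the candidate with the highest loss.
    best_x, best_loss = x.clone().detach(), -float("inf")
    for _ in range(restarts):
        # Random initialization inside the l-infinity eps-ball.
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = loss_fn(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)
        final_loss = loss_fn(model(x_adv), y).item()
        if final_loss > best_loss:
            best_x, best_loss = x_adv.detach(), final_loss
    return best_x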
3. METHODOLOGY
3.1. AVASD Model – TalkNet
We adopt TalkNet [48] as our case study to characterize the adversarial robustness of AVASD. TalkNet is a fully end-to-end, state-of-the-art model for AVASD. It takes as inputs a sequence of video frames $x_v$ consisting of cropped face sequences and the corresponding audio sequence $x_a$. The output probability denotes how likely the person is speaking in the given video frame. TalkNet comprises a feature representation front-end and a speaker detection back-end classifier, as shown in Fig. 1(a). The front-end consists of an audio temporal encoder and a video temporal encoder that extract audio embeddings $e_{a,i}$ and visual embeddings $e_{v,i}$ for the $i$-th frame. In the back-end, the audio and visual embeddings are aligned via inter-modality cross-attention and then concatenated to obtain the joint audio-visual embeddings $z_{a,i}$ and $z_{v,i}$ for the $i$-th frame. A self-attention network is then applied after the cross-attention network to model the audio-visual temporal information. Finally, a fully-connected layer with a softmax projects the output of the self-attention network to a sequence of ASD labels; a simplified sketch of this back-end appears below.
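The following is a heavily simplified, hypothetical sketch of the back-end wiring in PyTorch; the embedding size, number of attention heads, and module choices are our own assumptions for illustration, not TalkNet's actual implementation.

import torch
import torch.nn as nn

class SpeakerDetectionBackend(nn.Module):
    # Simplified TalkNet-style back-end: cross-attention aligns the
    # two modalities, concatenation joins them, self-attention models
    # temporal context, and a linear head emits per-frame scores.

    def __init__(self, d=128):
        super().__init__()
        self.av_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.va_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(2 * d, nhead=8,
                                                   batch_first=True)
        self.head = nn.Linear(2 * d, 2)  # not-speaking / speaking logits

    def forward(self, e_a, e_v):
        # e_a, e_v: (B, T, d) audio / visual frame embeddings.
        z_a, _ = self.av_attn(e_a, e_v, e_v)  # audio attends to visual
        z_v, _ = self.va_attn(e_v, e_a, e_a)  # visual attends to audio
        z = torch.cat([z_a, z_v], dim=-1)     # joint embedding (B, T, 2d)
        z = self.temporal(z)                  # audio-visual self-attention
        probs = self.head(z).softmax(dim=-1)
        return probs[..., 1]                  # s_t: per-frame speaking score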
The predicted label sequence is compared with the ground-truth label sequence via the cross-entropy loss $L_{CE_{av}}$:
$$L_{CE_{av}} = -\frac{1}{T} \sum_{t=1}^{T} \big(y_t \cdot \log s_t + (1 - y_t) \cdot \log(1 - s_t)\big), \tag{3}$$
where $y_t$ and $s_t$ are the ground-truth label and the predicted score for the $t$-th frame, and $T$ is the total number of frames in one video sample. During training, TalkNet utilizes two additional prediction heads for the audio-only and visual-only streams, supervised by the cross-entropy losses $L_{CE_a}$ and $L_{CE_v}$ shown in Fig. 1(a).
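As a sanity check on Eq. (3), the loss can be computed directly from per-frame scores; the function below is a minimal sketch, where the name and the small epsilon guard are our own additions.

import torch

def frame_level_bce(scores, labels):
    # Eq. (3): scores s_t and labels y_t are (T,) tensors for one video;
    # eps guards log(0) and is a numerical addition of our own.
    eps = 1e-12
    return -(labels * torch.log(scores + eps)
             + (1 - labels) * torch.log(1 - scores + eps)).mean()

Up to the epsilon guard, this matches torch.nn.functional.binary_cross_entropy(scores, labels).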