
PUSH-PULL: CHARACTERIZING THE ADVERSARIAL ROBUSTNESS FOR AUDIO-VISUAL
ACTIVE SPEAKER DETECTION
Xuanjun Chen1∗, Haibin Wu23∗, Helen Meng34†, Hung-yi Lee2†, Jyh-Shing Roger Jang1†
1Department of Computer Science and Information Engineering, National Taiwan University
2Graduate Institute of Communication Engineering, National Taiwan University
3Centre for Perceptual and Interactive Intelligence, The Chinese University of Hong Kong
4Human-Computer Communications Laboratory, The Chinese University of Hong Kong
{r09922165, f07921092, hungyilee}@ntu.edu.tw, hmmeng@se.cuhk.edu.hk, jang@csie.ntu.edu.tw
ABSTRACT
Audio-visual active speaker detection (AVASD) is well-developed
and is now an indispensable front-end for several multi-modal
applications. However, to the best of our knowledge, the adversarial
robustness of AVASD models has not been investigated, let alone
effective defenses against such attacks. In this paper, we
are the first to reveal the vulnerability of AVASD models under
audio-only, visual-only, and audio-visual adversarial attacks through
extensive experiments. What’s more, we also propose a novel audio-
visual interaction loss (AVIL) for making attackers difficult to find
feasible adversarial examples under an allocated attack budget. The
loss aims at pushing the inter-class embeddings to be dispersed,
namely non-speech and speech clusters, sufficiently disentangled,
and pulling the intra-class embeddings as close as possible to keep
them compact. Experimental results show the AVIL outperforms the
adversarial training by 33.14 mAP (%) under multi-modal attacks.
Index Terms—Audio-visual active speaker detection, multi-
modal adversarial attack, adversarial robustness
1. INTRODUCTION
Active Speaker Detection (ASD) seeks to detect who is speaking
in a visual scene containing one or more speakers [1, 2]. Recently,
audio-visual ASD (AVASD), which integrates audio-visual informa-
tion by learning the relationship between speech and facial motion,
effectively improves the performance of ASD, and AVASD has become
an indispensable front-end for multi-modal applications. However,
to the best of our knowledge, whether AVASD models are robust
against adversarial attacks has not been investigated previously,
let alone effective defense methods against such multi-modal attacks.
Crafting indistinguishable adversarial noise, adding it to clean
samples to generate adversarial samples, and then using such samples
to manipulate AI models is called an adversarial attack [3].
Previous adversarial attacks usually focus on single-modal applications.
For visual-modal attacks, Szegedy et al. [3] first proposed attacking
state-of-the-art image classification models in 2013. For
the speech modality, models including automatic speaker verifica-
tion (ASV) systems [4–18], anti-spoofing models for ASV [19–23],
and automatic speech recognition models [24–30] are also vulnera-
ble to adversarial attacks. For audio-visual learning, Li et al. [31]
studied the audio-visual adversarial robustness of a general sound
event detection model, but only considered single- and multi-modal
attacks under a single attack method.
∗Equal contribution. †Equal correspondence.
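For illustration, the attack procedure described above can be sketched as a projected-gradient-descent (PGD) style loop that perturbs both modalities under an L∞ budget. The toy linear audio-visual scorer, feature shapes, and budget values below are hypothetical, not the actual AVASD models or attack settings evaluated in this paper.

```python
import numpy as np

def pgd_linear(w_a, w_v, x_a, x_v, y, eps=0.1, alpha=0.02, steps=10):
    """Toy multi-modal PGD sketch on a linear scorer.

    score = w_a . x_a + w_v . x_v is a hypothetical speech/non-speech logit;
    each step ascends the sign of the BCE-loss gradient for both modalities,
    then projects the perturbation back into the L-inf ball of radius eps.
    """
    a, v = x_a.astype(float).copy(), x_v.astype(float).copy()
    for _ in range(steps):
        score = w_a @ a + w_v @ v
        p = 1.0 / (1.0 + np.exp(-score))   # sigmoid probability of "speech"
        g = p - y                          # dLoss/dscore for binary cross-entropy
        # gradient-sign step per modality, then clip the total perturbation
        a = x_a + np.clip(a + alpha * np.sign(g * w_a) - x_a, -eps, eps)
        v = x_v + np.clip(v + alpha * np.sign(g * w_v) - x_v, -eps, eps)
    return a, v
```

A multi-modal attack perturbs both inputs jointly, while audio-only or visual-only attacks would simply freeze one of the two update steps.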
Given that AVASD is now ubiquitously deployed as a front-end for a
variety of multi-modal downstream models, adversarial noise may
manipulate the AVASD front-end into committing errors that accumulate
and propagate to the downstream applications. Hence it is a high
priority to mitigate the adversarial vulnerability of AVASD and ensure
robustness against such attacks. This paper investigates the
susceptibility of AVASD models to adversarial attacks and then
proposes a novel defense method to improve their robustness. Our
contributions are twofold: 1) To the best of our knowledge, this is
the first work to reveal
the vulnerability of AVASD models under three kinds of attacks, in-
cluding audio-only, visual-only, and audio-visual adversarial attacks
through extensive experiments. 2) We propose a novel audio-visual
interaction loss (AVIL), which pushes the inter-class embeddings,
namely the non-speech and speech clusters, to be sufficiently
dispersed and disentangled, while pulling the intra-class embeddings
as close together as possible. Expanding the inter-class dispersion
and enhancing the intra-class compactness make it difficult for
attackers to find feasible adversarial samples that cross the
decision boundary within the allocated attack budget. The
experimental results illustrate that the proposed audio-visual
interaction loss effectively strengthens the robustness of AVASD models.
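The push-pull intuition behind AVIL can be illustrated with a minimal centroid-based sketch: pull each embedding toward its class centroid (intra-class compactness) and push the two class centroids apart up to a margin (inter-class dispersion). The specific form below is an assumption for illustration, not the exact loss defined by this paper.

```python
import numpy as np

def push_pull_loss(emb, labels, margin=1.0):
    """Hypothetical push-pull objective sketch.

    emb    : (N, D) array of frame embeddings
    labels : (N,) array with 1 = speech, 0 = non-speech
    """
    c_speech = emb[labels == 1].mean(axis=0)
    c_nonspeech = emb[labels == 0].mean(axis=0)
    # "pull": mean squared distance of each embedding to its own class centroid
    own_centroid = np.where(labels[:, None] == 1, c_speech, c_nonspeech)
    pull = np.mean(np.sum((emb - own_centroid) ** 2, axis=1))
    # "push": hinge penalty when the two centroids are closer than the margin
    push = max(0.0, margin - np.linalg.norm(c_speech - c_nonspeech)) ** 2
    return pull + push
```

Under this sketch, well-separated and compact speech/non-speech clusters incur a low loss, so minimizing it leaves less room inside the attack budget for an adversarial sample to cross the decision boundary.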
2. BACKGROUND
2.1. Audio-Visual Active Speaker Detection
The ASD task has been studied using audio, video, or the fusion of
both. For audio, the voice activity detector [32, 33] is often used to
detect the presence of speech. However, in real-world scenarios, the
speech signal from the microphones is easily mixed with overlapping
speech and background noise, which will hinder the effectiveness of
voice activity detection. The visual part [34,35] mainly analyzes the
face and upper body of a person to determine whether the person is
speaking, but the performance is limited due to some non-speech ac-
tivities, e.g., licking lips, eating, and grinning. Audio-visual
processing combines the audio and visual streams [36, 37] and allows
learning cross-modal relationships between speech and facial motion.
With valuable support from sizeable datasets, e.g., the
AVA-ActiveSpeaker dataset, and the AVA Challenge series running since
2018, a variety of high-performance models for
arXiv:2210.00753v1 [cs.SD] 3 Oct 2022