
MULTIMODAL TRANSFORMER DISTILLATION FOR AUDIO-VISUAL SYNCHRONIZATION
Xuanjun Chen1,2, Haibin Wu1, Chung-Che Wang2, Hung-yi Lee1†, Jyh-Shing Roger Jang2†
1Graduate Institute of Communication Engineering, National Taiwan University
2Department of Computer Science and Information Engineering, National Taiwan University
{d12942018, f07921092, hungyilee}@ntu.edu.tw, geniusturtle6174@gmail.com, jang@mirlab.org
ABSTRACT
Audio-visual synchronization aims to determine whether the
mouth movements and speech in the video are synchronized. Vo-
caLiST reaches state-of-the-art performance by incorporating multimodal
Transformers to model audio-visual interaction information. However, it
requires high computing resources, making it impractical for real-world
applications. This paper proposes MTDVocaLiST, a model trained with our
multimodal Transformer distillation (MTD) loss. The MTD loss enables
MTDVocaLiST to deeply mimic the cross-attention distribution and
value-relation in the Transformer of VocaLiST. Additionally, we
harness uncertainty weighting to fully exploit the interaction in-
formation across all layers. Our proposed method is effective in
two aspects: From the distillation method perspective, MTD loss
outperforms other strong distillation baselines. From the distilled
model's performance perspective: 1) MTDVocaLiST outperforms the
similar-size SOTA models SyncNet and Perfect Match by 15.65% and 3.35%,
respectively; 2) MTDVocaLiST reduces the model size of VocaLiST by
83.52% while still maintaining similar performance.
Index Terms—Audio-visual synchronization, Transformer dis-
tillation, knowledge distillation, lightweight model
1. INTRODUCTION
The audio-visual synchronization task is to determine whether the
mouth movements and speech in the video are synchronized. An
out-of-sync video may cause errors in many tasks, such as audio-visual
user authentication [1], dubbing [2], lip reading [3], active speaker
detection [4, 5], and audio-visual source separation [6–9]. An
audio-visual synchronization model often acts as an indispensable
front-end model for these downstream tasks. These downstream tasks often
run on mobile devices, where small model sizes and fast inference speed
are required to ensure a good user experience, for example when correcting
the synchronization error of user-generated videos on mobile phones or
performing audio-visual user authentication in mobile finance applications
[1]. To support these applications, a lightweight audio-visual
synchronization model is worth exploring.
†Equal correspondence. This work was supported by the National Science and
Technology Council, Taiwan (Grant no. NSTC 112-2634-F-002-005). We also
thank the National Center for High-performance Computing (NCHC) of National
Applied Research Laboratories (NARLabs) in Taiwan for providing
computational and storage resources. Code has been made available at:
https://github.com/xjchenGit/MTDVocaLiST.
A typical framework for audio-visual synchronization is to estimate the
similarity between audio and visual segments. SyncNet [10] introduced a
two-stream architecture to estimate cross-modal feature similarities; it is
trained to maximize the similarities between features of in-sync
audio-visual segments and minimize the similarities between features of
out-of-sync audio-visual segments. Perfect Match (PM) [3, 11] optimizes the
relative similarities between multiple audio features and one visual
feature with a multi-way matching objective function. Audio-Visual
Synchronisation with Transformers (AVST) [12] and VocaLiST [13], the
current state-of-the-art (SOTA) models, incorporate Transformers [14] to
learn the multi-modal interaction and directly classify whether a given
audio-visual pair is synchronized, achieving excellent performance.
However, both AVST and VocaLiST require large memory and high computing
costs, making them unsuitable for edge-device computation.
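To make these similarity-based objectives concrete, the following Python
(PyTorch-style) sketch illustrates a binary contrastive objective in the
spirit of SyncNet and a multi-way matching objective in the spirit of PM.
The function names, tensor shapes, and the cosine/dot-product similarity
choices are illustrative assumptions, not the exact formulations of [10]
or [3, 11].

import torch
import torch.nn.functional as F

def contrastive_sync_loss(audio_emb, visual_emb, is_sync):
    # audio_emb, visual_emb: (batch, dim) embeddings from the two streams.
    # is_sync: (batch,) float labels, 1.0 for in-sync, 0.0 for out-of-sync.
    # Push in-sync pairs towards high cosine similarity and out-of-sync
    # pairs towards low similarity (illustrative; SyncNet [10] itself uses
    # a contrastive loss on Euclidean distances).
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)
    prob = (sim + 1.0) / 2.0  # map [-1, 1] similarity to a pseudo-probability
    return F.binary_cross_entropy(prob, is_sync)

def multiway_matching_loss(visual_emb, audio_candidates, target_idx):
    # visual_emb:       (batch, dim) one visual embedding per clip.
    # audio_candidates: (batch, num_candidates, dim) audio embeddings at
    #                   several temporal offsets; exactly one is in sync.
    # target_idx:       (batch,) long tensor, index of the in-sync candidate.
    # Relative similarity to every candidate, treated as N-way classification.
    logits = torch.einsum("bd,bnd->bn", visual_emb, audio_candidates)
    return F.cross_entropy(logits, target_idx)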
In this paper, we propose to distill a small-size version of VocaLiST,
namely MTDVocaLiST, by mimicking the multimodal Transformer behavior of
VocaLiST. We further propose to
employ uncertainty weighting, which allows us to assess the vary-
ing significance of Transformer behavior across different layers, re-
sulting in an enhanced MTDVocaLiST model. To our knowledge,
this is the first attempt to distill a model by mimicking multimodal
Transformer behavior for the audio-visual synchronization task. Our model
outperforms the similar-size state-of-the-art models SyncNet and PM by
15.65% and 3.35%, respectively. MTDVocaLiST significantly reduces
VocaLiST's size by 83.52% while maintaining performance comparable to that
of VocaLiST.
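As a rough illustration of the distillation idea (not the exact loss used
by MTDVocaLiST; the released code gives the actual definition), a student
can be trained to match the teacher's cross-attention distributions and
value-relations at every cross-attention layer, with one learnable
uncertainty weight per layer. The KL-divergence formulation, the scaled
dot-product value-relation, and the Kendall-style uncertainty weighting in
this sketch are assumptions for illustration.

import torch
import torch.nn.functional as F
from torch import nn

class UncertaintyWeightedMTDLoss(nn.Module):
    # Sketch of an MTD-style loss: per-layer attention- and value-relation
    # mimicry combined with learned uncertainty weights. Shapes are assumed
    # to match between teacher and student layers.

    def __init__(self, num_layers):
        super().__init__()
        # One learnable log-variance per distilled layer.
        self.log_vars = nn.Parameter(torch.zeros(num_layers))

    @staticmethod
    def _kl(teacher_logits, student_logits):
        # KL(teacher || student) over the last dimension.
        t = F.softmax(teacher_logits, dim=-1)
        s = F.log_softmax(student_logits, dim=-1)
        return F.kl_div(s, t, reduction="batchmean")

    def forward(self, teacher_attn, student_attn, teacher_values, student_values):
        # Each argument is a list with one tensor per cross-attention layer:
        # *_attn:   (batch, heads, tgt_len, src_len) pre-softmax attention scores
        # *_values: (batch, heads, src_len, head_dim) value projections
        total = 0.0
        for i, s in enumerate(self.log_vars):
            # Cross-attention distribution mimicry.
            attn_loss = self._kl(teacher_attn[i], student_attn[i])
            # Value-relation mimicry: scaled dot-product relation among values.
            d = teacher_values[i].size(-1)
            t_rel = teacher_values[i] @ teacher_values[i].transpose(-1, -2) / d ** 0.5
            s_rel = student_values[i] @ student_values[i].transpose(-1, -2) / d ** 0.5
            val_loss = self._kl(t_rel, s_rel)
            # Uncertainty weighting: scale each layer's loss by exp(-log_var)
            # and add log_var as a regularizer so weights do not collapse.
            total = total + torch.exp(-s) * (attn_loss + val_loss) + s
        return total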
2. BACKGROUND
2.1. VocaLiST
VocaLiST [13] is a SOTA audio-visual synchronization model. The
input of VocaLiST is a sequence of visual frames and the corresponding
audio features. The output indicates whether a given audio-visual pair is
in sync. VocaLiST consists of an audio-visual front-
end and a synchronization back-end. The audio-visual front end
extracts audio and visual features. The synchronization back-end
comprises three cross-modal Transformer encoders, namely audio-
to-visual (AV) Transformer, visual-to-audio (VA) Transformer, and
Fusion Transformer. Each Transformer block has 4 layers. The
core part of the cross-modal Transformer is the cross-attention layer,
whose input has queries, keys, and values. In VocaLiST, the AV
Transformer uses audio for queries and visual data for keys and val-
ues, while the VA Transformer does the opposite. The Fusion Trans-
former merges these, taking AV output for queries and VA output for
keys and values. Its output undergoes max-pooling over time and is
activated with a tanh function. A fully connected layer then classifies
if voice and lip motion are synchronized, using binary cross-entropy
loss for optimization.
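The back-end described above can be summarized with the following
schematic PyTorch-style sketch. Feed-forward sub-layers, positional
encodings, and the exact dimensions of VocaLiST are omitted; the module
names and the embedding size of 512 are illustrative assumptions, and the
official implementation should be consulted for the precise architecture.

import torch
from torch import nn

class CrossModalEncoder(nn.Module):
    # Minimal stand-in for a cross-modal Transformer encoder: a stack of
    # cross-attention layers where queries come from one modality and
    # keys/values from the other (feed-forward blocks omitted).
    def __init__(self, dim, heads=8, layers=4):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(layers)
        )
        self.norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))

    def forward(self, query, key_value):
        x = query
        for attn, norm in zip(self.attn, self.norm):
            out, _ = attn(x, key_value, key_value)
            x = norm(x + out)  # residual connection around cross-attention
        return x

class SyncBackend(nn.Module):
    # Schematic of the synchronization back-end: AV, VA, and Fusion
    # cross-modal Transformers followed by pooling and a classifier.
    def __init__(self, dim=512):
        super().__init__()
        self.av = CrossModalEncoder(dim)      # audio queries, visual keys/values
        self.va = CrossModalEncoder(dim)      # visual queries, audio keys/values
        self.fusion = CrossModalEncoder(dim)  # AV output queries, VA output keys/values
        self.classifier = nn.Linear(dim, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_a, dim), visual_feats: (batch, T_v, dim)
        av = self.av(audio_feats, visual_feats)
        va = self.va(visual_feats, audio_feats)
        fused = self.fusion(av, va)
        pooled = torch.tanh(fused.max(dim=1).values)  # max-pool over time, then tanh
        return self.classifier(pooled)  # logit; train with BCEWithLogitsLoss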