MULTIMODAL TRANSFORMER DISTILLATION FOR AUDIO-VISUAL SYNCHRONIZATION
Xuanjun Chen12, Haibin Wu1, Chung-Che Wang2, Hung-yi Lee1†, Jyh-Shing Roger Jang2†
1Graduate Institute of Communication Engineering, National Taiwan University
2Department of Computer Science and Information Engineering, National Taiwan University
{d12942018, f07921092, hungyilee}@ntu.edu.tw, geniusturtle6174@gmail.com, jang@mirlab.org

†Equal correspondence. This work was supported by the National Science and Technology Council, Taiwan (Grant no. NSTC 112-2634-F-002-005). We also thank the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources. Code has been made available at: https://github.com/xjchenGit/MTDVocaLiST.
ABSTRACT
Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes the MTDVocaLiST model, which is trained with our proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention distribution and
value-relation in the Transformer of VocaLiST. Additionally, we
harness uncertainty weighting to fully exploit the interaction in-
formation across all layers. Our proposed method is effective in two aspects. From the distillation-method perspective, the MTD loss outperforms other strong distillation baselines. From the distilled model's performance perspective: 1) MTDVocaLiST outperforms the similar-size SOTA models SyncNet and Perfect Match by 15.65% and 3.35%, respectively; 2) MTDVocaLiST reduces the model size of VocaLiST by 83.52% while maintaining similar performance.
Index Terms— Audio-visual synchronization, Transformer distillation, knowledge distillation, lightweight model
1. INTRODUCTION
The audio-visual synchronization task is to determine whether the
mouth movements and speech in the video are synchronized. An
out-of-sync video may cause errors in many tasks, such as audio-
visual user authentication [1], dubbing [2], lip reading [3], active
speaker detection [4, 5], and audio-visual source separation [6–9].
An audio-visual synchronization model often acts as an indispens-
able front-end model for these downstream tasks. Many of these downstream tasks run on mobile devices, where small model size and fast inference speed are needed to ensure a good user experience, for example when correcting the synchronization error of user-generated videos on mobile phones or performing audio-visual user authentication in mobile finance applications [1]. A lightweight audio-visual synchronization model is therefore worth exploring.
A typical framework for audio-visual synchronization tasks is
estimating the similarity between audio and visual segments. Sync-
Net [10] introduced a two-stream architecture to estimate the cross-
modal feature similarities, which is trained to maximize the simi-
larities between features of the in-sync audio-visual segments and
minimize the similarities between features of the out-of-sync audio-
visual segments. Perfect Match (PM) [3, 11] optimizes the relative
similarities between multiple audio features and one visual feature
with a multi-way matching objective function. Audio-Visual Synchronisation with Transformers (AVST) [12] and VocaLiST [13], the current state-of-the-art (SOTA) models, incorporate Transformers [14] to learn the multi-modal interaction and directly classify whether a given audio-visual pair is synchronized, achieving excellent performance. However, both AVST and VocaLiST re-
quire large memory and high computing costs, making these models
unsuitable for edge-device computation.
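As a rough illustration of these two similarity-based objectives (a minimal sketch under assumed tensor shapes and names, not the original SyncNet or PM code), the pairwise contrastive loss and the multi-way matching loss can be written as follows:

```python
# Sketch of the two similarity-based objectives described above.
# Tensor shapes and function names are illustrative assumptions.
import torch
import torch.nn.functional as F

def syncnet_contrastive_loss(audio_feat, visual_feat, is_sync, margin=1.0):
    """SyncNet-style pairwise loss: small distance for in-sync pairs,
    distance pushed above a margin for out-of-sync pairs.
    audio_feat, visual_feat: (B, D); is_sync: (B,) float in {0, 1}."""
    dist = F.pairwise_distance(audio_feat, visual_feat)              # (B,)
    pos = is_sync * dist.pow(2)
    neg = (1 - is_sync) * torch.clamp(margin - dist, min=0).pow(2)
    return (pos + neg).mean()

def multiway_matching_loss(visual_feat, audio_candidates, target_idx):
    """PM-style multi-way matching: classify which of N candidate audio
    segments is synchronized with the visual segment.
    visual_feat: (B, D); audio_candidates: (B, N, D); target_idx: (B,)."""
    sims = torch.einsum("bd,bnd->bn", visual_feat, audio_candidates)  # (B, N)
    return F.cross_entropy(sims, target_idx)
```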
In this paper, we propose MTDVocaLiST, a small-size version of VocaLiST distilled by mimicking the multimodal Transformer behavior of VocaLiST. We further employ uncertainty weighting, which allows us to assess the varying significance of Transformer behavior across different layers, resulting in an enhanced MTDVocaLiST model. To our knowledge, this is the first attempt to distill a model by mimicking multimodal Transformer behavior for the audio-visual task. Our model outperforms the similar-size state-of-the-art models SyncNet and PM by 15.65% and 3.35%, respectively. MTDVocaLiST significantly reduces VocaLiST's size by 83.52% while maintaining performance comparable to that of VocaLiST.
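To make the idea concrete, the following is a hedged sketch of an MTD-style objective: the student's per-layer cross-attention distributions and value relations are matched to the teacher's, and the per-layer losses are combined with learned uncertainty weights. Module names, tensor shapes, and the exact relation definition are assumptions for illustration, not the released MTDVocaLiST implementation.

```python
# Illustrative sketch of layer-wise attention and value-relation distillation
# with uncertainty weighting (assumed shapes; not the authors' code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTDStyleLoss(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable log-variance per layer (uncertainty weighting).
        self.log_vars = nn.Parameter(torch.zeros(num_layers))

    @staticmethod
    def _kl(student_logits, teacher_logits):
        # KL(teacher || student) over the last (key/time) dimension.
        t = F.softmax(teacher_logits, dim=-1)
        s = F.log_softmax(student_logits, dim=-1)
        return F.kl_div(s, t, reduction="batchmean")

    def forward(self, s_attn, t_attn, s_val, t_val):
        """s_attn/t_attn: lists of per-layer attention logits (B, H, Tq, Tk);
        s_val/t_val: lists of per-layer value tensors (B, H, Tk, Dh)."""
        total = 0.0
        for l, log_var in enumerate(self.log_vars):
            attn_loss = self._kl(s_attn[l], t_attn[l])
            # Value relation: scaled dot-product self-similarity of the values.
            d = s_val[l].size(-1)
            s_rel = s_val[l] @ s_val[l].transpose(-1, -2) / math.sqrt(d)
            t_rel = t_val[l] @ t_val[l].transpose(-1, -2) / math.sqrt(d)
            rel_loss = self._kl(s_rel, t_rel)
            layer_loss = attn_loss + rel_loss
            # Uncertainty weighting: exp(-s_l) * L_l + s_l for each layer l.
            total = total + torch.exp(-log_var) * layer_loss + log_var
        return total
```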
2. BACKGROUND
2.1. VocaLiST
VocaLiST [13] is a SOTA audio-visual synchronization model. The input of VocaLiST is a sequence of visual frames and the corresponding audio features. The output indicates whether a given audio-visual pair is in sync. VocaLiST consists of an audio-visual front-end and a synchronization back-end. The audio-visual front-end extracts audio and visual features. The synchronization back-end
comprises three cross-modal Transformer encoders, namely audio-
to-visual (AV) Transformer, visual-to-audio (VA) Transformer, and
Fusion Transformer. Each Transformer block has 4 layers. The
core part of the cross-modal Transformer is the cross-attention layer,
whose input has queries, keys, and values. In VocaLiST, the AV
Transformer uses audio for queries and visual data for keys and val-
ues, while the VA Transformer does the opposite. The Fusion Trans-
former merges these, taking AV output for queries and VA output for
keys and values. Its output undergoes max-pooling over time and is
activated with a tanh function. A fully connected layer then classifies
if voice and lip motion are synchronized, using binary cross-entropy
loss for optimization.
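The data flow of the synchronization back-end described above can be sketched as follows (a simplified schematic with hypothetical module names and dimensions, not the official VocaLiST code):

```python
# Schematic sketch of the VocaLiST-style synchronization back-end.
# Dimensions, layer details, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """One cross-attention layer: queries from one stream, keys/values from the other."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, query, key_value):
        x = self.norm1(query + self.attn(query, key_value, key_value)[0])
        return self.norm2(x + self.ff(x))

class SyncBackend(nn.Module):
    def __init__(self, dim=512, layers=4):
        super().__init__()
        self.av = nn.ModuleList([CrossModalLayer(dim) for _ in range(layers)])
        self.va = nn.ModuleList([CrossModalLayer(dim) for _ in range(layers)])
        self.fusion = nn.ModuleList([CrossModalLayer(dim) for _ in range(layers)])
        self.classifier = nn.Linear(dim, 1)

    def forward(self, audio, visual):
        """audio: (B, Ta, D), visual: (B, Tv, D) features from the front-end."""
        a, v = audio, visual
        for layer in self.av:
            a = layer(a, visual)        # AV Transformer: audio queries, visual keys/values
        for layer in self.va:
            v = layer(v, audio)         # VA Transformer: visual queries, audio keys/values
        x = a
        for layer in self.fusion:
            x = layer(x, v)             # Fusion: AV output queries, VA output keys/values
        x = torch.tanh(x.max(dim=1).values)    # max-pool over time, then tanh
        return self.classifier(x).squeeze(-1)  # logits for binary cross-entropy

# Training would apply nn.BCEWithLogitsLoss() to the returned logits.
```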