
MULTIMODAL TRANSFORMER DISTILLATION FOR AUDIO-VISUAL SYNCHRONIZATION
Xuanjun Chen1,2, Haibin Wu1, Chung-Che Wang2, Hung-yi Lee1†, Jyh-Shing Roger Jang2†
1Graduate Institute of Communication Engineering, National Taiwan University
2Department of Computer Science and Information Engineering, National Taiwan University
{d12942018, f07921092, hungyilee}@ntu.edu.tw, geniusturtle6174@gmail.com, jang@mirlab.org
ABSTRACT
Audio-visual synchronization aims to determine whether the
mouth movements and speech in the video are synchronized. Vo-
caLiST reaches state-of-the-art performance by incorporating multimodal
Transformers to model audio-visual interaction information. However, it
requires high computing resources, making it impractical for real-world
applications. This paper proposes MTDVocaLiST, a model trained with our
multimodal Transformer distillation (MTD) loss. The MTD loss enables
MTDVocaLiST to deeply mimic the cross-attention distribution and
value-relation in the Transformer of VocaLiST. Additionally, we
harness uncertainty weighting to fully exploit the interaction in-
formation across all layers. Our proposed method is effective in
two aspects: From the distillation method perspective, MTD loss
outperforms other strong distillation baselines. From the distilled
model's performance perspective: 1) MTDVocaLiST outperforms the
similar-size SOTA models SyncNet and Perfect Match by 15.65% and 3.35%,
respectively; 2) MTDVocaLiST reduces the model size of VocaLiST by
83.52% while still maintaining similar performance.
Index Terms—Audio-visual synchronization, Transformer dis-
tillation, knowledge distillation, lightweight model
1. INTRODUCTION
The audio-visual synchronization task is to determine whether the
mouth movements and speech in the video are synchronized. An
out-of-sync video may cause errors in many tasks, such as audio-visual
user authentication [1], dubbing [2], lip reading [3], active speaker
detection [4, 5], and audio-visual source separation [6–9]. An
audio-visual synchronization model often acts as an indispensable
front-end model for these downstream tasks. These downstream tasks often
run on mobile devices, where small model sizes and fast inference speed
are required to ensure a good user experience, for example when correcting
the synchronization error of user-generated videos on mobile phones or
performing audio-visual user authentication in mobile finance applications
[1]. To support these applications, a lightweight audio-visual
synchronization model is worth exploring.
†Equal correspondence. This work was supported by the National Science and
Technology Council, Taiwan (Grant no. NSTC 112-2634-F-002-005). We also
thank the National Center for High-performance Computing (NCHC) of National
Applied Research Laboratories (NARLabs) in Taiwan for providing
computational and storage resources. Code has been made available at:
https://github.com/xjchenGit/MTDVocaLiST.
A typical framework for audio-visual synchronization is to estimate the
similarity between audio and visual segments. SyncNet [10] introduced a
two-stream architecture to estimate cross-modal feature similarities; it is
trained to maximize the similarities between features of in-sync
audio-visual segments and minimize the similarities between features of
out-of-sync audio-visual segments. Perfect Match (PM) [3, 11] optimizes the
relative similarities between multiple audio features and one visual
feature with a multi-way matching objective function. Audio-Visual
Synchronisation with Transformers (AVST) [12] and VocaLiST [13], the
current state-of-the-art (SOTA) models, incorporate Transformers [14] to
learn the multi-modal interaction and directly classify whether a given
audio-visual pair is synchronized, achieving excellent performance.
However, both AVST and VocaLiST require large memory and high computing
costs, making them unsuitable for edge-device computation.
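To make these similarity-based objectives concrete, the following Python
(PyTorch-style) sketch illustrates a binary contrastive objective in the
spirit of SyncNet and a multi-way matching objective in the spirit of PM.
The function names, tensor shapes, and the cosine/dot-product similarity
choices are illustrative assumptions, not the exact formulations of [10]
or [3, 11].

import torch
import torch.nn.functional as F

def contrastive_sync_loss(audio_emb, visual_emb, is_sync):
    # audio_emb, visual_emb: (batch, dim) embeddings from the two streams.
    # is_sync: (batch,) float labels, 1.0 for in-sync, 0.0 for out-of-sync.
    # Push in-sync pairs towards high cosine similarity and out-of-sync
    # pairs towards low similarity (illustrative; SyncNet [10] itself uses
    # a contrastive loss on Euclidean distances).
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)
    prob = (sim + 1.0) / 2.0  # map [-1, 1] similarity to a pseudo-probability
    return F.binary_cross_entropy(prob, is_sync)

def multiway_matching_loss(visual_emb, audio_candidates, target_idx):
    # visual_emb:       (batch, dim) one visual embedding per clip.
    # audio_candidates: (batch, num_candidates, dim) audio embeddings at
    #                   several temporal offsets; exactly one is in sync.
    # target_idx:       (batch,) long tensor, index of the in-sync candidate.
    # Relative similarity to every candidate, treated as N-way classification.
    logits = torch.einsum("bd,bnd->bn", visual_emb, audio_candidates)
    return F.cross_entropy(logits, target_idx)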
In this paper, we propose to distill a small-size version of VocaLiST,
namely MTDVocaLiST, by mimicking the multimodal Transformer behavior of
VocaLiST. We further propose to
employ uncertainty weighting, which allows us to assess the vary-
ing significance of Transformer behavior across different layers, re-
sulting in an enhanced MTDVocaLiST model. To our knowledge,
this is the first attempt to distill a model by mimicking multimodal
Transformer behavior for the audio-visual synchronization task. Our model
outperforms the similar-size state-of-the-art models SyncNet and PM by
15.65% and 3.35%, respectively. MTDVocaLiST significantly reduces
VocaLiST's size by 83.52% while maintaining performance comparable to that
of VocaLiST.
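As a rough illustration of the distillation idea (not the exact loss used
by MTDVocaLiST; the released code gives the actual definition), a student
can be trained to match the teacher's cross-attention distributions and
value-relations at every cross-attention layer, with one learnable
uncertainty weight per layer. The KL-divergence formulation, the scaled
dot-product value-relation, and the Kendall-style uncertainty weighting in
this sketch are assumptions for illustration.

import torch
import torch.nn.functional as F
from torch import nn

class UncertaintyWeightedMTDLoss(nn.Module):
    # Sketch of an MTD-style loss: per-layer attention- and value-relation
    # mimicry combined with learned uncertainty weights. Shapes are assumed
    # to match between teacher and student layers.

    def __init__(self, num_layers):
        super().__init__()
        # One learnable log-variance per distilled layer.
        self.log_vars = nn.Parameter(torch.zeros(num_layers))

    @staticmethod
    def _kl(teacher_logits, student_logits):
        # KL(teacher || student) over the last dimension.
        t = F.softmax(teacher_logits, dim=-1)
        s = F.log_softmax(student_logits, dim=-1)
        return F.kl_div(s, t, reduction="batchmean")

    def forward(self, teacher_attn, student_attn, teacher_values, student_values):
        # Each argument is a list with one tensor per cross-attention layer:
        # *_attn:   (batch, heads, tgt_len, src_len) pre-softmax attention scores
        # *_values: (batch, heads, src_len, head_dim) value projections
        total = 0.0
        for i, s in enumerate(self.log_vars):
            # Cross-attention distribution mimicry.
            attn_loss = self._kl(teacher_attn[i], student_attn[i])
            # Value-relation mimicry: scaled dot-product relation among values.
            d = teacher_values[i].size(-1)
            t_rel = teacher_values[i] @ teacher_values[i].transpose(-1, -2) / d ** 0.5
            s_rel = student_values[i] @ student_values[i].transpose(-1, -2) / d ** 0.5
            val_loss = self._kl(t_rel, s_rel)
            # Uncertainty weighting: scale each layer's loss by exp(-log_var)
            # and add log_var as a regularizer so weights do not collapse.
            total = total + torch.exp(-s) * (attn_loss + val_loss) + s
        return total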
2. BACKGROUND
2.1. VocaLiST
VocaLiST [13] is a SOTA audio-visual synchronization model. The
input of VocaLiST is a sequence of visual frames and the corresponding
audio features. The output indicates whether a given audio-visual pair is
in sync. VocaLiST consists of an audio-visual front-
end and a synchronization back-end. The audio-visual front end
extracts audio and visual features. The synchronization back-end
comprises three cross-modal Transformer encoders, namely audio-
to-visual (AV) Transformer, visual-to-audio (VA) Transformer, and
Fusion Transformer. Each Transformer block has 4 layers. The
core part of the cross-modal Transformer is the cross-attention layer,
whose input has queries, keys, and values. In VocaLiST, the AV
Transformer uses audio for queries and visual data for keys and val-
ues, while the VA Transformer does the opposite. The Fusion Trans-
former merges these, taking AV output for queries and VA output for
keys and values. Its output undergoes max-pooling over time and is
activated with a tanh function. A fully connected layer then classifies
if voice and lip motion are synchronized, using binary cross-entropy
loss for optimization.
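The back-end described above can be summarized with the following
schematic PyTorch-style sketch. Feed-forward sub-layers, positional
encodings, and the exact dimensions of VocaLiST are omitted; the module
names and the embedding size of 512 are illustrative assumptions, and the
official implementation should be consulted for the precise architecture.

import torch
from torch import nn

class CrossModalEncoder(nn.Module):
    # Minimal stand-in for a cross-modal Transformer encoder: a stack of
    # cross-attention layers where queries come from one modality and
    # keys/values from the other (feed-forward blocks omitted).
    def __init__(self, dim, heads=8, layers=4):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(layers)
        )
        self.norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))

    def forward(self, query, key_value):
        x = query
        for attn, norm in zip(self.attn, self.norm):
            out, _ = attn(x, key_value, key_value)
            x = norm(x + out)  # residual connection around cross-attention
        return x

class SyncBackend(nn.Module):
    # Schematic of the synchronization back-end: AV, VA, and Fusion
    # cross-modal Transformers followed by pooling and a classifier.
    def __init__(self, dim=512):
        super().__init__()
        self.av = CrossModalEncoder(dim)      # audio queries, visual keys/values
        self.va = CrossModalEncoder(dim)      # visual queries, audio keys/values
        self.fusion = CrossModalEncoder(dim)  # AV output queries, VA output keys/values
        self.classifier = nn.Linear(dim, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_a, dim), visual_feats: (batch, T_v, dim)
        av = self.av(audio_feats, visual_feats)
        va = self.va(visual_feats, audio_feats)
        fused = self.fusion(av, va)
        pooled = torch.tanh(fused.max(dim=1).values)  # max-pool over time, then tanh
        return self.classifier(pooled)  # logit; train with BCEWithLogitsLoss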