
Table 1: Database details of the LF and PF tracks. Duration, Mean, Max, and Min are all measured in seconds.

Track  Subset        Total    Genuine  Fake    Duration (s)  Mean (s)  Max (s)  Min (s)
LF     Training      27,084   3,012    24,072  85,355.51     3.15      60.01    0.86
       Development   28,324   2,307    26,017  89,567.78     3.16      60.01    0.86
       Adaptation1   1,000    300      700     3,627.24      3.63      60.01    1.13
       Test1         109,199  -        -       383,687.88    3.51      158.14   0.26
PF     Adaptation2   1,052    0        1,052   4,568.56      4.34      13.42    1.30
       Adaptation2★  2,104    1,052    1,052   -             -         -        -
       Test2         100,625  -        -       77,625.32     5.94      68.46    1.53
The ADD 2022 challenge comprises three tracks: Low-quality Fake Audio Detection (LF), Partially Fake Audio Detection (PF), and Audio Fake Game (FG). The LF task consists of genuine and entirely fake utterances generated by text-to-speech (TTS) and voice conversion (VC) algorithms, with various background noises and disturbances. The PF task comprises genuine and partially fake utterances generated by splicing original genuine utterances with real or synthesized segments. According to the challenge plan [6], all three tasks share the same training and development datasets but use different adaptation and test datasets. Therefore, domain mismatch between training and test data is a severe challenge for all three tracks.
This paper describes our systems submitted to ADD 2022. We focused on the LF and PF tracks. For the LF track, our system aggregates the proposed low-quality data augmentation, domain adaptation via fine-tuning, and a greedy fusion of various complementary features; we submitted a primary system for track 1. For the PF track, we built our basic system from several popular self-supervised learning (SSL) models, each followed by two Bidirectional Long Short-Term Memory (Bi-LSTM) layers [7] and one fully-connected layer. We fine-tuned all models on the adaptation set to reduce the mismatch between the training and test sets, and submitted a single system for track 2.
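To make the PF architecture concrete, the following is a minimal PyTorch sketch of this back-end. The SSL encoder (e.g., a wav2vec 2.0-style model) is assumed to run upstream and produce frame-level features; all dimensions, such as ssl_dim = 768, are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

class SSLBiLSTMClassifier(nn.Module):
    """Back-end sketch: frame-level SSL features -> 2 Bi-LSTM layers -> FC.

    `ssl_dim` and `hidden` are placeholder values; the SSL encoder
    itself is assumed to be applied upstream.
    """
    def __init__(self, ssl_dim=768, hidden=128, n_classes=2):
        super().__init__()
        # Two stacked bidirectional LSTM layers over SSL frame features.
        self.blstm = nn.LSTM(input_size=ssl_dim, hidden_size=hidden,
                             num_layers=2, bidirectional=True,
                             batch_first=True)
        # Single fully-connected output layer (genuine vs. fake).
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):            # feats: (batch, frames, ssl_dim)
        out, _ = self.blstm(feats)       # (batch, frames, 2 * hidden)
        pooled = out.mean(dim=1)         # average pooling over time
        return self.fc(pooled)           # (batch, n_classes)

# Usage with dummy SSL features: a 3-utterance batch of 200 frames each.
logits = SSLBiLSTMClassifier()(torch.randn(3, 200, 768))
```

The average pooling over the time axis is what later allows variable-length PF inputs to be handled by a single fixed-size classifier head.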
The rest of the paper is organized as follows: Section 2 presents the details of the ADD 2022 challenge and its data. Section 3 describes our single systems and their setups. Section 4 presents and analyzes the experimental results. Section 5 summarizes our conclusions.
2 TASK DESCRIPTION AND DATA
The data for the challenge consists of training, development, adaptation, and test sets. Utterances in both the training and development sets are noiseless: genuine utterances are selected from the clean AISHELL-3 [8] corpus, and fake utterances are generated by mainstream speech synthesis and voice conversion systems built on AISHELL-3. Data for the LF and PF tracks is distributed as 16 kHz, 16-bit WAV files. The datasets are summarized in Table 1. Since the organizers only provide fake audio for the adaptation set of track 2, we randomly select 1,052 genuine utterances (the same number as the fake ones) from the training and development sets to balance the fake and genuine classes; we call this set Adp2★ and use it to compare SSL models.
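A minimal sketch of assembling this balanced set is shown below; the utterance-ID lists are placeholders standing in for the official metadata, and the fixed seed is our own choice for reproducibility.

```python
import random

# Hypothetical utterance-ID lists standing in for the real metadata:
# 3,012 + 2,307 genuine utterances from the train and dev sets (Table 1),
# and the 1,052 fake utterances of the track-2 adaptation set.
genuine_pool = [f"genuine_{i:05d}" for i in range(5319)]
fake_adapt = [f"fake_{i:04d}" for i in range(1052)]

random.seed(0)                                    # reproducible draw
sampled_genuine = random.sample(genuine_pool, len(fake_adapt))

# Adp2*: 1,052 genuine + 1,052 fake = 2,104 balanced utterances,
# labelled 0 = genuine, 1 = fake.
adp2_star = ([(uid, 0) for uid in sampled_genuine]
             + [(uid, 1) for uid in fake_adapt])
```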
For the LF track, the adaptation and test sets are composed of genuine and fully fake utterances, but the test set is more complicated. Since the provided training and development utterances are noiseless, this track requires developing systems that remain robust in noisy environments under the mismatch between training and testing data.
The PF track is a new topic proposed in 2021 [9, 10]. As the name suggests, the spoofed utterances in this scenario are only partially fake. The genuine segments within partially fake utterances can therefore degrade the performance of countermeasure (CM) systems, making this track more challenging. Moreover, since fake utterances contain genuine or spoofed clips, the key to detecting partially fake audio is to detect frame transitions or changes in the temporal domain. The evaluation metric for both tracks is the Equal Error Rate (EER) [6].
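For reference, the EER is the operating point at which the false-acceptance rate (fake accepted as genuine) equals the false-rejection rate (genuine rejected as fake). A minimal NumPy sketch of its computation, assuming higher scores indicate genuine speech:

```python
import numpy as np

def compute_eer(genuine_scores, fake_scores):
    """Equal Error Rate: the threshold where the false-acceptance
    rate equals the false-rejection rate (higher score = more genuine)."""
    scores = np.concatenate([genuine_scores, fake_scores])
    labels = np.concatenate([np.ones_like(genuine_scores),
                             np.zeros_like(fake_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep thresholds upward: FRR rises while FAR falls.
    frr = np.cumsum(labels) / labels.sum()                   # genuine rejected
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # fakes accepted
    idx = np.nanargmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# Example: well-separated toy scores give an EER of 0.
eer = compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.2, 0.3, 0.1]))
```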
3 SYSTEM DESCRIPTION
3.1 Features
As shown in Table 2, we explore various features in three categories: temporal raw waveforms, hand-crafted features, and deep embedding features. The combination of these features is expected to fully capture the spectro-temporal divergences between spoofed and genuine utterances.
Besides, we apply different trimming strategies to the inputs of the two tracks. For the LF track, a unified duration of 3 s is used for feature extraction. For the PF track, as information from the entire utterance is of significant importance, the whole utterance is used as input (capped at 15 s in practice due to computational limitations). An average pooling layer is employed to handle inputs of varied duration.
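A minimal sketch of these two trimming policies follows; the tiling of short LF clips up to 3 s is our assumption, as the exact padding scheme is not specified here.

```python
import torch

def prepare_input(wav, track, sr=16000):
    """Track-specific input trimming (sketch; padding policy assumed).

    LF: fix every utterance to 3 s by truncating or tiling.
    PF: keep the whole utterance, capped at 15 s; variable lengths
        are later reduced by the average-pooling layer.
    """
    if track == "LF":
        target = 3 * sr
        if wav.numel() < target:                  # tile short clips
            reps = -(-target // wav.numel())      # ceiling division
            wav = wav.repeat(reps)
        return wav[:target]
    else:  # "PF"
        return wav[:15 * sr]

# Example: a 1.2 s clip becomes exactly 3 s for the LF track.
clip = torch.randn(int(1.2 * 16000))
assert prepare_input(clip, "LF").numel() == 3 * 16000
```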
3.1.1 Temporal raw feature.
Considerable evidence [11] shows that avoiding hand-crafted features in an end-to-end architecture may improve the performance of anti-spoofing systems. Therefore, following [12], we use the raw waveform as input to a bank of Sinc filters.
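As an illustration, a SincNet-style band-pass kernel can be built as the difference of two windowed low-pass sinc filters. The band edges and kernel size below are placeholders; in [12] the band edges are learned parameters rather than fixed constants.

```python
import torch

def sinc_bandpass(f1_hz, f2_hz, kernel_size=251, sr=16000):
    """One band-pass sinc kernel: difference of two low-pass sinc
    filters with cutoffs f1 < f2, shaped by a Hamming window."""
    n = torch.arange(kernel_size) - (kernel_size - 1) / 2
    f1, f2 = f1_hz / sr, f2_hz / sr
    h = 2 * f2 * torch.sinc(2 * f2 * n) - 2 * f1 * torch.sinc(2 * f1 * n)
    return h * torch.hamming_window(kernel_size, periodic=False)

# Convolve a raw waveform with a bank of 10 band-pass kernels.
bank = torch.stack([sinc_bandpass(lo, lo + 200)
                    for lo in range(100, 2100, 200)])
wav = torch.randn(1, 1, 16000)                          # 1 s at 16 kHz
feats = torch.nn.functional.conv1d(wav, bank.unsqueeze(1))
```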
3.1.2 Hand-crafted features.
Hand-crafted features are the most commonly used in anti-spoofing, since they encode specific prior knowledge. All features applied or designed in this subsection are extracted online on the GPU, which accelerates training.
Spectrogram. The log-spectrogram [13] is extracted with a 50 ms frame length, 25 ms frame shift, 1024 FFT points, and a Hamming window. Both magnitude and phase spectrograms are extracted.
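This extraction can be sketched with torch.stft as follows; the small epsilon added before the logarithm is our own numerical-stability choice.

```python
import torch

def log_spectrogram(wav, sr=16000):
    """Log-magnitude and phase spectrograms: 50 ms frames (800 samples
    at 16 kHz), 25 ms shift (400 samples), 1024-point FFT, Hamming window."""
    win = int(0.050 * sr)
    spec = torch.stft(wav,
                      n_fft=1024,
                      hop_length=int(0.025 * sr),   # 25 ms frame shift
                      win_length=win,               # 50 ms frame length
                      window=torch.hamming_window(win),
                      return_complex=True)
    log_mag = torch.log(spec.abs() + 1e-9)          # log-magnitude
    phase = spec.angle()                            # phase
    return log_mag, phase

# Example on a 3 s input, matching the LF trimming duration.
log_mag, phase = log_spectrogram(torch.randn(3 * 16000))
```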