Deep Spectro-temporal Artifacts for Detecting Synthesized Speech
Xiaohui Liu
liuxiaohui2021@tju.edu.cn
Tianjin University
Tianjin, China
Meng Liu
Tianjin University
Tianjin, China
Lin Zhang
National Institute of Informatics
Tokyo, Japan
Linjuan Zhang
Taiyuan University of Technology
Taiyuan, China
Chang Zeng
National Institute of Informatics
Tokyo, Japan
Kai Li
Japan Advanced Institute of Science
and Technology
Nomi, Ishikawa, Japan
Nan Li
Tianjin University
Tianjin, China
Kong Aik Lee
Institute for Infocomm Research,
A*STAR
Singapore
Longbiao Wang
Tianjin University
Tianjin, China
Jianwu Dang
Japan Advanced Institute of Science
and Technology
Nomi, Ishikawa, Japan
ABSTRACT
The Audio Deep Synthesis Detection (ADD) Challenge was held to promote the detection of generated human-like speech. This paper presents our submitted systems and an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). We detect spectro-temporal artifacts using raw temporal signals, spectral features, and deep embedding features. For track 1, our system aggregates low-quality data augmentation, domain adaptation via fine-tuning, and fusion of various complementary features. Furthermore, we analyze the clustering characteristics of subsystems built on different features through visualization and explain the effectiveness of our proposed greedy fusion strategy. For track 2, we detect frame transitions and smoothing with a self-supervised learning structure to capture the manipulation of PF attacks in the time domain. We ranked 4th in track 1 and 5th in track 2.
CCS CONCEPTS
• Computing methodologies → Artificial intelligence.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
DDAM ’22, October 14, 2022, Lisboa, Portugal.
©2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9496-3/22/10. . . $15.00
https://doi.org/10.1145/3552466.3556527
KEYWORDS
Audio Deep Synthesis Detection, Spectro-temporal, Domain Adap-
tation, Self-Supervised Learning, Frame transition, Greedy Fusion
ACM Reference Format:
Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, and Jianwu Dang. 2022. Deep Spectro-temporal Artifacts for Detecting Synthesized Speech. In Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia (DDAM '22), October 14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3552466.3556527
1 INTRODUCTION
Automatic Speaker Verification (ASV) has been widely used in security, human-computer interaction, and other fields. However, like other biometric authentication methods, ASV systems are vulnerable to attacks from speech generation technology [1]. Synthesized speech generated by deep learning algorithms can be difficult even for human listeners to distinguish, which poses a great threat to ASV systems. Thus, it is of great significance to build a countermeasure (CM) system that can effectively distinguish spoofed speech from genuine speech.
Previous ASVspoof challenges [2-5] have played a key role in fostering spoofed speech detection research but ignored three practical scenarios: 1) diverse background noises and disturbances in fake audio, 2) partially spoofed segments embedded in otherwise genuine speech, and 3) the rapid emergence of new speech synthesis and voice conversion algorithms. Besides, the existing anti-spoofing databases are mainly in English.
To address these problems, the first Audio Deep Synthesis Detection Challenge (ADD 2022) [6] focuses on three different tracks using a Mandarin corpus: Low-quality Fake Audio Detection (LF),
Table 1: Database details of the LF and PF tracks. Duration, Mean, Max, and Min are all measured in seconds.

Track  Subset        Total    Genuine  Fake    Duration (s)  Mean (s)  Max (s)  Min (s)
LF     Training       27,084    3,012  24,072     85,355.51      3.15    60.01     0.86
       Development    28,324    2,307  26,017     89,567.78      3.16    60.01     0.86
       Adaptation1     1,000      300     700      3,627.24      3.63    60.01     1.13
       Test1         109,199        -       -    383,687.88      3.51   158.14     0.26
PF     Adaptation2     1,052        0   1,052      4,568.56      4.34    13.42     1.30
       Adp2*           2,104    1,052   1,052             -         -        -        -
       Test2         100,625        -       -     77,625.32      5.94    68.46     1.53
Partially Fake Audio Detection (PF), and Audio Fake Game (FG). The LF task consists of genuine and entirely fake utterances generated by text-to-speech (TTS) and voice conversion (VC) algorithms, but contains various background noises and disturbances. The PF task comprises genuine and partially fake utterances generated by combining original genuine utterances with real or synthesized utterances. According to the challenge plan [6], all three tasks share the same training and development datasets but have different adaptation and test datasets. Therefore, the domain mismatch between training and test is a severe challenge for all three tracks.
This paper describes our systems submitted to ADD 2022; we focused on the LF and PF tracks. For the LF track, the proposed low-quality data augmentation, domain adaptation via fine-tuning, and greedy fusion of various complementary features were aggregated in our system, and we submitted a primary (fused) system for track 1. A hedged sketch of how such a greedy fusion loop can work is given below.
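As one minimal sketch of greedy score-level fusion (an illustrative recipe, not necessarily the authors' exact weighting: starting from an empty ensemble, repeatedly add whichever remaining subsystem's equal-weight score average most lowers the development-set EER; the names greedy_fuse, sys_scores, and eer_fn are ours):

import numpy as np

def greedy_fuse(sys_scores, labels, eer_fn):
    """Greedily add subsystems whose equal-weight score average
    most lowers the development-set EER.
    sys_scores: dict mapping subsystem name -> np.ndarray of scores."""
    chosen, pool = [], set(sys_scores)
    best_eer = float("inf")
    while pool:
        # Pick the candidate giving the lowest fused EER this round.
        cand = min(pool, key=lambda s: eer_fn(
            np.mean([sys_scores[k] for k in chosen + [s]], axis=0), labels))
        fused = np.mean([sys_scores[k] for k in chosen + [cand]], axis=0)
        if eer_fn(fused, labels) >= best_eer:
            break  # no further improvement: stop adding subsystems
        best_eer = eer_fn(fused, labels)
        chosen.append(cand)
        pool.remove(cand)
    return chosen, best_eer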
For the PF track, we built several popular self-supervised learning (SSL) models, each followed by two Bidirectional Long Short-Term Memory (Bi-LSTM) layers [7] and one fully-connected layer, as our basic system (sketched below). We fine-tuned all models on the adaptation set to reduce the mismatch between the training and test sets, and submitted a single system for track 2.
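A minimal sketch of this basic PF architecture, assuming PyTorch and a wav2vec 2.0-style SSL front-end from torchaudio (the specific checkpoint, hidden size, and output head are illustrative assumptions, not the authors' exact configuration):

import torch
import torch.nn as nn
import torchaudio

class SSLBiLSTMDetector(nn.Module):
    """SSL front-end -> two Bi-LSTM layers -> one fully-connected layer."""
    def __init__(self, feat_dim: int = 768, hidden: int = 256):
        super().__init__()
        # Stand-in SSL model; the paper compares several SSL front-ends.
        self.ssl = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 2)  # genuine vs. fake logits

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # wav: (B, T)
        feats, _ = self.ssl.extract_features(wav)  # list of layer outputs
        x, _ = self.lstm(feats[-1])                # (B, T', 2 * hidden)
        x = x.mean(dim=1)                          # average pooling over time
        return self.fc(x)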
The rest of the paper is organized as follows: Section 2 presents the details of the ADD 2022 challenge and its data. Section 3 describes our single systems and their setups. Section 4 presents and analyzes the experimental results. Section 5 summarizes our conclusions.
2 TASK DESCRIPTION AND DATA
The data for the challenge consists of training, development, adaptation, and test sets. Utterances in both the training and development sets are noiseless: genuine utterances are selected from the clean AISHELL-3 corpus [8], and fake utterances are generated by mainstream speech synthesis and voice conversion systems based on AISHELL-3. Data for the LF track and the PF track is distributed as 16 kHz, 16-bit WAV files. The datasets are summarized in Table 1. Since the organizers provide only fake audio for the adaptation set of track 2, we randomly select 1,052 genuine utterances (the same number as the fake ones) from the training and development sets to balance fake and genuine; we call this set Adp2*, and it will be used to compare SSL models.
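A minimal sketch of this balancing step, assuming plain file lists (the function and variable names are illustrative):

import random

def build_adp2_star(genuine_utts, fake_adapt_utts, seed=0):
    """Sample as many genuine utterances as there are fake adaptation
    utterances (here 1,052) and merge them into a balanced set."""
    random.seed(seed)
    sampled = random.sample(genuine_utts, len(fake_adapt_utts))
    # Label convention here: 1 = genuine, 0 = fake.
    return [(u, 1) for u in sampled] + [(u, 0) for u in fake_adapt_utts]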
For the LF track, the adaptation and test sets are composed of genuine and fully fake utterances, but the test set is more complicated. Since the provided utterances in the training and development sets are noiseless, this track requires developing systems that are robust to noisy environments under a mismatch between training and testing data.
The PF track is a new topic proposed in 2021 [9, 10]. As the name suggests, the spoofed utterances in this scenario are partially faked. The genuine parts that exist in partially fake utterances can degrade the performance of CM systems, making this track more challenging. Besides, since fake utterances contain genuine or spoofed clippings, the key to detecting partially faked audio is to detect frame transitions or changes in the temporal domain.
The metric for both tracks is the Equal Error Rate (EER) [6]; a minimal sketch of its computation follows.
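One standard way to compute EER (variable names are illustrative):

import numpy as np

def compute_eer(scores, labels):
    """EER: the operating point where the false-acceptance rate
    (fake accepted) equals the false-rejection rate (genuine rejected).
    labels: 1 = genuine, 0 = fake; higher score = more genuine."""
    order = np.argsort(scores)
    scores = np.asarray(scores)[order]
    labels = np.asarray(labels)[order]
    n_genuine = labels.sum()
    n_fake = len(labels) - n_genuine
    # Thresholding at scores[i] rejects everything up to index i.
    frr = np.cumsum(labels) / n_genuine              # genuine rejected
    far = (n_fake - np.cumsum(1 - labels)) / n_fake  # fake still accepted
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)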
3 SYSTEM DESCRIPTION
3.1 Features
As shown in Table 2, we explore various features in three categories: the temporal raw waveform, hand-crafted features, and deep embedding features. The combination of these features is expected to fully capture the spectro-temporal divergences between spoofed and genuine utterances.

Besides, we apply different input trimming strategies for the two tracks. For the LF track, a unified duration of 3 s is applied for feature extraction. For the PF track, since whole-utterance information is of significant importance, the entire utterance is used as input.¹ An average pooling layer is employed to handle inputs of varied duration. Both policies are sketched below.
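A minimal sketch of the two input policies, assuming 16 kHz audio and PyTorch tensors (repeat-padding for short LF inputs is an illustrative assumption):

import torch

SR = 16_000  # challenge audio is 16 kHz

def fix_duration(wav, seconds=3.0):
    """LF track: crop or repeat-pad a 1-D waveform to exactly `seconds`."""
    target = int(seconds * SR)
    if wav.numel() >= target:
        return wav[:target]
    reps = target // wav.numel() + 1
    return wav.repeat(reps)[:target]

def cap_duration(wav, max_seconds=15.0):
    """PF track: keep the whole utterance up to the 15 s cap (footnote 1);
    downstream average pooling absorbs the remaining length variation."""
    return wav[: int(max_seconds * SR)]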
3.1.1 Temporal raw feature.
Considerable evidence [11] shows that avoiding hand-crafted features in favor of an end-to-end architecture may improve the performance of anti-spoofing systems. Therefore, we use the raw waveform as input, processed by sinc filters following [12]; a minimal sketch of such a front-end is given below.
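A sketch of a sinc-filter front-end in the spirit of [12] (SincNet-style learnable band-pass filters over the raw waveform); the filter count, kernel length, and initialization are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """Learnable band-pass filter bank applied directly to raw audio."""
    def __init__(self, out_channels=20, kernel_size=129, sample_rate=16_000):
        super().__init__()
        self.k = kernel_size
        # Learnable low cutoff and bandwidth (Hz), linearly spaced init.
        self.f_low = nn.Parameter(
            torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels))
        self.f_band = nn.Parameter(torch.full((out_channels,), 100.0))
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", n / sample_rate)  # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, wav):  # wav: (B, 1, T)
        f1 = torch.abs(self.f_low)
        f2 = f1 + torch.abs(self.f_band)
        # Band-pass impulse response = difference of two sinc low-passes.
        lp = lambda f: 2 * f[:, None] * torch.sinc(2 * f[:, None] * self.t)
        filters = (lp(f2) - lp(f1)) * self.window
        return F.conv1d(wav, filters.unsqueeze(1), padding=self.k // 2)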
3.1.2 Hand-craed features.
Hand-crafted features are most commonly used in anti-spoong,
since they contain specic knowledge. In this subsection, the fol-
lowing applied or designed features are all online GPU features,
which accelerate the training speed.
Spectrogram
. The log-Spectrogram [
13
] is extracted with 50
ms frame length, 25 ms frame shift, 1024 FFT point and hamming
window. Both magnitude and phase spectrogram are extracted,
relatively.
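A minimal sketch of this extraction using torch.stft, one plausible on-GPU realization of the stated settings (the exact library the authors used is not specified here):

import torch

def log_spec_and_phase(wav, sr=16_000):
    """50 ms frames, 25 ms shift, 1024-point FFT, Hamming window."""
    win = int(0.050 * sr)  # 800 samples
    hop = int(0.025 * sr)  # 400 samples
    spec = torch.stft(wav, n_fft=1024, hop_length=hop, win_length=win,
                      window=torch.hamming_window(win, device=wav.device),
                      return_complex=True)
    log_mag = torch.log(spec.abs() + 1e-6)  # log-magnitude spectrogram
    phase = torch.angle(spec)               # phase spectrogram
    return log_mag, phase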
¹ In practice, the upper duration limit is 15 s due to computational constraints.