
Table 1: Database details of the LF and PF tracks. Duration, Mean, Max, and Min are all measured in seconds.

Track  Subset        Total    Genuine  Fake    Duration (s)  Mean (s)  Max (s)  Min (s)
LF     Training      27,084   3,012    24,072  85,355.51     3.15      60.01    0.86
       Development   28,324   2,307    26,017  89,567.78     3.16      60.01    0.86
       Adaptation1   1,000    300      700     3,627.24      3.63      60.01    1.13
       Test1         109,199  -        -       383,687.88    3.51      158.14   0.26
PF     Adaptation2   1,052    0        1,052   4,568.56      4.34      13.42    1.30
       Adaptation2★  2,104    1,052    1,052   -             -         -        -
       Test2         100,625  -        -       77,625.32     5.94      68.46    1.53
The ADD 2022 challenge comprises three tracks: Low-quality Fake Audio Detection (LF), Partially Fake Audio Detection (PF), and Audio Fake Game (FG). The LF task consists of genuine and entirely fake utterances generated by text-to-speech (TTS) and voice conversion (VC) algorithms, with various background noises and disturbances. The PF task comprises genuine and partially fake utterances generated by splicing original genuine utterances with real or synthesized segments. According to the challenge plan [6], all three tasks share the same training and development datasets but use different adaptation and test datasets. Therefore, domain mismatch between training and test data is a severe challenge for all three tracks.
This paper describes our systems submitted to ADD 2022. We focused on the LF and PF tracks. For the LF track, our system aggregates the proposed low-quality data augmentation, domain adaptation via fine-tuning, and a greedy fusion of various complementary features; we submitted a primary system for track 1. For the PF track, we built our basic system from several popular self-supervised learning (SSL) models, each followed by two Bidirectional Long Short-Term Memory (Bi-LSTM) layers [7] and one fully-connected layer. We fine-tuned all models on the adaptation set to reduce the mismatch between the training and test sets, and submitted a single system for track 2.
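To make the PF architecture concrete, the following is a minimal PyTorch sketch of this back-end. The SSL encoder (e.g., a wav2vec 2.0-style model) is assumed to run upstream and produce frame-level features; all dimensions, such as ssl_dim = 768, are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

class SSLBiLSTMClassifier(nn.Module):
    """Back-end sketch: frame-level SSL features -> 2 Bi-LSTM layers -> FC.

    `ssl_dim` and `hidden` are placeholder values; the SSL encoder
    itself is assumed to be applied upstream.
    """
    def __init__(self, ssl_dim=768, hidden=128, n_classes=2):
        super().__init__()
        # Two stacked bidirectional LSTM layers over SSL frame features.
        self.blstm = nn.LSTM(input_size=ssl_dim, hidden_size=hidden,
                             num_layers=2, bidirectional=True,
                             batch_first=True)
        # Single fully-connected output layer (genuine vs. fake).
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):            # feats: (batch, frames, ssl_dim)
        out, _ = self.blstm(feats)       # (batch, frames, 2 * hidden)
        pooled = out.mean(dim=1)         # average pooling over time
        return self.fc(pooled)           # (batch, n_classes)

# Usage with dummy SSL features: a 3-utterance batch of 200 frames each.
logits = SSLBiLSTMClassifier()(torch.randn(3, 200, 768))
```

The average pooling over the time axis is what later allows variable-length PF inputs to be handled by a single fixed-size classifier head.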
The rest of the paper is organized as follows: Section 2 presents the details of the ADD 2022 challenge and its data. Section 3 describes our single systems and their setups. Section 4 presents and analyzes the experimental results. Section 5 summarizes our conclusions.
2 TASK DESCRIPTION AND DATA
The data for the challenge consists of training, development, adaptation, and test sets. Utterances in both the training and development sets are noiseless: genuine utterances are selected from the clean AISHELL-3 [8] corpus, and fake utterances are generated by mainstream speech synthesis and voice conversion systems built on AISHELL-3. Data for the LF and PF tracks is distributed as 16 kHz, 16-bit WAV files. The datasets are summarized in Table 1. Since the organizers only provide fake audio for the adaptation set of track 2, we randomly select 1,052 genuine utterances (the same number as the fake ones) from the training and development sets to balance the fake and genuine classes; we call this set Adp2★ and use it to compare SSL models.
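A minimal sketch of assembling this balanced set is shown below; the utterance-ID lists are placeholders standing in for the official metadata, and the fixed seed is our own choice for reproducibility.

```python
import random

# Hypothetical utterance-ID lists standing in for the real metadata:
# 3,012 + 2,307 genuine utterances from the train and dev sets (Table 1),
# and the 1,052 fake utterances of the track-2 adaptation set.
genuine_pool = [f"genuine_{i:05d}" for i in range(5319)]
fake_adapt = [f"fake_{i:04d}" for i in range(1052)]

random.seed(0)                                    # reproducible draw
sampled_genuine = random.sample(genuine_pool, len(fake_adapt))

# Adp2*: 1,052 genuine + 1,052 fake = 2,104 balanced utterances,
# labelled 0 = genuine, 1 = fake.
adp2_star = ([(uid, 0) for uid in sampled_genuine]
             + [(uid, 1) for uid in fake_adapt])
```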
For the LF track, the adaptation and test sets are composed of genuine and fully fake utterances, but the test set is more complicated. Since the provided training and development utterances are noiseless, this track requires developing systems that remain robust in noisy environments under the mismatch between training and testing data.
The PF track is a new topic proposed in 2021 [9, 10]. As the name suggests, the spoofed utterances in this scenario are only partially fake. The genuine segments within partially fake utterances can therefore degrade the performance of countermeasure (CM) systems, making this track more challenging. Moreover, since fake utterances contain genuine or spoofed clips, the key to detecting partially fake audio is to detect frame transitions or changes in the temporal domain. The evaluation metric for both tracks is the Equal Error Rate (EER) [6].
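For reference, the EER is the operating point at which the false-acceptance rate (fake accepted as genuine) equals the false-rejection rate (genuine rejected as fake). A minimal NumPy sketch of its computation, assuming higher scores indicate genuine speech:

```python
import numpy as np

def compute_eer(genuine_scores, fake_scores):
    """Equal Error Rate: the threshold where the false-acceptance
    rate equals the false-rejection rate (higher score = more genuine)."""
    scores = np.concatenate([genuine_scores, fake_scores])
    labels = np.concatenate([np.ones_like(genuine_scores),
                             np.zeros_like(fake_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep thresholds upward: FRR rises while FAR falls.
    frr = np.cumsum(labels) / labels.sum()                   # genuine rejected
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # fakes accepted
    idx = np.nanargmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# Example: well-separated toy scores give an EER of 0.
eer = compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.2, 0.3, 0.1]))
```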
3 SYSTEM DESCRIPTION
3.1 Features
As shown in Table 2, we explore various features in three categories: temporal raw waveforms, hand-crafted features, and deep embedding features. The combination of these features is expected to fully capture the spectro-temporal divergences between spoofed and genuine utterances.
Besides, we apply different trimming strategies to the inputs of the two tracks. For the LF track, a unified duration of 3 s is used for feature extraction. For the PF track, as information from the entire utterance is of significant importance, the whole utterance is used as input (capped at 15 s in practice due to computational limitations). An average pooling layer is employed to handle inputs of varied duration.
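A minimal sketch of these two trimming policies follows; the tiling of short LF clips up to 3 s is our assumption, as the exact padding scheme is not specified here.

```python
import torch

def prepare_input(wav, track, sr=16000):
    """Track-specific input trimming (sketch; padding policy assumed).

    LF: fix every utterance to 3 s by truncating or tiling.
    PF: keep the whole utterance, capped at 15 s; variable lengths
        are later reduced by the average-pooling layer.
    """
    if track == "LF":
        target = 3 * sr
        if wav.numel() < target:                  # tile short clips
            reps = -(-target // wav.numel())      # ceiling division
            wav = wav.repeat(reps)
        return wav[:target]
    else:  # "PF"
        return wav[:15 * sr]

# Example: a 1.2 s clip becomes exactly 3 s for the LF track.
clip = torch.randn(int(1.2 * 16000))
assert prepare_input(clip, "LF").numel() == 3 * 16000
```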
3.1.1 Temporal raw feature.
Considerable evidence [11] shows that avoiding hand-crafted features in an end-to-end architecture may improve the performance of anti-spoofing systems. Therefore, following [12], we use the raw waveform as input to a bank of Sinc filters.
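As an illustration, a SincNet-style band-pass kernel can be built as the difference of two windowed low-pass sinc filters. The band edges and kernel size below are placeholders; in [12] the band edges are learned parameters rather than fixed constants.

```python
import torch

def sinc_bandpass(f1_hz, f2_hz, kernel_size=251, sr=16000):
    """One band-pass sinc kernel: difference of two low-pass sinc
    filters with cutoffs f1 < f2, shaped by a Hamming window."""
    n = torch.arange(kernel_size) - (kernel_size - 1) / 2
    f1, f2 = f1_hz / sr, f2_hz / sr
    h = 2 * f2 * torch.sinc(2 * f2 * n) - 2 * f1 * torch.sinc(2 * f1 * n)
    return h * torch.hamming_window(kernel_size, periodic=False)

# Convolve a raw waveform with a bank of 10 band-pass kernels.
bank = torch.stack([sinc_bandpass(lo, lo + 200)
                    for lo in range(100, 2100, 200)])
wav = torch.randn(1, 1, 16000)                          # 1 s at 16 kHz
feats = torch.nn.functional.conv1d(wav, bank.unsqueeze(1))
```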
3.1.2 Hand-crafted features.
Hand-crafted features are the most commonly used in anti-spoofing, since they encode specific prior knowledge. All features applied or designed in this subsection are extracted online on the GPU, which accelerates training.
Spectrogram. The log-spectrogram [13] is extracted with a 50 ms frame length, 25 ms frame shift, 1024 FFT points, and a Hamming window. Both magnitude and phase spectrograms are extracted.
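This extraction can be sketched with torch.stft as follows; the small epsilon added before the logarithm is our own numerical-stability choice.

```python
import torch

def log_spectrogram(wav, sr=16000):
    """Log-magnitude and phase spectrograms: 50 ms frames (800 samples
    at 16 kHz), 25 ms shift (400 samples), 1024-point FFT, Hamming window."""
    win = int(0.050 * sr)
    spec = torch.stft(wav,
                      n_fft=1024,
                      hop_length=int(0.025 * sr),   # 25 ms frame shift
                      win_length=win,               # 50 ms frame length
                      window=torch.hamming_window(win),
                      return_complex=True)
    log_mag = torch.log(spec.abs() + 1e-9)          # log-magnitude
    phase = spec.angle()                            # phase
    return log_mag, phase

# Example on a 3 s input, matching the LF trimming duration.
log_mag, phase = log_spectrogram(torch.randn(3 * 16000))
```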