
the ASV cannot recognize the speaker [7,10,25,26,54,57].
Adversarial perturbations can be generated in two ways.
• Iterative optimization. Optimization-based methods formulate adversarial perturbation generation as a constrained optimization problem [7,10,25,26,57]. Since the formulated optimization problems are usually NP-hard, the solutions can only be approximated through iterative updates, which is quite time-consuming. Therefore, iterative optimization methods cannot be applied to real-time services (see the sketch after this list).
• One-shot generative model. Generative models can be trained to produce adversarial perturbations in one shot. Commonly used generative models include Generative Adversarial Networks (GANs) and autoencoders [34,35,46]. As far as we know, FAPG [54] is the only study on generative model-based adversarial examples against ASV. However, FAPG mainly focuses on deceiving ASVs rather than preserving the intelligibility and naturalness of the audio.
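To make the cost of iterative optimization concrete, the following is a minimal PGD-style sketch of untargeted perturbation generation against an ASV embedding model. The `extract_voiceprint` function and the cosine-similarity score are assumptions standing in for a concrete differentiable ASV; the hyperparameters are illustrative only.

```python
import torch
import torch.nn.functional as F

def iterative_perturbation(x, extract_voiceprint, eps=0.005, alpha=0.0005, steps=1000):
    """PGD-style untargeted perturbation against an assumed ASV embedding model.
    x:                  clean audio, shape (batch, D), values in [-1, 1]
    extract_voiceprint: assumed differentiable voiceprint extractor
    eps:                l_inf budget of the perturbation"""
    clean_vp = extract_voiceprint(x).detach()
    delta = torch.zeros_like(x, requires_grad=True)

    for _ in range(steps):  # hundreds of gradient steps: too slow for real-time use
        similarity = F.cosine_similarity(
            extract_voiceprint(x + delta), clean_vp, dim=-1).mean()
        similarity.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                   # push voiceprint away
            delta.clamp_(-eps, eps)                              # l_inf constraint
            delta.copy_(torch.clamp(x + delta, -1.0, 1.0) - x)   # keep audio in range
        delta.grad.zero_()

    return (x + delta).detach()
```

Each call requires hundreds of forward and backward passes through the ASV model per utterance, which is the latency bottleneck that one-shot generative approaches avoid.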
2.5 Threat Model
We define the threat model in terms of the adversary's knowledge and capability, and then elaborate on the performance goals of voice anonymization under the defined threat model. As shown in Figure 2, we consider three kinds of adversaries, i.e., ignorant (A1), semi-informed (A2), and informed (A3), each with different knowledge and capabilities.
Knowledge. The adversary has an anonymized audio sample whose speaker is unknown, and has collected a few clean audio samples from a pool of potential speakers to help with identity inference. Adversary A1 does not know that the audio is anonymized. Adversary A2 knows that the audio is anonymized but does not know the specific anonymizer. Adversary A3 has full knowledge of the anonymizer.
Capability. Adversaries A1, A2, and A3 can use any ASV to infer the speaker of the anonymized audio. As shown in Figure 2, A1 and A2 enroll potential speakers in the ASV using clean audio samples, whereas A3 enrolls potential speakers using audio samples processed by the anonymizer. In the inference phase, A1 directly feeds the anonymized audio into the ASV; A2 applies de-noising methods to the anonymized audio and feeds the de-noised audio into the ASV; A3 also directly feeds the anonymized audio into the ASV.
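For concreteness, the sketch below summarizes the three attack pipelines under an assumed ASV interface with `enroll` and `identify` methods and an assumed `denoise` function; these names are illustrative and do not refer to any particular toolkit.

```python
def adversary_pipeline(adversary, asv, anonymizer, denoise,
                       speaker_pool, anonymized_audio):
    """Illustrative pipelines for adversaries A1 (ignorant), A2 (semi-informed),
    and A3 (informed). `asv`, `anonymizer`, and `denoise` are assumed interfaces."""
    # Enrollment phase.
    for speaker_id, clean_samples in speaker_pool.items():
        if adversary == "A3":
            # A3 knows the anonymizer and enrolls anonymizer-processed samples.
            samples = [anonymizer(s) for s in clean_samples]
        else:
            # A1 and A2 enroll clean samples.
            samples = clean_samples
        asv.enroll(speaker_id, samples)

    # Inference phase.
    if adversary == "A2":
        # A2 knows the audio is anonymized and first tries to remove the perturbation.
        probe = denoise(anonymized_audio)
    else:
        # A1 and A3 feed the anonymized audio into the ASV directly.
        probe = anonymized_audio
    return asv.identify(probe)  # predicted speaker identity
```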
Given the knowledge and capabilities of these adversaries, we further specify the anonymity goals with respect to each type of adversary. More specifically, against ignorant and semi-informed adversaries, the speaker of the anonymized audio should be unidentifiable; against informed adversaries, the speaker and the anonymized audio should be unlinkable.
Unidentifiability. For A1 and A2, who enroll clean voiceprints into the ASV, the speaker of an anonymized audio should not be identifiable during the inference phase.
Unlinkability. For A3, who enrolls anonymizer-processed voiceprints into the ASV, the speaker of an anonymized audio should be indistinguishable from other speakers.
3 Problem Formulation
Before delving into the design details of V-CLOAK, in this section we formally formulate voice anonymization as a constrained optimization problem.
Consider an audio sample $x = [x_1, \cdots, x_D] \in \mathbb{R}^{1 \times D}$, where $\mathbb{R}^{1 \times D}$ is the $D$-dimensional real number field and $D$ is the length of the audio. Without loss of generality, we assume $x_i \in [-1, 1]$. We aim to obtain an anonymized audio $\tilde{x}$ such that the ASV cannot match the voiceprint of $\tilde{x}$ with that of $x$. Let $V: \mathbb{R}^{1 \times \cdot} \rightarrow \mathbb{R}^{1 \times N}$ denote the voiceprint extraction function that outputs a voiceprint of a fixed length $N$, and $G: \mathbb{R}^{1 \times D} \rightarrow \mathbb{R}^{1 \times D}$ denote the anonymizer function.
Basic Formulation:
$$\min_{G} \ \mathcal{L}_{\mathrm{ASV}} \quad \text{s.t.} \quad \|\tilde{x} - x\|_\infty \le \varepsilon \ \text{ and } \ x, \tilde{x} \in [-1, 1],$$
where
$$\mathcal{L}_{\mathrm{ASV}} = \begin{cases} S(V(\tilde{x}), V(x)), & \text{untargeted anonymization}, \\ -S(V(\tilde{x}), v), & \text{targeted anonymization}, \end{cases} \qquad
\tilde{x} = \begin{cases} G(x), & \text{untargeted anonymization}, \\ G(x, v), & \text{targeted anonymization}, \end{cases} \tag{1}$$
where $\varepsilon$ constrains the $\ell_\infty$-norm difference between $x$ and $\tilde{x}$, $S(\cdot,\cdot)$ is the scoring function measuring the similarity between the voiceprints of $x$ and $\tilde{x}$, and $v$ is the voiceprint of a speaker other than $x$. With untargeted anonymization, the voiceprint of the anonymized audio is diverted from that of the original audio as much as possible, which guarantees unidentifiability, i.e., the voiceprint of the anonymized audio will not match the voiceprint of the original audio. With targeted anonymization, the voiceprints of two anonymized audios with different original speakers but the same target speaker $v$ will both be matched with $v$ (thus be matched together),
which guarantees both unidentifiability and unlinkability. We
theoretically analyze the unidentifiability and the unlinkabil-
ity of targeted and untargeted anonymizations in Appendix A
and perform corresponding evaluations in §5.
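As a rough illustration of Equation (1), a sketch of the anonymization loss is given below, assuming a differentiable voiceprint extractor `V` and cosine similarity as the scoring function `S`; the anonymizer `G` itself is left abstract, and none of these names refer to the actual V-CLOAK implementation.

```python
import torch.nn.functional as F

def anonymization_loss(x, G, V, target_vp=None):
    """Sketch of L_ASV in Equation (1).
    x:         original audio, shape (batch, D), values in [-1, 1]
    G:         anonymizer network (assumed callable as G(x) or G(x, v))
    V:         voiceprint extractor returning fixed-length embeddings
    target_vp: voiceprint v of a target speaker; None selects untargeted mode."""
    if target_vp is None:
        x_anon = G(x)                                   # untargeted: x~ = G(x)
        # Minimize S(V(x~), V(x)): drive the anonymized voiceprint away from x's.
        loss = F.cosine_similarity(V(x_anon), V(x), dim=-1).mean()
    else:
        x_anon = G(x, target_vp)                        # targeted: x~ = G(x, v)
        # Minimize -S(V(x~), v): pull the anonymized voiceprint toward v.
        loss = -F.cosine_similarity(V(x_anon), target_vp, dim=-1).mean()
    return loss, x_anon
```

The $\ell_\infty$ and amplitude constraints of Equation (1) are not enforced in this sketch; in a full implementation they would be imposed by the architecture of $G$ or by clipping its output.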
The anonymized audio obtained by Equation (1) satisfies
the basic goal of anonymity but may suffer from quality degra-
dation in terms of intelligibility, naturalness and timbre. To
tackle this problem, we equip the basic optimization problem
with loss terms that address the performance goals of intelligi-
bility, naturalness and timbre preservation. More specifically,
we introduce an ASR-related loss term, which maintains the