V-CLOAK: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization
Jiangyi Deng1, Fei Teng1, Yanjiao Chen1, Xiaofu Chen2, Zhaohui Wang2, Wenyuan Xu1
1Zhejiang University, 2Wuhan University
Demo: https://v-cloak.com
Abstract
Voice data generated on instant messaging or social media applications contains unique user voiceprints that may be abused by malicious adversaries for identity inference or identity theft. Existing voice anonymization techniques, e.g., signal processing and voice conversion/synthesis, suffer from degradation of perceptual quality. In this paper, we develop a voice anonymization system, named V-CLOAK, which attains real-time voice anonymization while preserving the intelligibility, naturalness and timbre of the audio. Our designed anonymizer features a one-shot generative model that modulates the features of the original audio at different frequency levels. We train the anonymizer with a carefully-designed loss function. Apart from the anonymity loss, we further incorporate the intelligibility loss and the psychoacoustics-based naturalness loss. The anonymizer can realize untargeted and targeted anonymization to achieve the anonymity goals of unidentifiability and unlinkability.

We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French) and CommonVoice (Italian), five Automatic Speaker Verification (ASV) systems (including two DNN-based, two statistical and one commercial ASV), and eleven Automatic Speech Recognition (ASR) systems (for different languages). Experiment results confirm that V-CLOAK outperforms five baselines in terms of anonymity performance. We also demonstrate that V-CLOAK trained only on the VoxCeleb1 dataset against ECAPA-TDNN ASV and DeepSpeech2 ASR has transferable anonymity against other ASVs and cross-language intelligibility for other ASRs. Furthermore, we verify the robustness of V-CLOAK against various de-noising techniques and adaptive attacks. Hopefully, V-CLOAK may provide a cloak for us in a prism world.
1 Introduction
*Corresponding author.

Figure 1: Voiceprint in voice data may be leveraged by malicious adversaries for identity inference or identity theft. The raw audio is cloaked with V-CLOAK before being passed to applications, so that malicious service providers or third parties can only obtain a pseudo identity/voiceprint.

Voiceprint is a critical biometric that can uniquely identify a person. As massive personal data is collected and processed by online services, there are rising concerns about privacy leakage. In 2018, the European Union enforced the General Data Protection Regulation (GDPR) [1] for personal data protection, especially for biometric data. However, an avalanche of voice data is generated daily on social media (e.g., Facebook/Meta, WeChat, TikTok) and in communication applications (e.g., Zoom, Slack, Microsoft Teams, DingTalk), and automated processing methods, e.g., ASV, can easily extract voiceprints for ill use. For example, as shown in Figure 1, an adversary may infer the speaker identity of a private conversation from voice messages uploaded to the cloud with an ASV [18,21,27]. Therefore, there is an urgent demand for voice anonymization to help users protect their voiceprints while enjoying voice-related services (e.g., speech recognition by ASR) and interpersonal communication (e.g., human listeners can identify the speaker).

Existing voice anonymization methods are mainly based on voice signal processing (SP), voice conversion (VC) and voice synthesis (VS). SP [33,52] methods directly apply signal processing techniques to modify speaker-related features in audios to obscure voiceprints. Nonetheless, SP-based voice anonymization usually induces large quality degradation, as intelligibility and naturalness are not considered.
Table 1: V-CLOAK versus existing works.

Method              Type  Intelligibility#  Naturalness#  Timbre-preserving  Real-Time Coef.  User-agnostic
VoiceMask [37,38]   VC    No                No            No                 0.041            Yes
Yoo [55]            VC    No                Yes           No                 N.K.             No
NSF [13,15]         VS    Yes               Yes           No                 0.110            No
HFGAN [28]          VS    Yes               Yes           No                 0.104            Yes
Justin [20]         VS    Yes               Yes           No                 N.K.             Yes
McAdams [33]        SP    No                No            No                 0.030            Yes
Vaidya [52]         SP    No                No            No                 N.K.             Yes
V-CLOAK (Ours)      Adv   Yes               Yes           Yes                0.011            Yes

(i) Type: Voice Conversion (VC), Voice Synthesis (VS), Signal Processing (SP), Adversarial examples (Adv). (ii) #: whether the method has explicit constraints on intelligibility or naturalness. (iii) Real-Time Coef.: the real-time coefficient (RTC), the ratio between the processing time and the duration of the audio; the lower the RTC, the more efficient the method. We measure the five methods under the same computing resource conditions. N.K. (not known): the authors did not evaluate the efficiency of their methods or make their code available. (iv) User-agnostic: whether the method needs to be trained for a new user.
VC [37,38,47,55] and VS [13,15,20,28] methods convert the original audio into another audio that sounds completely different from the original speaker. Although VC and VS may achieve anonymity, they are not suitable for scenarios where the user wants to hide their identity from ASVs but hopes to preserve their personal timbre for human audiences, e.g., posts of celebrities on social media or voice messages with acquaintances.
In this paper, we make the first attempt to design a real-time voice anonymization system, named V-CLOAK, which achieves anonymity while preserving the intelligibility, naturalness, and timbre of the audios. A comparison of V-CLOAK with existing works is shown in Table 1. Nonetheless, realizing these design goals in a practical real-time system is challenging in three aspects.
How to achieve real-time voice anonymity against adaptive attacks?
Different from traditional signal processing and voice conversion & synthesis, we are inspired by adversarial examples that can trick an ASV into misidentifying the speaker while inducing imperceptible differences to the human auditory system. Nonetheless, directly applying adversarial examples to voice anonymization has two major issues. First, most existing ASV adversarial examples [7,10,25,26,57] are constructed via iterative updates, which cannot achieve real-time voice anonymization. To the best of our knowledge, there is only one ASV adversarial attack, named FAPG, that creates adversarial examples using a one-shot generative model [54]. Unfortunately, FAPG needs to train a feature map for each potential target speaker, and the original paper only evaluates an ASV with 10 speakers. Second, the adversary may be informed of the anonymization method and the model (anonymizer), and then launch an adaptive attack to de-anonymize the anonymized audio.
To tackle these problems, we adapt a lightweight generative model, Wave-U-Net [48], for V-CLOAK. We equip Wave-U-Net with two novel components, i.e., VP-Modulation and Throttle. VP-Modulation modulates the feature elements of the original audio at each frequency level according to the voiceprint of a target speaker. Throttle adjusts the weights of the features of the original audio at different frequency levels to conform to the constraint on the anonymization perturbations. The trained anonymizer can produce anonymized audios targeting any speaker/voiceprint under any anonymization perturbation constraint without re-training. Furthermore, we conduct theoretical analysis and experiments to verify the anonymity of V-CLOAK in the case of adaptive attacks.
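The paper describes these two components only at the architecture level. For concreteness, the following PyTorch sketch shows one way they could look; the class names mirror the paper's terms, but the FiLM-style conditioning, the layer shapes, and the tanh-based budget projection are our illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: conditioning style and shapes are assumptions.
import torch
import torch.nn as nn

class VPModulation(nn.Module):
    """Modulate feature maps of one frequency level with a target voiceprint."""
    def __init__(self, vp_dim: int, channels: int):
        super().__init__()
        self.to_scale = nn.Linear(vp_dim, channels)
        self.to_shift = nn.Linear(vp_dim, channels)

    def forward(self, feats: torch.Tensor, vp: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T) features at one level; vp: (B, vp_dim) voiceprint
        scale = self.to_scale(vp).unsqueeze(-1)  # (B, C, 1)
        shift = self.to_shift(vp).unsqueeze(-1)
        return feats * (1 + scale) + shift

class Throttle(nn.Module):
    """Rescale the raw perturbation so it respects an l_inf budget eps."""
    def forward(self, delta: torch.Tensor, eps: float) -> torch.Tensor:
        return eps * torch.tanh(delta)  # guarantees |delta| <= eps

# Inside a Wave-U-Net-like generator G, each decoder level would apply
# VPModulation(feats, vp); the final output delta passes through Throttle,
# and the anonymized audio is (x + delta).clamp(-1, 1).
```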
How to maintain objective and subjective intelligibility of anonymized audios?
It is desirable for the anonymized audios to be intelligible to ASRs (objective intelligibility), such that users can still enjoy speech-to-text services, and to humans (subjective intelligibility), such that voice messages can be understood. However, SP- and VC-based anonymization, as well as voice adversarial examples, do not consider intelligibility constraints and may introduce noises that greatly degrade intelligibility. To address this issue, we impose an intelligibility loss when training the anonymizer. The intelligibility loss is based on the decoding error rate of the ASR. Instead of the commonly-used Connectionist Temporal Classification (CTC) loss of ASR, we adopt the graphemic posteriorgram (GPG) loss, which preserves the full alignment between the transcription and the grapheme of each frame. The subjective intelligibility is achieved by constraining the anonymization perturbations with our proposed Throttle module and by better masking the anonymization perturbations based on psychoacoustics.
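The GPG loss is defined here only at this level of detail. Below is a minimal sketch of one plausible reading: match the anonymized audio's frame-level grapheme posteriors to those the same ASR produces for the original audio, which keeps the per-frame alignment that CTC marginalizes away. The tensor shapes and the KL-divergence form are our assumptions.

```python
import torch
import torch.nn.functional as F

def gpg_loss(asr_logits_anon: torch.Tensor,
             asr_logits_clean: torch.Tensor) -> torch.Tensor:
    """Frame-wise grapheme posteriorgram loss (our reading of the GPG loss).

    asr_logits_*: (B, T, V) per-frame grapheme logits from the same ASR for
    the anonymized and the original audio; V includes the blank token.
    """
    log_p_anon = F.log_softmax(asr_logits_anon, dim=-1)
    p_clean = F.softmax(asr_logits_clean, dim=-1).detach()  # soft targets
    # KL(clean || anon) per frame keeps the full frame-to-grapheme alignment,
    # unlike CTC, which sums over all alignments of the transcription.
    return F.kl_div(log_p_anon, p_clean, reduction="batchmean")
```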
How to preserve the naturalness and timbre of anonymized audios?
Naturalness and timbre preservation are important to human audiences of anonymized audios. Signal processing and existing ASV adversarial examples do not consider naturalness, so the processed audios may sound mechanical. In addition, signal processing, voice conversion and voice synthesis all distort the timbre of the original speaker, such that the anonymized audio sounds unlike being spoken by the original speaker (e.g., a friend or a celebrity). To cope with this problem, we introduce a naturalness & timbre loss when training the anonymizer based on the psychoacoustic theory of masking effects. Our user study verifies that the anonymized audios of V-CLOAK receive high naturalness and timbre scores.
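As a rough illustration of how a masking-based loss can be shaped (the exact loss is detailed later in the paper; this form is our assumption), one can penalize only the perturbation energy that rises above a masking threshold derived from the original audio:

```python
import torch

def masking_loss(delta_spec_db: torch.Tensor,
                 mask_threshold_db: torch.Tensor) -> torch.Tensor:
    """Penalize perturbation energy that exceeds the masker's threshold.

    delta_spec_db:     (B, F, T) perturbation magnitude spectrogram in dB.
    mask_threshold_db: (B, F, T) masking threshold of the original audio in dB.
    Only the audible excess above the threshold is penalized, so the
    perturbation hides under the original signal's spectral maskers.
    """
    excess = torch.relu(delta_spec_db - mask_threshold_db)
    return excess.mean()
```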
We implement a fully-functional prototype of V-CLOAK and evaluate it with extensive experiments on five ASVs (anonymity) and eleven ASRs (intelligibility) with datasets of four languages (English, Chinese, French, Italian). The comparison with five baselines demonstrates that V-CLOAK achieves the best anonymization performance with the second-best intelligibility performance. Cross-language experiments show that the anonymizer of V-CLOAK trained on one ASV and one ASR can be transferred to other ASVs and ASRs (of different languages). A user study with 102 volunteers confirms the intelligibility-, naturalness- and timbre-preserving properties of V-CLOAK.
We summarize our main contributions as follows.

• We propose V-CLOAK, an intelligibility-, naturalness- and timbre-preserving voice anonymization system. V-CLOAK is proved and evaluated to fulfil the anonymization goals of unidentifiability and unlinkability against naive and adaptive adversaries.

• We develop a real-time anonymizer that transforms the original audio into targeted or untargeted anonymized audios. The anonymizer is trained with anonymity, intelligibility, naturalness and timbre losses, generalizing to any new original speaker or new target speaker without the need for re-training.

• We conduct extensive experiments to verify the effectiveness and efficiency of V-CLOAK under various testing conditions, and a user study to confirm the practicality and applicability of V-CLOAK.
2 Background
2.1 Voice Data
In the digital world, voice data of a user is massively generated and distributed for various purposes, e.g., communications via voice messages or video posts on social media. Such widely exposed voice data may be easily collected by service providers or third parties. For instance, Facebook collects audio data from voice messages on its social network platform, and even attempts to transcribe the content of these private messages [21]. TikTok revised its privacy policy to legitimize the collection of faceprints and voiceprints from videos uploaded by users, and even claimed the possibility of data sharing for business purposes [27].
Voice data contains two kinds of information, i.e., speech contents and phonetic features.

• Speech contents. Speech contents refer to the linguistic information contained in the voice data, i.e., "what are the words spoken." Speech contents determine the intelligibility of the voice data.

• Phonetic features. Phonetic features refer to the way the speech contents are conveyed in the voice data, i.e., "how are the words spoken." Phonetic features affect the timbre of the voice data.
Voiceprint is a phonetic feature that can uniquely identify a speaker. However, the voiceprint contained in voice data may be abused for identity inference or identity theft. On the one hand, voiceprint may be used by automatic speaker verification (ASV) systems to infer the identities of speakers in a private conversation. On the other hand, the voiceprint of a speaker may be extracted from audios to synthesize audios that pass voiceprint-based authentication systems. For example, WeChat, a popular messaging app in China, allows users to log in via voiceprint [3]. In the face of these potential privacy leakages, it is essential for users to anonymize voice data before sending voice messages or publishing videos on social media.
2.2 Psychoacoustics
Psychoacoustics is the study of the relationship between subjective psychological perceptions (e.g., perceived volume, pitch) and objective physical parameters (e.g., sound pressure level, frequency) [56]. The masking effect is one of the most common psychoacoustic phenomena [14]. There are two forms of masking: temporal and spectral. Temporal masking refers to the situation where a sound cannot be perceived if a sudden louder sound appears immediately preceding or following it. The louder sound is called the masker. Spectral masking refers to the imperceptibility of a sound component due to other frequency components played simultaneously. The perception threshold of this component varies with both the sound signals (e.g., frequency) and the listener. We leverage the spectral masking effect to make anonymization perturbations more imperceptible to human users.
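To make the idea concrete, here is a deliberately simplified sketch of a frame-wise spectral masking threshold. A faithful psychoacoustic model (e.g., the MPEG-1 psychoacoustic model used in related audio-attack work) spreads masker energy on the Bark scale, which we skip; the fixed 12 dB offset is our simplification, while the absolute-threshold curve is Terhardt's standard approximation.

```python
import numpy as np

def simple_masking_threshold_db(x: np.ndarray, sr: int = 16000,
                                n_fft: int = 512,
                                offset_db: float = 12.0) -> np.ndarray:
    """Very simplified per-frame masking threshold (illustration only).

    Returns an (F, T) threshold in dB for a mono signal x in [-1, 1]:
    each frame's own spectrum minus a fixed offset, floored by a crude
    absolute threshold of hearing.
    """
    hop = n_fft // 2
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(x[start:start + n_fft] * window))
        frames.append(20 * np.log10(spec + 1e-10))
    spec_db = np.stack(frames, axis=1)              # (F, T)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    f_khz = np.maximum(freqs / 1000.0, 1e-3)
    # Terhardt's approximation of the absolute threshold of hearing.
    ath = (3.64 * f_khz ** -0.8
           - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
           + 1e-3 * f_khz ** 4)
    return np.maximum(spec_db - offset_db, ath[:, None])
```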
2.3 Automatic Speaker Verification & Automatic Speech Recognition
An Automatic Speaker Verification (ASV) system aims to deduce the speakers of audios based on their voiceprints. Speaker inference via an ASV system includes an enrollment phase and an inference phase. In the enrollment phase, clean audio samples of the speaker to be recognized are fed into the ASV such that the voiceprint can be extracted and stored in the ASV. In the inference phase, the ASV takes an audio sample as input and outputs whether the input audio belongs to the enrolled speaker. There are two mainstream methods of extracting and matching voiceprints, i.e., statistical models and Deep Neural Network (DNN)-based models. The Gaussian mixture model (GMM) is a traditional statistical model for extracting ivector voiceprints. ivector-PLDA is a popular ASV implementation that matches ivector voiceprints via probabilistic linear discriminant analysis (PLDA). X-vector is a DNN-based voiceprint extractor, which outperforms GMM as DNNs are more effective at extracting feature representations from large-scale voice datasets. ECAPA-TDNN [9] is the state-of-the-art ASV implementation using end-to-end training, i.e., training the front-end and the back-end jointly as an integrated network [53].
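The enroll-then-verify protocol reduces to embedding extraction plus similarity scoring. A minimal sketch follows, with extract_voiceprint standing in for any extractor (ivector, x-vector, ECAPA-TDNN) and an arbitrary decision threshold of our choosing:

```python
import torch
import torch.nn.functional as F

def enroll(extract_voiceprint, enroll_wavs) -> torch.Tensor:
    """Average the voiceprints of a speaker's clean enrollment utterances.

    extract_voiceprint: any function mapping a waveform to a 1-D embedding.
    """
    vps = torch.stack([extract_voiceprint(w) for w in enroll_wavs])
    return F.normalize(vps.mean(dim=0), dim=-1)

def verify(extract_voiceprint, test_wav, enrolled_vp,
           threshold: float = 0.25) -> bool:
    """Accept if the cosine score against the enrolled voiceprint passes
    a threshold (0.25 here is arbitrary; real systems tune it on held-out data)."""
    vp = F.normalize(extract_voiceprint(test_wav), dim=-1)
    score = torch.dot(vp, enrolled_vp).item()  # S(V(x_test), V(x_enroll))
    return score >= threshold
```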
An Automatic Speech Recognition (ASR) system aims to transcribe the speech contents from audio samples (without the need to know the speaker). In the training process, audios are first transformed into a sequence of spectral frames. Each frame is then transformed into a feature vector. Commonly used features include Filter Bank (FBank) [42], Mel-Frequency Cepstral Coefficients (MFCC) [29], Spectral Subband Centroid (SSC) [50] and Perceptual Linear Predictive (PLP) [16]. Then, the posterior probability of the lingueme (e.g., phoneme, grapheme, or word) contained in each frame is estimated. The linguemes are usually represented as tokens. For example, 29 tokens are used for the English language, i.e., the letters a–z, space, apostrophe and the special blank token φ. Next, the Connectionist Temporal Classification (CTC) module sums the probability of all possible alignments that reduce to the ground-truth sequence. For example, the three-frame sequences [a b φ], [a φ b] and [φ a b] will all be reduced to the ground-truth sequence [a b]. Finally, the model is updated to increase the probability of producing the ground-truth sequence. In the inference phase, a language model may be used to provide a prior probability to find the lingueme sequence of the highest probability.
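The collapse rule behind CTC is easy to state in code. A short self-contained sketch reproducing the [a b] example above (the φ symbol stands for the blank token):

```python
def ctc_collapse(tokens, blank="φ"):
    """Reduce a frame-level token sequence: merge repeats, then drop blanks."""
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# All three alignments from the text reduce to the ground truth [a, b]:
for seq in (["a", "b", "φ"], ["a", "φ", "b"], ["φ", "a", "b"]):
    assert ctc_collapse(seq) == ["a", "b"]

# Training maximizes the summed probability of all such alignments, e.g. via
# torch.nn.CTCLoss(blank=0) applied to per-frame log-probabilities.
```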
2.4 Voice Anonymization

Voice anonymization refers to the practice of removing the voiceprint from voice data. A voice anonymization system needs to satisfy various requirements to fulfil different purposes. Regarding digital voice data privacy, we aim to achieve the following performance goals.

Anonymity. An anonymized audio should not reveal the identity of the speaker. More specifically, we consider concealing speaker identities in the digital domain from ASVs.
Intelligibility. An anonymized audio should be intelligible to both humans and ASRs. More specifically, the speech contents of the anonymized audio can be correctly understood by humans and transcribed by ASRs.

Naturalness & Timbre. An anonymized audio should sound natural and retain the timbre of the original speaker to humans. Studies show that most people find highly mechanical audios irritating and discomforting to listen to, so natural-sounding anonymized audios are more user-friendly [49]. For voice messages and video posts on social media, it is ideal to make the anonymized audios sound as authentic as the original speaker to audiences, especially for communications between acquaintances and the publicity of celebrities.

Figure 2: Threat model. A1: an ignorant adversary who enrolls clean audios into the ASV and feeds the anonymized audio into the ASV to infer the speaker. A2: a semi-informed adversary who enrolls clean audios into the ASV and feeds the de-noised anonymized audio into the ASV to infer the speaker. A3: an informed adversary who enrolls anonymizer-processed audios into the ASV and feeds the anonymized audio into the ASV to infer the speaker.
Voice anonymization can be realized in various ways, as summarized in Table 1.

Voice signal processing. Signal processing techniques attempt to contort the voiceprint by directly modifying the voice signals in terms of formant positions, pitch, tempo, or pauses [33,52]. Though simple and fast, signal processing may degrade the intelligibility and naturalness of audios.

Voice conversion & synthesis. Voice conversion & synthesis techniques aim to replace the voiceprint of the original speaker in an audio with the voiceprint of another speaker [13,15,20,28,37,38,47,55]. Voice conversion & synthesis preserve intelligibility and naturalness, but alter the timbre so that the audio sounds unlike the original speaker. This may reduce the authenticity of voice messages to acquaintances and video posts of celebrities.
Voice adversarial examples. Adversarial example attacks against ASVs add imperceptible noises to audios such that the ASV cannot recognize the speaker [7,10,25,26,54,57]. Adversarial perturbations can be generated in two ways (a sketch contrasting them follows this list).

• Iterative optimization. Optimization-based methods formulate the problem of adversarial perturbation generation as a constrained optimization problem [7,10,25,26,57]. As the formulated optimization problems are usually NP-hard, the solutions can only be approximated through iterative updates, which is quite time-consuming. Therefore, iterative optimization methods cannot be applied to real-time services.

• One-shot generative model. Generative models can be trained to produce adversarial perturbations in one shot. Commonly used generative models include Generative Adversarial Networks (GANs) and autoencoders [34,35,46]. As far as we know, there is only one study on generative model-based adversarial examples against ASV, named FAPG [54]. However, FAPG mainly focuses on deceiving ASVs rather than preserving the intelligibility and naturalness of audios.
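To make the efficiency contrast concrete, here is a hedged sketch of both styles. asv_score (a differentiable similarity against the enrolled voiceprint) and the generator G are placeholders of ours, and the PGD-style loop is a generic iterative attack, not any specific cited method:

```python
import torch

def iterative_perturbation(x, asv_score, enrolled_vp,
                           eps=0.05, steps=100, lr=1e-3):
    """PGD-style untargeted perturbation: many forward/backward passes per audio."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):                        # cost grows with `steps`
        loss = asv_score(x + delta, enrolled_vp)  # push the ASV score down
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()
            delta.clamp_(-eps, eps)               # keep the l_inf budget
        delta.grad.zero_()
    return (x + delta).detach().clamp(-1, 1)

def one_shot_perturbation(x, G):
    """Generator-based anonymization: a single forward pass, usable in real time."""
    with torch.no_grad():
        return G(x).clamp(-1, 1)
```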
2.5 Threat Model

We define the threat model in terms of the adversary's knowledge and capability, and then elaborate the performance goals of voice anonymization under the defined threat model. As shown in Figure 2, we consider three kinds of adversaries, i.e., ignorant (A1), semi-informed (A2), and informed (A3), with different knowledge and capabilities.

Knowledge. The adversary has an anonymized audio whose speaker is unknown. The adversary has collected a few clean samples of a pool of potential speakers to help with identity inference. Adversary A1 does not know that the audio is anonymized. Adversary A2 knows that the audio is anonymized but does not know the specific anonymizer. Adversary A3 has full knowledge of the anonymizer.

Capability. Adversaries A1, A2, and A3 can use any ASVs to infer the speaker of the anonymized audio. As shown in Figure 2, A1 and A2 enroll potential speakers in the ASV using clean audios, and A3 enrolls potential speakers in the ASV using audio samples processed by the anonymizer. In the inference phase, A1 directly feeds the anonymized audio into the ASV; A2 applies de-noising methods to the anonymized audio and feeds the de-noised audio into the ASV; A3 also directly feeds the anonymized audio into the ASV.

In the face of the knowledge and capability of adversaries, we further elaborate the goal of achieving anonymity with regard to different types of adversaries. More specifically, in the case of ignorant and semi-informed adversaries, the speaker of the anonymized audio should be unidentifiable, and in the case of informed adversaries, the speaker and the anonymized audio should be unlinkable.

Unidentifiability. For A1 and A2, who enroll clean voiceprints into the ASV, the speaker of an anonymized audio should not be identified during the inference phase.

Unlinkability. For adversary A3, who enrolls anonymizer-processed voiceprints into the ASV, the speaker of an anonymized audio should be indistinguishable from other speakers.
3 Problem Formulation
Before delving into the design details of V-CLOAK, in this section we formally formulate voice anonymization as a constrained optimization problem.
Given an audio sample $x = [x_1, \cdots, x_D] \in \mathbb{R}^{1 \times D}$, where $\mathbb{R}^{1 \times D}$ is the $D$-dimensional real number field and $D$ is the length of the audio, we assume $x_i \in [-1, 1]$ without loss of generality. We aim to obtain an anonymized audio $\tilde{x}$ such that the ASV cannot match the voiceprint of $\tilde{x}$ with that of $x$. Let $V: \mathbb{R}^{1 \times \cdot} \rightarrow \mathbb{R}^{1 \times N}$ denote the voiceprint extraction function that outputs a voiceprint of a fixed length $N$, and let $G: \mathbb{R}^{1 \times D} \rightarrow \mathbb{R}^{1 \times D}$ denote the anonymizer function.

Basic Formulation:

$$
\begin{aligned}
\min_{G} \quad & \mathcal{L}_{\mathrm{ASV}} \\
\text{s.t.} \quad & \|\tilde{x} - x\|_{\infty} \le \varepsilon \ \text{ and } \ x, \tilde{x} \in [-1, 1], \\
\text{where} \quad
& \mathcal{L}_{\mathrm{ASV}} =
\begin{cases}
S(V(\tilde{x}), V(x)), & \text{untargeted anonymization}, \\
-S(V(\tilde{x}), v), & \text{targeted anonymization},
\end{cases} \\
& \tilde{x} =
\begin{cases}
G(x), & \text{untargeted anonymization}, \\
G(x, v), & \text{targeted anonymization},
\end{cases}
\end{aligned}
\tag{1}
$$

where $\varepsilon$ constrains the $\ell_\infty$-norm difference between $x$ and $\tilde{x}$, $S(\cdot, \cdot)$ is the scoring function measuring the similarity between the voiceprints of $x$ and $\tilde{x}$, and $v$ is the voiceprint of a speaker other than that of $x$. With untargeted anonymization, the voiceprint of the anonymized audio is diverted from that of the original audio as much as possible, which guarantees unidentifiability, i.e., the voiceprint of the anonymized audio will not match the voiceprint of the original audio. With targeted anonymization, the voiceprints of two anonymized audios with different original speakers but the same target speaker $v$ will both be matched with $v$ (and thus be matched with each other), which guarantees both unidentifiability and unlinkability. We theoretically analyze the unidentifiability and unlinkability of targeted and untargeted anonymization in Appendix A and perform corresponding evaluations in §5.
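A direct transcription of $\mathcal{L}_{\mathrm{ASV}}$ from Equation (1) into PyTorch might look as follows; taking $S$ to be cosine similarity is our assumption, since the formulation leaves the scoring function abstract:

```python
import torch
import torch.nn.functional as F

def anonymity_loss(V, x_anon, x_orig=None, v_target=None):
    """L_ASV from Equation (1). V maps a batch of audios to (B, N) voiceprints.

    Untargeted: minimize  S(V(x_anon), V(x_orig)), pushing voiceprints apart.
    Targeted:   minimize -S(V(x_anon), v_target), pulling toward the target.
    S is taken to be cosine similarity (our assumption).
    """
    vp_anon = V(x_anon)
    if v_target is not None:          # targeted anonymization
        return -F.cosine_similarity(vp_anon, v_target, dim=-1).mean()
    vp_orig = V(x_orig).detach()      # original voiceprint is a fixed reference
    return F.cosine_similarity(vp_anon, vp_orig, dim=-1).mean()
```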
The anonymized audio obtained by Equation (1) satisfies the basic goal of anonymity but may suffer from quality degradation in terms of intelligibility, naturalness and timbre. To tackle this problem, we equip the basic optimization problem with loss terms that address the performance goals of intelligibility, naturalness and timbre preservation. More specifically, we introduce an ASR-related loss term, which maintains the intelligibility of the anonymized audios.