V-CLOAK: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization
Jiangyi Deng1, Fei Teng1, Yanjiao Chen1, Xiaofu Chen2, Zhaohui Wang2, Wenyuan Xu1
1Zhejiang University, 2Wuhan University
Demo: https://v-cloak.com
Abstract
Voice data generated on instant messaging or social media applications contains unique user voiceprints that may be abused by malicious adversaries for identity inference or identity theft. Existing voice anonymization techniques, e.g., signal processing and voice conversion/synthesis, suffer from degradation of perceptual quality. In this paper, we develop a voice anonymization system, named V-CLOAK, which attains real-time voice anonymization while preserving the intelligibility, naturalness and timbre of the audio. Our designed anonymizer features a one-shot generative model that modulates the features of the original audio at different frequency levels. We train the anonymizer with a carefully-designed loss function. Apart from the anonymity loss, we further incorporate the intelligibility loss and the psychoacoustics-based naturalness loss. The anonymizer can realize untargeted and targeted anonymization to achieve the anonymity goals of unidentifiability and unlinkability.

We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French) and CommonVoice (Italian), five Automatic Speaker Verification (ASV) systems (including two DNN-based, two statistical and one commercial ASV), and eleven Automatic Speech Recognition (ASR) systems (for different languages). Experiment results confirm that V-CLOAK outperforms five baselines in terms of anonymity performance. We also demonstrate that V-CLOAK trained only on the VoxCeleb1 dataset against ECAPA-TDNN ASV and DeepSpeech2 ASR has transferable anonymity against other ASVs and cross-language intelligibility for other ASRs. Furthermore, we verify the robustness of V-CLOAK against various de-noising techniques and adaptive attacks. Hopefully, V-CLOAK may provide a cloak for us in a prism world.
1 Introduction
*Corresponding author.

Figure 1: Voiceprint in voice data may be leveraged by malicious adversaries for identity inference or identity theft. The raw audio is cloaked with V-CLOAK before being passed to applications, so that malicious service providers or third parties can only obtain a pseudo identity/voiceprint.

Voiceprint is a critical biometric that can uniquely identify a person. As massive personal data is collected and processed by online services, there are rising concerns about privacy leakage. In 2018, the European Union enforced the General Data Protection Regulation (GDPR) [1] for personal data protection, especially for biometric data. However, an avalanche of voice data is generated daily on social media (e.g., Facebook/Meta, WeChat, TikTok) and in communication applications (e.g., Zoom, Slack, Microsoft Teams, DingTalk), and automated processing methods, e.g., ASV, can easily extract voiceprints for ill use. For example, as shown in Figure 1, an adversary may infer the speaker identity of a private conversation from voice messages uploaded to the cloud with an ASV [18,21,27]. Therefore, there is an urgent demand for voice anonymization to help users protect their voiceprints while enjoying voice-related services (e.g., speech recognition by ASR) and interpersonal communication (e.g., human listeners can identify the speaker).

Existing voice anonymization methods are mainly based on voice signal processing (SP), voice conversion (VC) and voice synthesis (VS). SP [33,52] methods directly apply signal processing techniques to modify speaker-related features in audios to obscure voiceprints. Nonetheless, SP-based voice anonymization usually induces large quality degradation, as intelligibility and naturalness are not considered.
Table 1: V-CLOAK versus existing works.

Method              Type  Intelligibility#  Naturalness#  Timbre-preserving  Real-Time Coef.  User-agnostic
VoiceMask [37,38]   VC    No                No            No                 0.041            Yes
Yoo [55]            VC    No                Yes           No                 N.K.             No
NSF [13,15]         VS    Yes               Yes           No                 0.110            No
HFGAN [28]          VS    Yes               Yes           No                 0.104            Yes
Justin [20]         VS    Yes               Yes           No                 N.K.             Yes
McAdams [33]        SP    No                No            No                 0.030            Yes
Vaidya [52]         SP    No                No            No                 N.K.             Yes
V-CLOAK (Ours)      Adv   Yes               Yes           Yes                0.011            Yes

(i) Type: Voice Conversion (VC), Voice Synthesis (VS), Signal Processing (SP), Adversarial examples (Adv). (ii) #: whether the method has explicit constraints on intelligibility or naturalness. (iii) Real-Time Coef.: the real-time coefficient (RTC), the ratio between the processing time and the duration of the audio; the lower the RTC, the more efficient the method. We measure the five methods under the same computing resource conditions. N.K. (not known): the authors did not evaluate the efficiency of their methods or make their code available. (iv) User-agnostic: whether the method needs to be trained for a new user.
VC [37,38,47,55] and VS [13,15,20,28] methods convert the original audio into another audio that sounds completely different from the original speaker. Although VC and VS may achieve anonymity, they are not suitable for scenarios where the user wants to hide their identity from ASVs but hopes to preserve their personal timbre for human audiences, e.g., posts of celebrities on social media or voice messages with acquaintances.
In this paper, we make the first attempt to design a real-time voice anonymization system, named V-CLOAK, which achieves anonymity while preserving the intelligibility, naturalness, and timbre of the audios. A comparison of V-CLOAK with existing works is shown in Table 1. Nonetheless, realizing these design goals in a practical real-time system is challenging in three aspects.
How to achieve real-time voice anonymity against adaptive attacks?
Different from traditional signal processing and voice conversion & synthesis, we are inspired by adversarial examples that can trick an ASV into misidentifying the speaker while inducing imperceptible differences to the human auditory system. Nonetheless, directly applying adversarial examples to voice anonymization has two major issues. First, most existing ASV adversarial examples [7,10,25,26,57] are constructed via iterative updates, which cannot achieve real-time voice anonymization. To the best of our knowledge, there is only one ASV adversarial attack, named FAPG, that creates adversarial examples using a one-shot generative model [54]. Unfortunately, FAPG needs to train a feature map for each potential target speaker, and the original paper only evaluates an ASV with 10 speakers. Second, the adversary may be informed of the anonymization method and the model (anonymizer), and then launch an adaptive attack to de-anonymize the anonymized audio.
To tackle these problems, we adapt a lightweight generative model, Wave-U-Net [48], for V-CLOAK. We equip Wave-U-Net with two novel components, i.e., VP-Modulation and Throttle. VP-Modulation modulates the feature elements of the original audio at each frequency level according to the voiceprint of a target speaker. Throttle adjusts the weights of the features of the original audio at different frequency levels to conform to the constraint on the anonymization perturbations. The trained anonymizer can produce anonymized audios targeting any speaker/voiceprint under any anonymization perturbation constraint without re-training. Furthermore, we conduct theoretical analysis and experiments to verify the anonymity of V-CLOAK in the case of adaptive attacks.
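The paper describes these two components only at the architecture level. For concreteness, the following PyTorch sketch shows one way they could look; the class names mirror the paper's terms, but the FiLM-style conditioning, the layer shapes, and the tanh-based budget projection are our illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: conditioning style and shapes are assumptions.
import torch
import torch.nn as nn

class VPModulation(nn.Module):
    """Modulate feature maps of one frequency level with a target voiceprint."""
    def __init__(self, vp_dim: int, channels: int):
        super().__init__()
        self.to_scale = nn.Linear(vp_dim, channels)
        self.to_shift = nn.Linear(vp_dim, channels)

    def forward(self, feats: torch.Tensor, vp: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T) features at one level; vp: (B, vp_dim) voiceprint
        scale = self.to_scale(vp).unsqueeze(-1)  # (B, C, 1)
        shift = self.to_shift(vp).unsqueeze(-1)
        return feats * (1 + scale) + shift

class Throttle(nn.Module):
    """Rescale the raw perturbation so it respects an l_inf budget eps."""
    def forward(self, delta: torch.Tensor, eps: float) -> torch.Tensor:
        return eps * torch.tanh(delta)  # guarantees |delta| <= eps

# Inside a Wave-U-Net-like generator G, each decoder level would apply
# VPModulation(feats, vp); the final output delta passes through Throttle,
# and the anonymized audio is (x + delta).clamp(-1, 1).
```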
How to maintain objective and subjective intelligibility of anonymized audios?
It is desirable for the anonymized audios to be intelligible to ASRs (objective intelligibility), such that users can still enjoy speech-to-text services, and to humans (subjective intelligibility), such that voice messages can be understood. However, SP- and VC-based anonymization, as well as voice adversarial examples, do not consider intelligibility constraints and may introduce noises that greatly degrade intelligibility. To address this issue, we impose an intelligibility loss when training the anonymizer. The intelligibility loss is based on the decoding error rate of the ASR. Instead of the commonly-used Connectionist Temporal Classification (CTC) loss of ASR, we adopt the graphemic posteriorgram (GPG) loss, which preserves the full alignment between the transcription and the grapheme of each frame. The subjective intelligibility is achieved by constraining the anonymization perturbations with our proposed Throttle module and by better masking the anonymization perturbations based on psychoacoustics.
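The GPG loss is defined here only at this level of detail. Below is a minimal sketch of one plausible reading: match the anonymized audio's frame-level grapheme posteriors to those the same ASR produces for the original audio, which keeps the per-frame alignment that CTC marginalizes away. The tensor shapes and the KL-divergence form are our assumptions.

```python
import torch
import torch.nn.functional as F

def gpg_loss(asr_logits_anon: torch.Tensor,
             asr_logits_clean: torch.Tensor) -> torch.Tensor:
    """Frame-wise grapheme posteriorgram loss (our reading of the GPG loss).

    asr_logits_*: (B, T, V) per-frame grapheme logits from the same ASR for
    the anonymized and the original audio; V includes the blank token.
    """
    log_p_anon = F.log_softmax(asr_logits_anon, dim=-1)
    p_clean = F.softmax(asr_logits_clean, dim=-1).detach()  # soft targets
    # KL(clean || anon) per frame keeps the full frame-to-grapheme alignment,
    # unlike CTC, which sums over all alignments of the transcription.
    return F.kl_div(log_p_anon, p_clean, reduction="batchmean")
```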
How to preserve the naturalness and timbre of anonymized audios?
Naturalness and timbre preservation are important to human audiences of anonymized audios. Signal processing and existing ASV adversarial examples do not consider naturalness, so the processed audios may sound mechanical. In addition, signal processing, voice conversion and voice synthesis all distort the timbre of the original speaker, such that the anonymized audio sounds unlike being spoken by the original speaker (e.g., a friend or a celebrity). To cope with this problem, we introduce a naturalness & timbre loss when training the anonymizer based on the psychoacoustic theory of masking effects. Our user study verifies that the anonymized audios of V-CLOAK receive high naturalness and timbre scores.
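As a rough illustration of how a masking-based loss can be shaped (the exact loss is detailed later in the paper; this form is our assumption), one can penalize only the perturbation energy that rises above a masking threshold derived from the original audio:

```python
import torch

def masking_loss(delta_spec_db: torch.Tensor,
                 mask_threshold_db: torch.Tensor) -> torch.Tensor:
    """Penalize perturbation energy that exceeds the masker's threshold.

    delta_spec_db:     (B, F, T) perturbation magnitude spectrogram in dB.
    mask_threshold_db: (B, F, T) masking threshold of the original audio in dB.
    Only the audible excess above the threshold is penalized, so the
    perturbation hides under the original signal's spectral maskers.
    """
    excess = torch.relu(delta_spec_db - mask_threshold_db)
    return excess.mean()
```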
We implement a fully-functional prototype of V-CLOAK and evaluate it with extensive experiments on five ASVs (anonymity) and eleven ASRs (intelligibility) with datasets of four languages (English, Chinese, French, Italian). The comparison with five baselines demonstrates that V-CLOAK achieves the best anonymization performance with the second-best intelligibility performance. Cross-language experiments show that the anonymizer of V-CLOAK trained on one ASV and one ASR can be transferred to other ASVs and ASRs (of different languages). A user study with 102 volunteers confirms the intelligibility-, naturalness- and timbre-preserving properties of V-CLOAK.
We summarize our main contributions as follows.

• We propose V-CLOAK, an intelligibility-, naturalness- and timbre-preserving voice anonymization system. V-CLOAK is proved and evaluated to fulfil the anonymization goals of unidentifiability and unlinkability against naive and adaptive adversaries.

• We develop a real-time anonymizer that transforms the original audio into targeted or untargeted anonymized audios. The anonymizer is trained with anonymity, intelligibility, naturalness and timbre losses, generalizing to any new original speaker or new target speaker without the need for re-training.

• We conduct extensive experiments to verify the effectiveness and efficiency of V-CLOAK under various testing conditions, and a user study to confirm the practicality and applicability of V-CLOAK.
2 Background
2.1 Voice Data
In the digital world, voice data of a user is massively generated and distributed for various purposes, e.g., communications via voice messages or video posts on social media. Such widely exposed voice data may be easily collected by service providers or third parties. For instance, Facebook collects audio data from voice messages on its social network platform, and even attempts to transcribe the content of these private messages [21]. TikTok revised its privacy policy to legitimize the collection of faceprints and voiceprints from videos uploaded by users, and even claimed the possibility of data sharing for business purposes [27].
Voice data contains two kinds of information, i.e., speech contents and phonetic features.

• Speech contents. Speech contents refer to the linguistic information contained in the voice data, i.e., "what are the words spoken." Speech contents determine the intelligibility of the voice data.

• Phonetic features. Phonetic features refer to the way the speech contents are conveyed in the voice data, i.e., "how are the words spoken." Phonetic features affect the timbre of the voice data.
Voiceprint is a phonetic feature that can uniquely identify a speaker. However, the voiceprint contained in voice data may be abused for identity inference or identity theft. On the one hand, voiceprint may be used by automatic speaker verification (ASV) systems to infer the identities of speakers in a private conversation. On the other hand, the voiceprint of a speaker may be extracted from audios to synthesize audios that pass voiceprint-based authentication systems. For example, WeChat, a popular messaging app in China, allows users to log in via voiceprint [3]. In the face of these potential privacy leakages, it is essential for users to anonymize voice data before sending voice messages or publishing videos on social media.
2.2 Psychoacoustics
Psychoacoustics is the study of the relationship between subjective psychological perceptions (e.g., perceived volume, pitch) and objective physical parameters (e.g., sound pressure level, frequency) [56]. The masking effect is one of the most common psychoacoustic phenomena [14]. There are two forms of masking: temporal and spectral. Temporal masking refers to the situation where a sound cannot be perceived if a sudden louder sound appears immediately preceding or following it. The louder sound is called the masker. Spectral masking refers to the imperceptibility of a sound component due to other frequency components played simultaneously. The perception threshold of this component varies with both the sound signals (e.g., frequency) and the listener. We leverage the spectral masking effect to make anonymization perturbations more imperceptible to human users.
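To make the idea concrete, here is a deliberately simplified sketch of a frame-wise spectral masking threshold. A faithful psychoacoustic model (e.g., the MPEG-1 psychoacoustic model used in related audio-attack work) spreads masker energy on the Bark scale, which we skip; the fixed 12 dB offset is our simplification, while the absolute-threshold curve is Terhardt's standard approximation.

```python
import numpy as np

def simple_masking_threshold_db(x: np.ndarray, sr: int = 16000,
                                n_fft: int = 512,
                                offset_db: float = 12.0) -> np.ndarray:
    """Very simplified per-frame masking threshold (illustration only).

    Returns an (F, T) threshold in dB for a mono signal x in [-1, 1]:
    each frame's own spectrum minus a fixed offset, floored by a crude
    absolute threshold of hearing.
    """
    hop = n_fft // 2
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(x[start:start + n_fft] * window))
        frames.append(20 * np.log10(spec + 1e-10))
    spec_db = np.stack(frames, axis=1)              # (F, T)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    f_khz = np.maximum(freqs / 1000.0, 1e-3)
    # Terhardt's approximation of the absolute threshold of hearing.
    ath = (3.64 * f_khz ** -0.8
           - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
           + 1e-3 * f_khz ** 4)
    return np.maximum(spec_db - offset_db, ath[:, None])
```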
2.3 Automatic Speaker Verification & Automatic Speech Recognition
An Automatic Speaker Verification (ASV) system aims to deduce the speakers of audios based on their voiceprints. Speaker inference via an ASV system includes an enrollment phase and an inference phase. In the enrollment phase, clean audio samples of the speaker to be recognized are fed into the ASV such that the voiceprint can be extracted and stored in the ASV. In the inference phase, the ASV takes an audio sample as input and outputs whether the input audio belongs to the enrolled speaker. There are two mainstream methods of extracting and matching voiceprints, i.e., statistical models and Deep Neural Network (DNN)-based models. The Gaussian mixture model (GMM) is a traditional statistical model for extracting ivector voiceprints. ivector-PLDA is a popular ASV implementation that matches ivector voiceprints via probabilistic linear discriminant analysis (PLDA). X-vector is a DNN-based voiceprint extractor, which outperforms GMM as DNNs are more effective at extracting feature representations from large-scale voice datasets. ECAPA-TDNN [9] is the state-of-the-art ASV implementation using end-to-end training, i.e., training the front-end and the back-end jointly as an integrated network [53].
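The enroll-then-verify protocol reduces to embedding extraction plus similarity scoring. A minimal sketch follows, with extract_voiceprint standing in for any extractor (ivector, x-vector, ECAPA-TDNN) and an arbitrary decision threshold of our choosing:

```python
import torch
import torch.nn.functional as F

def enroll(extract_voiceprint, enroll_wavs) -> torch.Tensor:
    """Average the voiceprints of a speaker's clean enrollment utterances.

    extract_voiceprint: any function mapping a waveform to a 1-D embedding.
    """
    vps = torch.stack([extract_voiceprint(w) for w in enroll_wavs])
    return F.normalize(vps.mean(dim=0), dim=-1)

def verify(extract_voiceprint, test_wav, enrolled_vp,
           threshold: float = 0.25) -> bool:
    """Accept if the cosine score against the enrolled voiceprint passes
    a threshold (0.25 here is arbitrary; real systems tune it on held-out data)."""
    vp = F.normalize(extract_voiceprint(test_wav), dim=-1)
    score = torch.dot(vp, enrolled_vp).item()  # S(V(x_test), V(x_enroll))
    return score >= threshold
```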
An Automatic Speech Recognition (ASR) system aims to transcribe the speech contents from audio samples (without the need to know the speaker). In the training process, audios are first transformed into a sequence of spectral frames. Each frame is then transformed into a feature vector. Commonly used features include Filter Bank (FBank) [42], Mel-Frequency Cepstral Coefficients (MFCC) [29], Spectral Subband Centroid (SSC) [50] and Perceptual Linear Predictive (PLP) [16]. Then, the posterior probability of the lingueme (e.g., phoneme, grapheme, or word) contained in each frame is estimated. The linguemes are usually represented as tokens. For example, 29 tokens are used for the English language, i.e., the letters a–z, space, apostrophe and the special blank token φ. Next, the Connectionist Temporal Classification (CTC) module sums the probability of all possible alignments that reduce to the ground-truth sequence. For example, the three-frame sequences [a b φ], [a φ b] and [φ a b] will all be reduced to the ground-truth sequence [a b]. Finally, the model is updated to increase the probability of producing the ground-truth sequence. In the inference phase, a language model may be used to provide a prior probability to find the lingueme sequence of the highest probability.
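The collapse rule behind CTC is easy to state in code. A short self-contained sketch reproducing the [a b] example above (the φ symbol stands for the blank token):

```python
def ctc_collapse(tokens, blank="φ"):
    """Reduce a frame-level token sequence: merge repeats, then drop blanks."""
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# All three alignments from the text reduce to the ground truth [a, b]:
for seq in (["a", "b", "φ"], ["a", "φ", "b"], ["φ", "a", "b"]):
    assert ctc_collapse(seq) == ["a", "b"]

# Training maximizes the summed probability of all such alignments, e.g. via
# torch.nn.CTCLoss(blank=0) applied to per-frame log-probabilities.
```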
2.4 Voice Anonymization

Voice anonymization refers to the practice of removing the voiceprint from voice data. A voice anonymization system needs to satisfy various requirements to fulfil different purposes. Regarding digital voice data privacy, we aim to achieve the following performance goals.

Anonymity. An anonymized audio should not reveal the identity of the speaker. More specifically, we consider concealing speaker identities in the digital domain from ASVs.
Intelligibility. An anonymized audio should be intelligible to both humans and ASRs. More specifically, the speech contents of the anonymized audio can be correctly understood by humans and transcribed by ASRs.

Naturalness & Timbre. An anonymized audio should sound natural and retain the timbre of the original speaker to humans. Studies show that most people find highly mechanical audios irritating and discomforting to listen to, so natural-sounding anonymized audios are more user-friendly [49]. For voice messages and video posts on social media, it is ideal to make the anonymized audios sound as authentic as the original speaker to audiences, especially for communications between acquaintances and the publicity of celebrities.

Figure 2: Threat model. A1: an ignorant adversary who enrolls clean audios into the ASV and feeds the anonymized audio into the ASV to infer the speaker. A2: a semi-informed adversary who enrolls clean audios into the ASV and feeds the de-noised anonymized audio into the ASV to infer the speaker. A3: an informed adversary who enrolls anonymizer-processed audios into the ASV and feeds the anonymized audio into the ASV to infer the speaker.
Voice anonymization can be realized in various ways, as summarized in Table 1.

Voice signal processing. Signal processing techniques attempt to contort the voiceprint by directly modifying the voice signals in terms of formant positions, pitch, tempo, or pauses [33,52]. Though simple and fast, signal processing may degrade the intelligibility and naturalness of audios.

Voice conversion & synthesis. Voice conversion & synthesis techniques aim to replace the voiceprint of the original speaker in an audio with the voiceprint of another speaker [13,15,20,28,37,38,47,55]. Voice conversion & synthesis preserve intelligibility and naturalness, but alter the timbre so that the audio sounds unlike the original speaker. This may reduce the authenticity of voice messages to acquaintances and video posts of celebrities.
Voice adversarial examples. Adversarial example attacks against ASVs add imperceptible noises to audios such that the ASV cannot recognize the speaker [7,10,25,26,54,57]. Adversarial perturbations can be generated in two ways (a sketch contrasting them follows this list).

• Iterative optimization. Optimization-based methods formulate the problem of adversarial perturbation generation as a constrained optimization problem [7,10,25,26,57]. As the formulated optimization problems are usually NP-hard, the solutions can only be approximated through iterative updates, which is quite time-consuming. Therefore, iterative optimization methods cannot be applied to real-time services.

• One-shot generative model. Generative models can be trained to produce adversarial perturbations in one shot. Commonly used generative models include Generative Adversarial Networks (GANs) and autoencoders [34,35,46]. As far as we know, there is only one study on generative model-based adversarial examples against ASV, named FAPG [54]. However, FAPG mainly focuses on deceiving ASVs rather than preserving the intelligibility and naturalness of audios.
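To make the efficiency contrast concrete, here is a hedged sketch of both styles. asv_score (a differentiable similarity against the enrolled voiceprint) and the generator G are placeholders of ours, and the PGD-style loop is a generic iterative attack, not any specific cited method:

```python
import torch

def iterative_perturbation(x, asv_score, enrolled_vp,
                           eps=0.05, steps=100, lr=1e-3):
    """PGD-style untargeted perturbation: many forward/backward passes per audio."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):                        # cost grows with `steps`
        loss = asv_score(x + delta, enrolled_vp)  # push the ASV score down
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()
            delta.clamp_(-eps, eps)               # keep the l_inf budget
        delta.grad.zero_()
    return (x + delta).detach().clamp(-1, 1)

def one_shot_perturbation(x, G):
    """Generator-based anonymization: a single forward pass, usable in real time."""
    with torch.no_grad():
        return G(x).clamp(-1, 1)
```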
2.5 Threat Model

We define the threat model in terms of the adversary's knowledge and capability, and then elaborate the performance goals of voice anonymization under the defined threat model. As shown in Figure 2, we consider three kinds of adversaries, i.e., ignorant (A1), semi-informed (A2), and informed (A3), with different knowledge and capabilities.

Knowledge. The adversary has an anonymized audio whose speaker is unknown. The adversary has collected a few clean samples of a pool of potential speakers to help with identity inference. Adversary A1 does not know that the audio is anonymized. Adversary A2 knows that the audio is anonymized but does not know the specific anonymizer. Adversary A3 has full knowledge of the anonymizer.

Capability. Adversaries A1, A2, and A3 can use any ASVs to infer the speaker of the anonymized audio. As shown in Figure 2, A1 and A2 enroll potential speakers in the ASV using clean audios, and A3 enrolls potential speakers in the ASV using audio samples processed by the anonymizer. In the inference phase, A1 directly feeds the anonymized audio into the ASV; A2 applies de-noising methods to the anonymized audio and feeds the de-noised audio into the ASV; A3 also directly feeds the anonymized audio into the ASV.

In the face of the knowledge and capability of adversaries, we further elaborate the goal of achieving anonymity with regard to different types of adversaries. More specifically, in the case of ignorant and semi-informed adversaries, the speaker of the anonymized audio should be unidentifiable, and in the case of informed adversaries, the speaker and the anonymized audio should be unlinkable.

Unidentifiability. For A1 and A2, who enroll clean voiceprints into the ASV, the speaker of an anonymized audio should not be identified during the inference phase.

Unlinkability. For adversary A3, who enrolls anonymizer-processed voiceprints into the ASV, the speaker of an anonymized audio should be indistinguishable from other speakers.
3 Problem Formulation
Before delving into the design details of V-CLOAK, in this section we formally formulate voice anonymization as a constrained optimization problem.
Given an audio sample $x = [x_1, \cdots, x_D] \in \mathbb{R}^{1 \times D}$, where $\mathbb{R}^{1 \times D}$ is the $D$-dimensional real number field and $D$ is the length of the audio, we assume $x_i \in [-1, 1]$ without loss of generality. We aim to obtain an anonymized audio $\tilde{x}$ such that the ASV cannot match the voiceprint of $\tilde{x}$ with that of $x$. Let $V: \mathbb{R}^{1 \times \cdot} \rightarrow \mathbb{R}^{1 \times N}$ denote the voiceprint extraction function that outputs a voiceprint of a fixed length $N$, and let $G: \mathbb{R}^{1 \times D} \rightarrow \mathbb{R}^{1 \times D}$ denote the anonymizer function.

Basic Formulation:

$$
\begin{aligned}
\min_{G} \quad & \mathcal{L}_{\mathrm{ASV}} \\
\text{s.t.} \quad & \|\tilde{x} - x\|_{\infty} \le \varepsilon \ \text{ and } \ x, \tilde{x} \in [-1, 1], \\
\text{where} \quad
& \mathcal{L}_{\mathrm{ASV}} =
\begin{cases}
S(V(\tilde{x}), V(x)), & \text{untargeted anonymization}, \\
-S(V(\tilde{x}), v), & \text{targeted anonymization},
\end{cases} \\
& \tilde{x} =
\begin{cases}
G(x), & \text{untargeted anonymization}, \\
G(x, v), & \text{targeted anonymization},
\end{cases}
\end{aligned}
\tag{1}
$$

where $\varepsilon$ constrains the $\ell_\infty$-norm difference between $x$ and $\tilde{x}$, $S(\cdot, \cdot)$ is the scoring function measuring the similarity between the voiceprints of $x$ and $\tilde{x}$, and $v$ is the voiceprint of a speaker other than that of $x$. With untargeted anonymization, the voiceprint of the anonymized audio is diverted from that of the original audio as much as possible, which guarantees unidentifiability, i.e., the voiceprint of the anonymized audio will not match the voiceprint of the original audio. With targeted anonymization, the voiceprints of two anonymized audios with different original speakers but the same target speaker $v$ will both be matched with $v$ (and thus be matched with each other), which guarantees both unidentifiability and unlinkability. We theoretically analyze the unidentifiability and unlinkability of targeted and untargeted anonymization in Appendix A and perform corresponding evaluations in §5.
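A direct transcription of $\mathcal{L}_{\mathrm{ASV}}$ from Equation (1) into PyTorch might look as follows; taking $S$ to be cosine similarity is our assumption, since the formulation leaves the scoring function abstract:

```python
import torch
import torch.nn.functional as F

def anonymity_loss(V, x_anon, x_orig=None, v_target=None):
    """L_ASV from Equation (1). V maps a batch of audios to (B, N) voiceprints.

    Untargeted: minimize  S(V(x_anon), V(x_orig)), pushing voiceprints apart.
    Targeted:   minimize -S(V(x_anon), v_target), pulling toward the target.
    S is taken to be cosine similarity (our assumption).
    """
    vp_anon = V(x_anon)
    if v_target is not None:          # targeted anonymization
        return -F.cosine_similarity(vp_anon, v_target, dim=-1).mean()
    vp_orig = V(x_orig).detach()      # original voiceprint is a fixed reference
    return F.cosine_similarity(vp_anon, vp_orig, dim=-1).mean()
```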
The anonymized audio obtained by Equation (1) satisfies the basic goal of anonymity but may suffer from quality degradation in terms of intelligibility, naturalness and timbre. To tackle this problem, we equip the basic optimization problem with loss terms that address the performance goals of intelligibility, naturalness and timbre preservation. More specifically, we introduce an ASR-related loss term, which maintains the intelligibility of the anonymized audios.