FULL-BAND GENERAL AUDIO SYNTHESIS WITH SCORE-BASED DIFFUSION
Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, Joan Serrà
Dolby Laboratories
ABSTRACT
Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they typically comprise pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40 and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good quality synthesis.
Index Terms: Deep generative models, audio synthesis, full-band audio, score-based diffusion, Fréchet distance.
1. INTRODUCTION
Audio synthesis is the computerized generation of audio signals. This task has been primarily tackled under a source-specific paradigm, hence modeling a single type of audio per model. Examples of this paradigm are text-to-speech [1], music synthesis [2-4], or less commonly modeled sources like footsteps [5] or laughter [6], among others. Nonetheless, a few recent works propose a source-agnostic paradigm, in which generative models can synthesize different types of audio with appropriate conditioning injected into a single neural network. We refer to this as general audio synthesis.
In this context, Kong et al. [7] propose to model environmental sounds with a class-conditioned SampleRNN. This is a deep autoregressive generative model that operates in the time domain. On the other hand, Liu et al. [8] propose another autoregressive model exploiting a VQ-VAE-2 encoder-decoder strategy [9]. In this setup, a VQ-VAE is trained to create a codebook of audio features extracted from melspectrograms. Hence, the model incorporates a downsampling and upsampling of the original signal space into a discretized latent. Then, a class-conditioned PixelSNAIL autoregressive model [10] is built as a language model of these discrete latent token sequences. An advantage of this cascaded design is that the expensive autoregressive computation can be alleviated while working with a lower time resolution in the latent, thanks to its auto-encoding design. Since the VQ-VAE operates on melspectrograms, a HifiGAN [11] trained with general audio content is plugged on top of the auto-encoder reconstructions to convert them into waveforms.
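To make this pipeline concrete, the following is a minimal Python sketch of such a cascade at sampling time. All module names and interfaces here (vqvae, pixelsnail, hifigan, and their methods) are hypothetical placeholders for the corresponding pre-trained components, not the actual APIs of [8, 10, 11].

# Hypothetical three-stage cascade: an autoregressive prior over discrete
# tokens, a melspectrogram VQ-VAE decoder, and a neural vocoder.
def cascaded_generate(class_label, vqvae, pixelsnail, hifigan, num_latent_steps):
    # 1) Autoregressively sample discrete codebook indices, conditioned on
    #    the target sound class (the expensive sequential step, but run at
    #    the low latent rate rather than the audio sampling rate).
    tokens = pixelsnail.sample(cond=class_label, steps=num_latent_steps)
    # 2) Decode the token sequence into a melspectrogram with the VQ-VAE
    #    decoder, upsampling back from the discretized latent.
    mel = vqvae.decode(tokens)
    # 3) Convert the melspectrogram into a waveform with the vocoder.
    return hifigan(mel)

Note how the waveform is never modeled directly: any band limitation in the melspectrogram front-end or the vocoder propagates to the output.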
Other works concurrent to ours tackle more specialized synthesis tasks within the source-agnostic paradigm, like text-to-audio synthesis (that is, generating audio that is representative of a given natural language description). This way, following the cascaded framework of Liu et al., Yang et al. [12] propose DiffSound, a text-conditioned diffusion probabilistic model that replaces the autoregressive PixelSNAIL and improves generation quality and speed. In addition, Kreuk et al. [13] propose AudioGen, another cascaded model for text-to-audio in which a discrete latent is learned from raw waveforms and an attention-based autoregressive decoder generates the discrete latent stream. Note that these cascaded and/or autoregressive designs impose a limitation on the bandwidth of the generated signal. For instance, a component in the cascade can be a band-limited bottleneck. Alternatively, the latent discretization imposed by autoregressive modeling, or the autoregressive design in the raw signal space itself, can lead to sampling rate limitations: the former keeps compression quality loss in check, while the latter avoids the high computational cost of generating very long sequences.
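To quantify this last point (the 50 Hz latent rate below is a hypothetical figure chosen for illustration, not taken from the cited works), the number of sequential steps needed to generate $T$ seconds of audio at rate $f$ is $L = f \cdot T$, so

$$L_{\mathrm{raw}} = 48\,000\,\mathrm{Hz} \times 1\,\mathrm{s} = 48\,000 \quad \text{vs.} \quad L_{\mathrm{latent}} = 50\,\mathrm{Hz} \times 1\,\mathrm{s} = 50,$$

that is, roughly three orders of magnitude fewer autoregressive steps when operating in a low-rate latent, which explains why raw-waveform autoregression is usually confined to lower sampling rates.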
Despite various advancements in general audio synthesis, we argue that state-of-the-art methods are limited by (i) targeting audio content below 11 kHz bandwidth, (ii) reusing previous (and sometimes pre-trained for a different task) modules in a complex cascaded framework, and (iii) quantizing latents with potential quality loss. In this work, we propose the diffusion audio generator (DAG), a full-band, end-to-end, source-agnostic waveform synthesizer. While previous works are band-limited due to modeling constraints, DAG is built upon a lossless auto-encoder that can directly generate 48 kHz waveforms (24 kHz bandwidth). In addition, its end-to-end design makes it simpler to train and use, avoiding intermediate information bottlenecks or possible cumulative errors from one module to the next. Furthermore, DAG is built upon a score-based diffusion generative paradigm [14, 15], which has shown great performance in related fields like speech synthesis [16, 17], universal speech enhancement [18], or source-specific audio synthesis [3]. Our results show that DAG is capable of generating higher quality content while being a simpler and more parameter-efficient approach. This is evidenced by relative improvements of up to 40 and 65% with respect to the state of the art, depending on whether band-limited or full-band signals are generated. Besides the empirical results reported here, we also provide additional audio samples to showcase some of the possibilities that our solution offers and to highlight its potential.
2. DIFFUSION AUDIO GENERATOR
We propose a deep generative audio model based on score matching with variance-exploding diffusion [14, 15]. The generator, which follows the denoising score matching strategy from the speech enhancement work [18], is trained to minimize

$$\mathcal{L}_{\mathrm{SCORE}} = \mathbb{E}_{t}\,\mathbb{E}_{z_t}\,\mathbb{E}_{x_0}\!\left[ \frac{1}{2} \left\| \sigma_t\,\tilde{S}(x_0 + \sigma_t z_t,\, c,\, \sigma_t) + z_t \right\|_2^2 \right],$$

where $t \sim \mathcal{U}(0,1)$, $z_t \sim \mathcal{N}(0, \mathbf{I})$, $x_0 \sim p_{\mathrm{data}}$, $\tilde{S}$ is the generated score, $c$ is the conditioning signal, and the $\sigma_t$ values follow a geometric noise schedule.
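As a minimal illustration of this objective (a sketch, not the paper's implementation), the following PyTorch snippet assumes a hypothetical score network score_net(x_t, c, sigma_t) and illustrative schedule bounds sigma_min and sigma_max:

import torch

def score_matching_loss(score_net, x0, c, sigma_min=1e-4, sigma_max=1.0):
    # t ~ U(0, 1): one diffusion time per batch element.
    t = torch.rand(x0.shape[0], device=x0.device)
    # Geometric (variance-exploding) schedule:
    # sigma_t = sigma_min * (sigma_max / sigma_min) ** t.
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t
    sigma_t = sigma_t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over waveform dims
    # z_t ~ N(0, I): corrupt the clean waveform x0 at noise level sigma_t.
    z_t = torch.randn_like(x0)
    x_t = x0 + sigma_t * z_t
    # The scaled predicted score should cancel the injected noise z_t.
    s_tilde = score_net(x_t, c, sigma_t)
    residual = sigma_t * s_tilde + z_t
    return 0.5 * residual.pow(2).flatten(1).sum(dim=1).mean()

Per example, the loss is minimized when $\sigma_t \tilde{S} \approx -z_t$, i.e., when the scaled score points against the injected noise.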