FULL-BAND GENERAL AUDIO SYNTHESIS WITH SCORE-BASED DIFFUSION
Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, Joan Serrà
Dolby Laboratories
ABSTRACT
Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they typically comprise pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40 and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good quality synthesis.
Index Terms: Deep generative models, audio synthesis, full-band audio, score-based diffusion, Fréchet distance.
1. INTRODUCTION
Audio synthesis is the computerized generation of audio signals. This task has been primarily tackled under a source-specific paradigm, hence modeling a single type of audio per model. Examples of this paradigm are text-to-speech [1], music synthesis [2-4], or less commonly modeled sources like footsteps [5] or laughter [6], among others. Nonetheless, a few recent works propose a source-agnostic paradigm, in which generative models can synthesize different types of audio with appropriate conditioning injected into a single neural network. We refer to this as general audio synthesis.
In this context, Kong et al. [7] propose to model environmental sounds with a class-conditioned SampleRNN. This is a deep autoregressive generative model that operates in the time domain. On the other hand, Liu et al. [8] propose another autoregressive model exploiting a VQ-VAE-2 encoder-decoder strategy [9]. In this setup, a VQ-VAE is trained to create a codebook of audio features extracted from melspectrograms. Hence, the model incorporates a downsampling and upsampling of the original signal space into a discretized latent. Then, a class-conditioned PixelSNAIL autoregressive model [10] is built as a language model of these discrete latent token sequences. An advantage of this cascaded design is that the expensive autoregressive computation can be alleviated while working with a lower time resolution in the latent, thanks to its auto-encoding design. Since the VQ-VAE operates on melspectrograms, a HifiGAN [11] trained with general audio content is plugged on top of the auto-encoder reconstructions to convert them into waveforms.
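To make this pipeline concrete, the following is a minimal Python sketch of such a cascade at sampling time. All module names and interfaces here (vqvae, pixelsnail, hifigan, and their methods) are hypothetical placeholders for the corresponding pre-trained components, not the actual APIs of [8, 10, 11].

# Hypothetical three-stage cascade: an autoregressive prior over discrete
# tokens, a melspectrogram VQ-VAE decoder, and a neural vocoder.
def cascaded_generate(class_label, vqvae, pixelsnail, hifigan, num_latent_steps):
    # 1) Autoregressively sample discrete codebook indices, conditioned on
    #    the target sound class (the expensive sequential step, but run at
    #    the low latent rate rather than the audio sampling rate).
    tokens = pixelsnail.sample(cond=class_label, steps=num_latent_steps)
    # 2) Decode the token sequence into a melspectrogram with the VQ-VAE
    #    decoder, upsampling back from the discretized latent.
    mel = vqvae.decode(tokens)
    # 3) Convert the melspectrogram into a waveform with the vocoder.
    return hifigan(mel)

Note how the waveform is never modeled directly: any band limitation in the melspectrogram front-end or the vocoder propagates to the output.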
Other works concurrent to ours tackle more specialized synthesis tasks within the source-agnostic paradigm, like text-to-audio synthesis (that is, generating audio that is representative of a given natural language description). This way, following the cascaded framework of Liu et al., Yang et al. [12] propose DiffSound, a text-conditioned diffusion probabilistic model that replaces the autoregressive PixelSNAIL and improves generation quality and speed. In addition, Kreuk et al. [13] propose AudioGen, another cascaded model for text-to-audio in which a discrete latent is learned from raw waveforms and an attention-based autoregressive decoder generates the discrete latent stream. Note that these cascaded and/or autoregressive designs impose a limitation on the bandwidth of the generated signal. For instance, a component in the cascade can be a band-limited bottleneck. Alternatively, the latent discretization imposed by autoregressive modeling, or the autoregressive design in the raw signal space itself, can lead to sampling rate limitations: the former keeps compression quality loss in check, while the latter avoids the high computational cost of generating very long sequences.
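To quantify this last point (the 50 Hz latent rate below is a hypothetical figure chosen for illustration, not taken from the cited works), the number of sequential steps needed to generate $T$ seconds of audio at rate $f$ is $L = f \cdot T$, so

$$L_{\mathrm{raw}} = 48\,000\,\mathrm{Hz} \times 1\,\mathrm{s} = 48\,000 \quad \text{vs.} \quad L_{\mathrm{latent}} = 50\,\mathrm{Hz} \times 1\,\mathrm{s} = 50,$$

that is, roughly three orders of magnitude fewer autoregressive steps when operating in a low-rate latent, which explains why raw-waveform autoregression is usually confined to lower sampling rates.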
Despite various advancements in general audio synthesis, we argue that state-of-the-art methods are limited by (i) targeting audio content below 11 kHz bandwidth, (ii) reusing previous (and sometimes pre-trained for a different task) modules in a complex cascaded framework, and (iii) quantizing latents with potential quality loss. In this work, we propose the diffusion audio generator (DAG), a full-band, end-to-end, source-agnostic waveform synthesizer. While previous works are band-limited due to modeling constraints, DAG is built upon a lossless auto-encoder that can directly generate 48 kHz waveforms (24 kHz bandwidth). In addition, its end-to-end design makes it simpler to train and use, avoiding intermediate information bottlenecks or possible cumulative errors from one module to the next. Furthermore, DAG is built upon a score-based diffusion generative paradigm [14, 15], which has shown great performance in related fields like speech synthesis [16, 17], universal speech enhancement [18], or source-specific audio synthesis [3]. Our results show that DAG is capable of generating higher quality content while being a simpler and more parameter-efficient approach. This is evidenced by relative improvements of up to 40 and 65% with respect to the state of the art, depending on whether band-limited or full-band signals are generated. Besides the empirical results reported here, we also provide additional audio samples to showcase some of the possibilities that our solution offers and to highlight its potential.
2. DIFFUSION AUDIO GENERATOR
We propose a deep generative audio model based on score matching with variance-exploding diffusion [14, 15]. The generator, which follows the denoising score matching strategy from the speech enhancement work [18], is trained to minimize

$$\mathcal{L}_{\mathrm{SCORE}} = \mathbb{E}_{t}\,\mathbb{E}_{z_t}\,\mathbb{E}_{x_0}\!\left[ \frac{1}{2} \left\| \sigma_t\,\tilde{S}(x_0 + \sigma_t z_t,\, c,\, \sigma_t) + z_t \right\|_2^2 \right],$$

where $t \sim \mathcal{U}(0,1)$, $z_t \sim \mathcal{N}(0, \mathbf{I})$, $x_0 \sim p_{\mathrm{data}}$, $\tilde{S}$ is the generated score, $c$ is the conditioning signal, and the $\sigma_t$ values follow a geometric noise schedule.
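As a minimal illustration of this objective (a sketch, not the paper's implementation), the following PyTorch snippet assumes a hypothetical score network score_net(x_t, c, sigma_t) and illustrative schedule bounds sigma_min and sigma_max:

import torch

def score_matching_loss(score_net, x0, c, sigma_min=1e-4, sigma_max=1.0):
    # t ~ U(0, 1): one diffusion time per batch element.
    t = torch.rand(x0.shape[0], device=x0.device)
    # Geometric (variance-exploding) schedule:
    # sigma_t = sigma_min * (sigma_max / sigma_min) ** t.
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t
    sigma_t = sigma_t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over waveform dims
    # z_t ~ N(0, I): corrupt the clean waveform x0 at noise level sigma_t.
    z_t = torch.randn_like(x0)
    x_t = x0 + sigma_t * z_t
    # The scaled predicted score should cancel the injected noise z_t.
    s_tilde = score_net(x_t, c, sigma_t)
    residual = sigma_t * s_tilde + z_t
    return 0.5 * residual.pow(2).flatten(1).sum(dim=1).mean()

Per example, the loss is minimized when $\sigma_t \tilde{S} \approx -z_t$, i.e., when the scaled score points against the injected noise.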