SOLVING AUDIO INVERSE PROBLEMS WITH A DIFFUSION MODEL
Eloi Moliner1Jaakko Lehtinen2Vesa Välimäki1
1Acoustics Lab, Dept. of Information & Communications Eng. 2Dept. of Computer Science
Aalto University, Espoo, Finland
eloi.moliner@aalto.fi
ABSTRACT
This paper presents CQT-Diff, a data-driven generative audio model
that can, once trained, be used for solving various audio inverse
problems in a problem-agnostic setting. CQT-Diff is a
neural diffusion model with an architecture that is carefully con-
structed to exploit pitch-equivariant symmetries in music. This is
achieved by preconditioning the model with an invertible Constant-
Q Transform (CQT), whose logarithmically-spaced frequency axis
represents pitch equivariance as translation equivariance. The pro-
posed method is evaluated with solo piano music, using objective
and subjective metrics in three different and varied tasks: audio
bandwidth extension, inpainting, and declipping. The results show
that CQT-Diff outperforms the compared baselines and ablations in
audio bandwidth extension and, without retraining, delivers compet-
itive performance against modern baselines in audio inpainting and
declipping. This work represents the first diffusion-based general
framework for solving inverse problems in audio processing.
Index Terms— Audio systems, deep learning, inverse problems,
signal restoration.
1. INTRODUCTION
Audio restoration tasks, such as bandwidth extension, inpainting,
and declipping, are inverse problems that aim to restore a signal
from observations that have suffered a known type of degradation.
These problems are ill-posed in the sense that several different
restorations may be equally plausible. Algorithms are typically
constructed for a particular type of degradation using domain knowl-
edge, such as signal sparsity [1, 2], or the low-rankness of the power
spectrum [3]. Data-driven generative models have also been recently
proposed [4, 5]. A common shortcoming is that an algorithm engi-
neered for one type of degradation is typically not useful for others.
In this work, we explore whether a single pre-trained generative
audio model could be useful for different restoration tasks without
knowledge of which types of degradations it will be applied to at in-
ference time. We base our work on diffusion models [6, 7], a family
of generative models that have shown outstanding performance in
different modalities, such as image [8], video [9], speech [10, 11],
and audio generation [12, 13]. Particularly relevant to us is that un-
conditional (purely generative) diffusion models have proven useful
also in conditional restoration settings where parts of the result are
pre-determined, particularly in the image domain [14, 15, 16]. Cru-
cially, as these approaches do not require paired data for training
and the degradation model is only used during inference, the models
adapt to a new problem without retraining. To the best of our
knowledge, analogous methods have not been employed in audio
restoration. This paper addresses this research gap.
(This research is part of the activities of the Nordic Sound and Music
Computing Network—NordicSMC, NordForsk project no. 86892.)
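The inference-time reuse described above can be sketched with a toy annealed Langevin sampler that augments an unconditional score with a data-consistency gradient. This is a generic reconstruction-guidance sketch, not the paper's exact algorithm: `degrade`, `score_fn`, and the noise schedule are illustrative stand-ins, and a trained diffusion model would replace `score_fn`.

```python
import numpy as np

def degrade(x):
    """Hypothetical linear degradation: zero the upper half of the
    spectrum (a crude low-pass standing in for bandwidth reduction)."""
    X = np.fft.rfft(x)
    X[X.shape[0] // 2:] = 0.0
    return np.fft.irfft(X, n=x.shape[0])

def score_fn(x, sigma):
    """Placeholder score: the exact score of x ~ N(0, (1 + sigma^2) I).
    A trained unconditional diffusion model would replace this."""
    return -x / (1.0 + sigma ** 2)

def guided_restore(y, sigmas, inner_steps=5, weight=1.0, rng=None):
    """Annealed Langevin sampling with a data-consistency term.

    At each noise level the unconditional score is combined with the
    gradient of ||degrade(x) - y||^2, so the degradation model enters
    only at inference time and the score model needs no retraining.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.normal(0.0, sigmas[0], size=y.shape)
    for sigma in sigmas:
        step = 0.5 * sigma ** 2
        for _ in range(inner_steps):
            # degrade() is a self-adjoint projection here, so the
            # gradient of 0.5 * ||degrade(x) - y||^2 with respect to x
            # is degrade(degrade(x) - y).
            data_grad = degrade(degrade(x) - y)
            x = (x + step * score_fn(x, sigma)
                   - weight * data_grad
                   + np.sqrt(2.0 * step) * rng.normal(size=x.shape))
    return x
```

Because the observation `y` and the operator `degrade` appear only in the sampling loop, swapping in a different degradation (masking for inpainting, clipping for declipping) requires no change to the score model.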
Training an unconditional audio diffusion model is the first step
towards this objective. We argue that this is overly challenging unless
suitable inductive biases are built into the model. While unconditional
audio generation has been demonstrated using a raw au-
dio waveform representation, the experiments are limited to small
speech datasets [10, 17], or non-tonal audio [12, 18], with perfor-
mance expected to decrease for more challenging datasets. Build-
ing diffusion models using mel-spectrograms or magnitude spectro-
grams [13] leads to a more expressive generative performance. How-
ever, the lack of invertibility of these transforms blocks the required
conditioning for solving inverse problems. We design our diffusion
model in the time domain, allowing for maximum flexibility, and use
an invertible time-frequency transform as a preprocessing step.
Choosing the right transform plays a critical role in the perfor-
mance of the model. One option is to use a Short-Time Fourier
Transform (STFT) and apply a neural network based on 2-D con-
volutions, adapted from image generation. While this approach suc-
ceeded for speech [19], we argue that, contrary to image processing,
it is non-ideal for signals with strong harmonics (music), as trans-
lation equivariance of harmonics does not hold true in linear fre-
quency. The Constant-Q Transform (CQT), in which the frequency
axis is logarithmically spaced, is a possible solution to this problem.
Since pitch-shifting corresponds to translation in the CQT spectro-
gram, applying the convolutional operator is now appropriate. The
CQT is widely used in Music Information Retrieval [20], but, despite
some exceptions [21], is not common in audio generation tasks.
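The pitch-equivariance argument can be checked numerically: on a logarithmic frequency axis, multiplying every harmonic by a common ratio (a pitch shift) becomes adding a constant number of bins, i.e., a pure translation. The constants below (`f_min`, bins per octave) are illustrative choices, not the paper's settings.

```python
import numpy as np

# Log-spaced center frequencies, as in the CQT: with B bins per
# octave, bin k sits at f_min * 2**(k / B).
bins_per_octave = 12            # assumption: semitone resolution
f_min = 32.70                   # assumption: C1 in Hz
n_bins = 8 * bins_per_octave
freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

# Harmonics of a note 7 semitones above f_min. Pitch-shifting by s
# semitones multiplies every harmonic frequency by 2**(s / 12); in
# log2-frequency this is an additive constant.
f0 = f_min * 2.0 ** (7 / 12)
harmonics = f0 * np.arange(1, 6)
shift_semitones = 3
shifted = harmonics * 2.0 ** (shift_semitones / 12)

def bin_of(f):
    """Fractional CQT bin index of frequency f."""
    return np.log2(f / f_min) * bins_per_octave

# Every harmonic moves by exactly the same number of CQT bins,
# whereas on a linear axis the harmonic spacing would be rescaled.
delta = bin_of(shifted) - bin_of(harmonics)
assert np.allclose(delta, shift_semitones)
```

This uniform shift of all partials is what makes weight-shared convolutions along the CQT frequency axis appropriate for harmonic signals.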
We first show that CQT preconditioning yields a high-quality
diffusion model for audio. We present a versatile framework called
CQT-Diff for solving audio restoration problems, from which we
evaluate three: bandwidth extension, audio inpainting, and declip-
ping. Pitted against specialized algorithms, our model yields state-of-the-art
or strongly competitive performance in listening tests.
2. CQT-DIFF FRAMEWORK
2.1. Score-based Generative Modeling
Consider a diffusion process where data x₀ ∼ p_data is progressively
diffused into Gaussian noise x_τmax ∼ N(0, σ²_max I) over time¹ τ by
applying a perturbation kernel p_τ(x_τ) = N(x_τ; x₀, σ²_τ I). Diffusion
models, also known as score-based models, generate data samples by
reversing the aforementioned diffusion process. Specifically, diffusion
models estimate the gradient of the log probability density with respect
to the data, ∇_{x_τ} log p_τ(x_τ), known as the score function.
¹The “diffusion time” τ must not be confused with the “audio time” t.
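For the Gaussian perturbation kernel above, the conditional score has the closed form −(x_τ − x₀)/σ²_τ, which is the regression target used in denoising score matching. A minimal numerical sketch (x₀ here is synthetic data standing in for audio):

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.normal(size=1000)       # stand-in for clean data samples
sigma_tau = 0.7                  # noise level at diffusion time tau

# Perturbation kernel: x_tau ~ N(x0, sigma_tau^2 I)
x_tau = x0 + sigma_tau * rng.normal(size=x0.shape)

# Closed-form conditional score of the Gaussian kernel; a score
# network is trained to match this quantity from x_tau alone.
score = -(x_tau - x0) / sigma_tau ** 2

# Equivalently, one variance-scaled step along the score maps the
# noisy sample back to the clean one.
denoised = x_tau + sigma_tau ** 2 * score
assert np.allclose(denoised, x0)
```

The equivalence in the last two lines is why score estimation and denoising are interchangeable views of the same model.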
arXiv:2210.15228v3 [eess.AS] 18 Mar 2023