
SOLVING AUDIO INVERSE PROBLEMS WITH A DIFFUSION MODEL
Eloi Moliner¹, Jaakko Lehtinen², Vesa Välimäki¹∗
¹Acoustics Lab, Dept. of Information & Communications Eng., ²Dept. of Computer Science
Aalto University, Espoo, Finland
eloi.moliner@aalto.fi
ABSTRACT
This paper presents CQT-Diff, a data-driven generative audio model
that can, once trained, be used for solving various audio inverse
problems in a problem-agnostic setting. CQT-Diff is a
neural diffusion model with an architecture that is carefully con-
structed to exploit pitch-equivariant symmetries in music. This is
achieved by preconditioning the model with an invertible Constant-
Q Transform (CQT), whose logarithmically-spaced frequency axis
represents pitch equivariance as translation equivariance. The pro-
posed method is evaluated with solo piano music, using objective
and subjective metrics in three different tasks: audio
bandwidth extension, inpainting, and declipping. The results show
that CQT-Diff outperforms the compared baselines and ablations in
audio bandwidth extension and, without retraining, delivers compet-
itive performance against modern baselines in audio inpainting and
declipping. This work represents the first diffusion-based general
framework for solving inverse problems in audio processing.
Index Terms—Audio systems, deep learning, inverse problems,
signal restoration.
1. INTRODUCTION
Audio restoration tasks, such as bandwidth extension, inpainting,
and declipping, are inverse problems with the aim of restoring the
signal from observations that have suffered a known type of degrada-
tion. These problems are ill-posed in the sense that several differ-
ent restorations may be equally plausible. Algorithms are typically
constructed for a particular type of degradation using domain knowl-
edge, such as signal sparsity [1, 2], or the low-rankness of the power
spectrum [3]. Data-driven generative models have also been recently
proposed [4, 5]. A common shortcoming is that an algorithm engi-
neered for one type of degradation is typically not useful for others.
In this work, we explore whether a single pre-trained generative
audio model could be useful for different restoration tasks without
knowledge of which types of degradations it will be applied to at in-
ference time. We base our work on diffusion models [6, 7], a family
of generative models that have shown outstanding performance in
different modalities, such as image [8], video [9], speech [10, 11],
and audio generation [12, 13]. Particularly relevant to us is that un-
conditional (purely generative) diffusion models have proven useful
also in conditional restoration settings where parts of the result are
pre-determined, particularly in the image domain [14, 15, 16]. Cru-
cially, as these approaches do not require paired data for training
and the degradation model is only used during inference, the models
adapt to a new problem without retraining. To the best of our knowl-
edge, analogous methods have not been employed in audio
restoration. This paper addresses this research gap.
∗This research is part of the activities of the Nordic Sound and Music
Computing Network—NordicSMC (NordForsk project no. 86892).
Training an unconditional audio diffusion model is the first step
towards this objective. We argue that this is overly challenging un-
less suitable inductive biases are built into the model. While un-
conditional audio generation has been demonstrated using a raw au-
dio waveform representation, the experiments are limited to small
speech datasets [10, 17], or non-tonal audio [12, 18], with perfor-
mance expected to decrease for more challenging datasets. Build-
ing diffusion models using mel-spectrograms or magnitude spectro-
grams [13] leads to more expressive generative performance. How-
ever, the lack of invertibility of these transforms prevents the
conditioning required for solving inverse problems. We design our diffusion
model in the time domain, allowing for maximum flexibility, and use
an invertible time-frequency transform as a preprocessing step.
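As a rough illustration of this design choice, and not the paper's implementation, the following sketch uses librosa's cqt/icqt pair as a stand-in for an invertible time-frequency transform: a waveform is mapped to a complex CQT, where a 2-D network could operate, and mapped back to the time domain. The signal and transform settings are illustrative only.

```python
import numpy as np
import librosa

sr = 22050
y = librosa.tone(440.0, sr=sr, duration=1.0)   # toy time-domain input

# Forward CQT: complex-valued, shape (frequency bins, frames).
C = librosa.cqt(y, sr=sr)

# A 2-D network could operate on C; because the transform is (approximately)
# invertible, its output can always be mapped back to a waveform.
y_rec = librosa.icqt(C, sr=sr, length=len(y))

rel_err = np.linalg.norm(y - y_rec) / np.linalg.norm(y)
print(C.shape, rel_err)   # approximate reconstruction of the input tone
```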
Choosing the right transform plays a critical role in the perfor-
mance of the model. One option is to use a Short-Time Fourier
Transform (STFT) and apply a neural network based on 2-D con-
volutions, adapted from image generation. While this approach suc-
ceeded for speech [19], we argue that, contrary to image processing,
it is non-ideal for signals with strong harmonics (music), as harmonic
structures are not translation-equivariant along a linear frequency
axis. The Constant-Q Transform (CQT), in which the frequency
axis is logarithmically spaced, is a possible solution to this problem.
Since pitch-shifting corresponds to translation in the CQT spectro-
gram, applying the convolutional operator is now appropriate. The
CQT is widely used in Music Information Retrieval [20], but, despite
some exceptions [21], is not common in audio generation tasks.
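To make the pitch-to-translation correspondence concrete, the short sketch below (not code from the paper; librosa is used only for illustration) computes the CQT magnitude of a tone and of the same tone raised by five semitones; with 12 bins per octave, the spectral peak moves by exactly five bins.

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(2 * sr) / sr
f0 = 220.0                                              # A3
tone = np.sin(2 * np.pi * f0 * t)
tone_up = np.sin(2 * np.pi * f0 * 2 ** (5 / 12) * t)    # 5 semitones higher

# CQT with 12 bins per octave: one bin per semitone on a log-frequency axis.
cqt_args = dict(sr=sr, fmin=librosa.note_to_hz("C2"), n_bins=72, bins_per_octave=12)
C = np.abs(librosa.cqt(tone, **cqt_args))
C_up = np.abs(librosa.cqt(tone_up, **cqt_args))

# The pitch shift appears as a translation along the frequency axis:
# the peak should move from bin 21 to bin 26 (5 semitones = 5 bins).
print(C.mean(axis=1).argmax(), C_up.mean(axis=1).argmax())
```

This translation structure is what makes 2-D convolutions, which are translation-equivariant by construction, a natural fit for the CQT representation.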
We first show that CQT preconditioning yields a high-quality
diffusion model for audio. We present a versatile framework called
CQT-Diff for solving audio restoration problems, from which we
evaluate three: bandwidth extension, audio inpainting, and declip-
ping. Pitted against specialized algorithms, our model yields state-
of-the-art or strongly competitive performance in listening tests.
2. CQT-DIFF FRAMEWORK
2.1. Score-based Generative Modeling
Consider a diffusion process where data $x_0 \sim p_{\mathrm{data}}$ is progressively
diffused into Gaussian noise $x_{\tau_{\max}} \sim \mathcal{N}(0, \sigma_{\max}^2 I)$ over time$^1$ $\tau$ by
applying a perturbation kernel $p_\tau(x_\tau) = \mathcal{N}(x_\tau; x_0, \sigma_\tau^2 I)$. Diffu-
sion models, also known as score-based models, generate data sam-
ples by reversing the aforementioned diffusion process. Specifically,
diffusion models estimate the gradient of the log probability density
with respect to the data, $\nabla_{x_\tau} \log p_\tau(x_\tau)$, known as the score function.
$^1$The “diffusion time” $\tau$ must not be confused with the “audio time” $t$.
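As a minimal numerical illustration of the perturbation kernel above (not the paper's training code; the toy sine signal and variable names are for illustration only), the sketch below draws a noisy sample $x_\tau$ and evaluates the closed-form score of the Gaussian kernel, $(x_0 - x_\tau)/\sigma_\tau^2$, which is the quantity a denoising score model is trained to approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" signal x_0: a short 440 Hz sine at 22.05 kHz.
x0 = np.sin(2 * np.pi * 440.0 * np.arange(1024) / 22050)

# Sampling from the perturbation kernel p_tau(x_tau) = N(x_tau; x_0, sigma_tau^2 I)
# amounts to adding scaled white Gaussian noise to the clean signal.
sigma_tau = 0.5
noise = rng.standard_normal(x0.shape)
x_tau = x0 + sigma_tau * noise

# For this Gaussian kernel the score has a closed form,
#   grad_{x_tau} log p_tau(x_tau) = (x_0 - x_tau) / sigma_tau^2,
# which is the target a denoising score model learns to approximate.
score = (x0 - x_tau) / sigma_tau**2
print(np.allclose(score, -noise / sigma_tau))   # True
```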