
SOLVING AUDIO INVERSE PROBLEMS WITH A DIFFUSION MODEL
Eloi Moliner¹, Jaakko Lehtinen², Vesa Välimäki¹∗
¹Acoustics Lab, Dept. of Information & Communications Eng., ²Dept. of Computer Science
Aalto University, Espoo, Finland
eloi.moliner@aalto.fi
ABSTRACT
This paper presents CQT-Diff, a data-driven generative audio model
that can, once trained, be used for solving various audio inverse
problems in a problem-agnostic setting. CQT-Diff is a
neural diffusion model with an architecture that is carefully con-
structed to exploit pitch-equivariant symmetries in music. This is
achieved by preconditioning the model with an invertible Constant-
Q Transform (CQT), whose logarithmically-spaced frequency axis
represents pitch equivariance as translation equivariance. The pro-
posed method is evaluated with solo piano music, using objective
and subjective metrics in three different tasks: audio
bandwidth extension, inpainting, and declipping. The results show
that CQT-Diff outperforms the compared baselines and ablations in
audio bandwidth extension and, without retraining, delivers compet-
itive performance against modern baselines in audio inpainting and
declipping. This work represents the first diffusion-based general
framework for solving inverse problems in audio processing.
Index Terms—Audio systems, deep learning, inverse problems,
signal restoration.
1. INTRODUCTION
Audio restoration tasks, such as bandwidth extension, inpainting,
and declipping, are inverse problems with the aim of restoring the
signal from observations that have suffered a known type of degrada-
tion. These problems are ill-posed in the sense that several differ-
ent restorations may be equally plausible. Algorithms are typically
constructed for a particular type of degradation using domain knowl-
edge, such as signal sparsity [1, 2], or the low-rankness of the power
spectrum [3]. Data-driven generative models have also been recently
proposed [4, 5]. A common shortcoming is that an algorithm engi-
neered for one type of degradation is typically not useful for others.
In this work, we explore whether a single pre-trained generative
audio model could be useful for different restoration tasks without
knowledge of which types of degradations it will be applied to at in-
ference time. We base our work on diffusion models [6, 7], a family
of generative models that have shown outstanding performance in
different modalities, such as image [8], video [9], speech [10, 11],
and audio generation [12, 13]. Particularly relevant to us is that un-
conditional (purely generative) diffusion models have proven useful
also in conditional restoration settings where parts of the result are
pre-determined, particularly in the image domain [14, 15, 16]. Cru-
cially, as these approaches do not require paired data for training
and the degradation model is only used during inference, the models
adapt to a new problem without retraining. To the best of our knowl-
edge, analogous methods have not been employed in audio
restoration. This paper addresses this research gap.
∗This research is part of the activities of the Nordic Sound and Music
Computing Network—NordicSMC (NordForsk project no. 86892).
Training an unconditional audio diffusion model is the first step
towards this objective. We argue that this is overly challenging un-
less suitable inductive biases are built into the model. While un-
conditional audio generation has been demonstrated using a raw au-
dio waveform representation, the experiments are limited to small
speech datasets [10, 17], or non-tonal audio [12, 18], with perfor-
mance expected to decrease for more challenging datasets. Build-
ing diffusion models using mel-spectrograms or magnitude spectro-
grams [13] leads to more expressive generative performance. How-
ever, the lack of invertibility of these transforms prevents the
conditioning required for solving inverse problems. We design our diffusion
model in the time domain, allowing for maximum flexibility, and use
an invertible time-frequency transform as a preprocessing step.
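As a rough illustration of this design choice, and not the paper's implementation, the following sketch uses librosa's cqt/icqt pair as a stand-in for an invertible time-frequency transform: a waveform is mapped to a complex CQT, where a 2-D network could operate, and mapped back to the time domain. The signal and transform settings are illustrative only.

```python
import numpy as np
import librosa

sr = 22050
y = librosa.tone(440.0, sr=sr, duration=1.0)   # toy time-domain input

# Forward CQT: complex-valued, shape (frequency bins, frames).
C = librosa.cqt(y, sr=sr)

# A 2-D network could operate on C; because the transform is (approximately)
# invertible, its output can always be mapped back to a waveform.
y_rec = librosa.icqt(C, sr=sr, length=len(y))

rel_err = np.linalg.norm(y - y_rec) / np.linalg.norm(y)
print(C.shape, rel_err)   # approximate reconstruction of the input tone
```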
Choosing the right transform plays a critical role in the perfor-
mance of the model. One option is to use a Short-Time Fourier
Transform (STFT) and apply a neural network based on 2-D con-
volutions, adapted from image generation. While this approach suc-
ceeded for speech [19], we argue that, contrary to image processing,
it is non-ideal for signals with strong harmonics (music), as harmonic
structures are not translation-equivariant along a linear frequency
axis. The Constant-Q Transform (CQT), in which the frequency
axis is logarithmically spaced, is a possible solution to this problem.
Since pitch-shifting corresponds to translation in the CQT spectro-
gram, applying the convolutional operator is now appropriate. The
CQT is widely used in Music Information Retrieval [20], but, despite
some exceptions [21], is not common in audio generation tasks.
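To make the pitch-to-translation correspondence concrete, the short sketch below (not code from the paper; librosa is used only for illustration) computes the CQT magnitude of a tone and of the same tone raised by five semitones; with 12 bins per octave, the spectral peak moves by exactly five bins.

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(2 * sr) / sr
f0 = 220.0                                              # A3
tone = np.sin(2 * np.pi * f0 * t)
tone_up = np.sin(2 * np.pi * f0 * 2 ** (5 / 12) * t)    # 5 semitones higher

# CQT with 12 bins per octave: one bin per semitone on a log-frequency axis.
cqt_args = dict(sr=sr, fmin=librosa.note_to_hz("C2"), n_bins=72, bins_per_octave=12)
C = np.abs(librosa.cqt(tone, **cqt_args))
C_up = np.abs(librosa.cqt(tone_up, **cqt_args))

# The pitch shift appears as a translation along the frequency axis:
# the peak should move from bin 21 to bin 26 (5 semitones = 5 bins).
print(C.mean(axis=1).argmax(), C_up.mean(axis=1).argmax())
```

This translation structure is what makes 2-D convolutions, which are translation-equivariant by construction, a natural fit for the CQT representation.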
We first show that CQT preconditioning yields a high-quality
diffusion model for audio. We present a versatile framework called
CQT-Diff for solving audio restoration problems, from which we
evaluate three: bandwidth extension, audio inpainting, and declip-
ping. Pitted against specialized algorithms, our model yields state-
of-the-art or strongly competitive performance in listening tests.
2. CQT-DIFF FRAMEWORK
2.1. Score-based Generative Modeling
Consider a diffusion process where data $x_0 \sim p_{\mathrm{data}}$ is progressively
diffused into Gaussian noise $x_{\tau_{\max}} \sim \mathcal{N}(0, \sigma_{\max}^2 I)$ over time$^1$ $\tau$ by
applying a perturbation kernel $p_\tau(x_\tau) = \mathcal{N}(x_\tau; x_0, \sigma_\tau^2 I)$. Diffu-
sion models, also known as score-based models, generate data sam-
ples by reversing the aforementioned diffusion process. Specifically,
diffusion models estimate the gradient of the log probability density
with respect to the data, $\nabla_{x_\tau} \log p_\tau(x_\tau)$, known as the score function.
$^1$The “diffusion time” $\tau$ must not be confused with the “audio time” $t$.
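As a minimal numerical illustration of the perturbation kernel above (not the paper's training code; the toy sine signal and variable names are for illustration only), the sketch below draws a noisy sample $x_\tau$ and evaluates the closed-form score of the Gaussian kernel, $(x_0 - x_\tau)/\sigma_\tau^2$, which is the quantity a denoising score model is trained to approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" signal x_0: a short 440 Hz sine at 22.05 kHz.
x0 = np.sin(2 * np.pi * 440.0 * np.arange(1024) / 22050)

# Sampling from the perturbation kernel p_tau(x_tau) = N(x_tau; x_0, sigma_tau^2 I)
# amounts to adding scaled white Gaussian noise to the clean signal.
sigma_tau = 0.5
noise = rng.standard_normal(x0.shape)
x_tau = x0 + sigma_tau * noise

# For this Gaussian kernel the score has a closed form,
#   grad_{x_tau} log p_tau(x_tau) = (x_0 - x_tau) / sigma_tau^2,
# which is the target a denoising score model learns to approximate.
score = (x0 - x_tau) / sigma_tau**2
print(np.allclose(score, -noise / sigma_tau))   # True
```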