
Method                        Loss (↓)   Acc. (↑)
Random search                 0.861      0.068
Gradient (Adam)               1.553      0.027
Variational optimization      0.891      0.034
Genetic algorithm             0.633      0.114
Differential evolution        0.738      0.068
PGPE [23]                     1.000      0.011
CMA-ES [12]                   0.891      0.027
Metropolis MCMC               0.658      0.091
Bayesian optimization [31]    0.737      0.034
Encoder                       2.131      0.034
Original                      —          0.841

Table 1: Comparison of optimization methods.
Figure 2: Plot of parameters per category. Parameters are normalized to the [0,1] range.
generator, whose pitch and amplitude envelopes are modulated by a mix of low-frequency oscillators
(LFOs). For any non-trivial synthesizer, it is extremely difficult to adjust parameters manually to
reproduce the target sound, necessitating automated optimization techniques.
We first compare several optimization techniques for parameter inference. As the target sound dataset,
we used the animal sounds (e.g., dog, cat, birds) in the first fold of ESC-50 [22], an environmental
sound dataset. We used the multi-scale STFT loss implemented in auraloss [25] as the objective. In
order to assess the quality of reconstruction, we fine-tuned the pretrained VGGish [14] model on the
remaining four folds of ESC-50 on the 50-class sound classification task (which includes both animal
and non-animal categories) and used its accuracy on the reconstructed sounds as the evaluation metric.
Intuitively, if an optimization method achieves
good reconstruction, it can also “fool” the classifier trained on real sounds.
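To make the objective concrete, below is a minimal sketch. Only the auraloss loss class is the actual library API; the `render` function is a toy stand-in for the 78-parameter TorchSynth Voice, included only so the sketch is self-contained and runnable.

```python
import torch
import auraloss

# Multi-resolution STFT loss from auraloss, used as the reconstruction objective.
mrstft = auraloss.freq.MultiResolutionSTFTLoss()

def render(params: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the synthesizer: map a normalized parameter vector to a
    waveform. In the paper this is the 78-parameter TorchSynth Voice; a sine
    oscillator is used here only to keep the sketch self-contained."""
    num_samples = 88_200                        # 2 seconds at 44.1 kHz
    t = torch.linspace(0.0, 2.0, num_samples)
    f0 = 100.0 + 900.0 * params[0]              # toy mapping: first param -> pitch
    return torch.sin(2 * torch.pi * f0 * t).reshape(1, 1, -1)

def objective(params: torch.Tensor, target: torch.Tensor) -> float:
    """Render a sound with `params` and score it against the target waveform
    (shape (1, 1, num_samples)); lower is better."""
    return mrstft(render(params), target).item()
```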
Table 1 shows the optimization techniques and their metrics. See the Supplemental Information for the
details of individual methods. Overall, evolutionary algorithms (the genetic algorithm and differential
evolution) worked relatively well, while gradient-based methods performed poorly because the objective
function is highly nonlinear and complex with respect to the synthesizer parameters. This trend has
also been observed for other highly complex optimization problems [26]. Predicting parameters
directly with a neural network encoder was not successful either, possibly due to the discrepancy
between the real sounds and the training data (sounds artificially generated by TorchSynth itself).
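As an illustration of the evolutionary route, one could run differential evolution over the normalized parameter cube with SciPy, as sketched below. The specific library, the `loss_fn` callable (e.g. a wrapper around the objective sketched above), and the search budgets are assumptions for illustration, not the implementation details of the paper (see the Supplemental Information for those).

```python
import numpy as np
from scipy.optimize import differential_evolution

def fit_parameters(loss_fn, num_params: int = 78, seed: int = 0) -> np.ndarray:
    """Search the normalized [0, 1]^num_params cube for the lowest loss.

    `loss_fn` maps a NumPy parameter vector to a scalar loss, e.g. a wrapper
    around the multi-scale STFT objective sketched earlier."""
    bounds = [(0.0, 1.0)] * num_params
    result = differential_evolution(loss_fn, bounds, maxiter=100, popsize=15,
                                    tol=1e-6, seed=seed)
    return result.x  # best parameter vector found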
One potential way to improve the gradient-based methods is to use a proxy network that simulates the
synthesizer, and replace the original synthesizer with the proxy network in parameter inference [24].
We trained a MelGAN [17]-based proxy network that takes the TorchSynth parameters as input and
the corresponding waveform as the target. However, we found that the resulting reconstructions were
poor. In contrast to audio effects, converting 78 parameters to a 2-second waveform (88,200 samples)
may pose its own challenges due to the large difference in dimensionality between the input and the output.
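To make the proxy idea concrete, a drastically simplified sketch follows: a small MLP (not the MelGAN-based architecture used here) maps parameter vectors to waveforms, and gradients are then taken through the frozen proxy with respect to the parameters. All shapes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxySynth(nn.Module):
    """Toy proxy network: 78 parameters -> 88,200-sample waveform.

    Stand-in for the MelGAN-based proxy; a plain MLP is used only to
    illustrate differentiating through a learned synthesizer surrogate."""
    def __init__(self, num_params: int = 78, num_samples: int = 88_200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_params, 512), nn.ReLU(),
            nn.Linear(512, 2048), nn.ReLU(),
            nn.Linear(2048, num_samples), nn.Tanh(),
        )

    def forward(self, params: torch.Tensor) -> torch.Tensor:
        return self.net(params)

def infer_through_proxy(proxy: ProxySynth, target: torch.Tensor,
                        steps: int = 500, lr: float = 1e-2) -> torch.Tensor:
    """Gradient-based parameter inference using a trained, frozen proxy.

    `target` has shape (1, num_samples); the parameters stay in [0, 1]."""
    params = torch.rand(1, 78, requires_grad=True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(proxy(params), target)
        loss.backward()
        opt.step()
        params.data.clamp_(0.0, 1.0)   # keep the parameters in the valid range
    return params.detach()
```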
After inferring parameters for each sample or class, controlling and generating new sounds is
straightforward. For example, one can change the pitch, de-noise, and/or modify the amplitude envelope
by changing just one or a few synthesizer parameters, as shown in the spectrograms in Figure 1.
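Such edits amount to overwriting an entry of the fitted parameter vector and re-rendering. A small sketch (the parameter index and the `render` callable, e.g. the wrapper sketched earlier, are assumptions):

```python
import torch

def edit_and_rerender(params: torch.Tensor, index: int, new_value: float,
                      render) -> torch.Tensor:
    """Re-render a fitted sound with a single parameter changed.

    `index` is whichever entry of the normalized parameter vector controls the
    property of interest (e.g. oscillator pitch; the exact index depends on the
    TorchSynth Voice layout), and `render` is a callable mapping parameters to
    a waveform."""
    edited = params.clone()
    edited[index] = new_value   # e.g. raise or lower the fundamental frequency
    return render(edited)
```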
3 Interpretation and Generation
One benefit of synthesizer parameter inference is that the fitted parameters are interpretable. In
Figure 2, we choose three animal categories (cow, pig, cat) for which the classifier accuracy is
relatively high and plot the fundamental frequency (f0) and the duration of the note inferred by the
genetic algorithm. The sound instances cluster based on their acoustic properties, i.e., cats make short
and high-pitched sounds while cow vocalizations are longer and lower-pitched.
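A plot of this kind can be produced in a few lines once the relevant parameters have been extracted; the sketch below assumes a dictionary mapping each category name to an (n, 2) array of normalized (f0, duration) values per sound.

```python
import matplotlib.pyplot as plt

def plot_f0_vs_duration(per_class: dict) -> None:
    """Scatter plot of fundamental frequency vs. note duration per category."""
    for name, values in per_class.items():
        plt.scatter(values[:, 1], values[:, 0], label=name)
    plt.xlabel("Note duration (normalized)")
    plt.ylabel("Fundamental frequency f0 (normalized)")
    plt.legend()
    plt.show()
```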
Finally, as a proof of concept, we generated new cat sounds by first fitting a Gaussian distribution over
the inferred cat parameters (green squares in Figure 2) in the unnormalized space, then generating 100
new sounds by sampling from this distribution. The fine-tuned VGGish model correctly classified 18
samples out of 100, demonstrating that we can generate plausible new sounds by sampling synthesizer
parameters from the fitted distribution.
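The generation step admits a direct sketch: fit a multivariate Gaussian to the inferred parameter vectors of one class and draw new vectors from it. The function below is illustrative; the paper fits the Gaussian in the unnormalized parameter space, and range handling and rendering are left to the caller.

```python
import numpy as np

def sample_new_parameters(fitted: np.ndarray, n: int = 100,
                          seed: int = 0) -> np.ndarray:
    """Fit a Gaussian to inferred parameter vectors (one row per sound) and
    draw `n` new parameter vectors from it."""
    rng = np.random.default_rng(seed)
    mean = fitted.mean(axis=0)
    cov = np.cov(fitted, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)
```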