
WAVEFIT: AN ITERATIVE AND NON-AUTOREGRESSIVE NEURAL VOCODER
BASED ON FIXED-POINT ITERATION
Yuma Koizumi¹, Kohei Yatabe², Heiga Zen¹, Michiel Bacchiani¹
¹Google Research, Japan  ²Tokyo University of Agriculture and Technology, Japan
ABSTRACT
Denoising diffusion probabilistic models (DDPMs) and generative
adversarial networks (GANs) are popular generative models for neu-
ral vocoders. The DDPMs and GANs can be characterized by the
iterative denoising framework and adversarial training, respectively.
This study proposes a fast and high-quality neural vocoder called
WaveFit, which integrates the essence of GANs into a DDPM-like
iterative framework based on fixed-point iteration. WaveFit iteratively
denoises an input signal, and its deep neural network (DNN) is trained
to minimize an adversarial loss calculated from the intermediate
outputs at all iterations. Subjective (side-by-side) listening tests
showed no statistically significant differences in naturalness between
natural human speech and speech synthesized by WaveFit with five
iterations. Furthermore, the inference speed of WaveFit was more
than 240 times that of WaveRNN. Audio demos are available at
google.github.io/df-conformer/wavefit/.
Index Terms—Neural vocoder, fixed-point iteration, generative
adversarial networks, denoising diffusion probabilistic models.
1. INTRODUCTION
Neural vocoders [1–4] are artificial neural networks that generate a
speech waveform given acoustic features. They are indispensable
building blocks of recent applications of speech generation. For
example, they are used as the backbone module in text-to-speech
(TTS) [5–10], voice conversion [11, 12], speech-to-speech trans-
lation (S2ST) [13–15], speech enhancement (SE) [16–19], speech
restoration [20, 21], and speech coding [22–25]. Autoregressive
(AR) models first revolutionized the quality of speech generation [1, 26–28].
However, because they require a large number of sequential operations
for generation, parallelizing the computation is not trivial; as a
result, their processing time is sometimes far longer than the
duration of the output signals.
To speed up inference, non-AR models have gained a lot of attention
thanks to their parallelization-friendly model architectures. Early
successful non-AR models are those based on normalizing flows [3, 4, 29],
which convert input noise into speech using stacked invertible deep
neural networks (DNNs) [30]. In the last few years, the approach using
generative adversarial networks (GANs) [31] has been the most successful
non-AR strategy [32–41]; here, generators are trained to produce speech
waveforms that discriminator networks cannot distinguish from natural
human speech. The latest member of the generative models for neural
vocoders is the denoising diffusion probabilistic model (DDPM) [42–49],
which converts random noise into a speech waveform by an iterative
sampling process, as illustrated in Fig. 1 (a). With hundreds of
iterations, DDPMs can generate speech waveforms comparable to those of
AR models [42, 43].
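The iterative sampling process described above can be sketched in a few lines. The following is a minimal illustration in plain Python, assuming the standard DDPM ancestral-sampling update from the general DDPM literature with the variance choice sigma_t^2 = beta_t; the cited vocoders may use different schedules or parameterizations. The names `ddpm_sample`, `make_oracle`, and `predict_noise` are hypothetical, and the oracle predictor, which peeks at the clean signal and ignores the conditioning, is only a stand-in for a trained, conditioned DNN.

```python
import math
import random


def ddpm_sample(predict_noise, cond, num_samples, betas, rng):
    """Toy DDPM ancestral sampling: start from noise y_T and refine down to y_0.

    predict_noise(y, cond, t, alpha_bar) is an assumed interface for the
    noise-prediction network; it returns one noise estimate per sample.
    """
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    prod = 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)  # cumulative product of alphas

    # y_T: pure Gaussian noise.
    y = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

    for t in range(len(betas), 0, -1):
        beta, alpha, abar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps = predict_noise(y, cond, t, abar)
        coef = beta / math.sqrt(1.0 - abar)
        # Denoised posterior mean for step t.
        mean = [(yi - coef * ei) / math.sqrt(alpha) for yi, ei in zip(y, eps)]
        if t > 1:
            # Noise is re-injected at every step except the last one.
            sigma = math.sqrt(beta)
            y = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            y = mean
    return y


def make_oracle(x0):
    """Hypothetical stand-in for a trained network: returns the exact noise in y."""
    def predict_noise(y, cond, t, alpha_bar):
        scale = math.sqrt(1.0 - alpha_bar)
        return [(yi - math.sqrt(alpha_bar) * xi) / scale for yi, xi in zip(y, x0)]
    return predict_noise
```

Calling `ddpm_sample(make_oracle(x0), None, len(x0), betas, random.Random(0))` with, for example, a linearly increasing list `betas` runs one network evaluation per entry of `betas`, which is why the number of iterations directly drives the computational cost.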
[Figure 1: three panels, (a) DDPM-based models, (b) GAN-based models, (c) WaveFit. Labels recoverable from the figure: noise generation, output, GAN loss, MMSE; symbols $y_T$, $y_t$, $y_{t-1}$, $y_0$, $x_0$, $c$, $\mathcal{F}$, $\mathcal{G}(\mathrm{Id} - \mathcal{F})$, $p(y_{t-1} \mid y_t, c)$.]
Fig. 1. Overview of (a) DDPM, (b) GAN-based model, and (c) proposed
WaveFit. (a) DDPM is an iterative-style model, where sampling from the
posterior is realized by adding noise to the denoised intermediate
signals. (b) GAN-based models predict $y_0$ by a non-iterative DNN
$\mathcal{F}$, which is trained to minimize an adversarial loss
calculated from $y_0$ and the target speech $x_0$. (c) Proposed WaveFit
is an iterative-style model without adding noise at each iteration, and
$\mathcal{F}$ is trained to minimize an adversarial loss calculated from
all intermediate signals $y_{T-1}, \ldots, y_0$, where $\mathrm{Id}$ and
$\mathcal{G}$ denote the identity operator and a gain adjustment
operator, respectively.
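The update in panel (c) can be read as repeatedly applying the mapping G(Id − F) to the current signal, with no noise added between iterations. Below is a minimal sketch in plain Python under stated assumptions: `gain_adjust` is a hypothetical form of the gain adjustment operator G (here it rescales to a given target power), and `make_toy_denoiser` is a toy stand-in for the trained DNN F that removes half of the remaining error toward a known clean signal. Neither is taken from the paper; they only make the loop runnable.

```python
import math
import random


def power(sig):
    """Mean power of a signal given as a list of floats."""
    return sum(v * v for v in sig) / len(sig)


def gain_adjust(sig, target_power):
    """Hypothetical G: rescale sig so that its mean power equals target_power."""
    p = power(sig)
    if p == 0.0:
        return list(sig)
    g = math.sqrt(target_power / p)
    return [g * v for v in sig]


def wavefit_iterate(denoiser, y_T, cond, target_power, num_iters):
    """Apply y_{t-1} = G(y_t - F(y_t, c)) num_iters times.

    Returns every intermediate output [y_{T-1}, ..., y_0], so that a
    training loss could be evaluated on all of them.
    """
    outputs = []
    y = list(y_T)
    for _ in range(num_iters):
        noise_estimate = denoiser(y, cond)  # F(y_t, c)
        denoised = [yi - ni for yi, ni in zip(y, noise_estimate)]  # (Id - F)(y_t)
        y = gain_adjust(denoised, target_power)  # G(...)
        outputs.append(y)
    return outputs


def make_toy_denoiser(x0, step=0.5):
    """Toy stand-in for the DNN F: estimates a fraction of the error to x0."""
    def denoiser(y, cond):
        return [step * (yi - xi) for yi, xi in zip(y, x0)]
    return denoiser
```

With this toy denoiser, the clean signal `x0` is a fixed point of the mapping: F returns zero there and G leaves the power unchanged, so applying one more iteration to `x0` gives `x0` back. Starting from random noise, each intermediate output moves closer to that fixed point.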
Since a DDPM-based neural vocoder iteratively refines a speech
waveform, there is a trade-off between its sound quality and
computational cost [42]; i.e., tens of iterations are required to
achieve a high-fidelity speech waveform. To reduce the number of
iterations while maintaining quality, existing studies of DDPMs have
investigated the inference noise schedule [44], the use of an adaptive
prior [45, 46], the network architecture [47, 48], and the training
strategy [49]. However, generating a speech waveform with quality
comparable to natural human speech in a few iterations is still
challenging.
Recent studies demonstrated that the essence of DDPMs and
GANs can coexist [50, 51]. Denoising diffusion GANs [50] use a
generator to predict a clean sample from a diffused one, and a
discriminator to differentiate the diffused samples from the clean
or predicted ones. This strategy was applied to TTS, specifically to
predict a log-mel spectrogram given an input text [51]. As DDPMs
and GANs can be combined in several different ways, there may be a
new combination that achieves high-quality synthesis
with a small number of iterations.
This study proposes WaveFit, an iterative-style non-AR neural
vocoder, trained using a GAN-based loss as illustrated in Fig. 1 (c).
It is inspired by the theory of fixed-point iteration [52]. The pro-
posed model iteratively applies a DNN as a denoising mapping that
arXiv:2210.01029v1 [eess.AS] 3 Oct 2022