Diffusion-based generative models have had a high impact on the computer
vision and speech processing communities in recent years. Besides data
generation tasks, they have also been employed for data restoration tasks like
speech enhancement and dereverberation. While discriminative models have
traditionally been argued to be more powerful, e.g., for speech enhancement,
generative diffusion approaches have recently been shown to narrow this
performance gap considerably. In this paper, we systematically compare the
performance of generative diffusion models and discriminative approaches on
different speech restoration tasks. For this, we extend our prior contributions
on diffusion-based speech enhancement in the complex time-frequency domain to
the task of bandwidth extension. We then compare it to a discriminatively
trained neural network with the same network architecture on three restoration
tasks, namely speech denoising, dereverberation and bandwidth extension. We
observe that the generative approach performs better overall than its
discriminative counterpart on all tasks, with the strongest benefit for
non-additive distortion models, as in dereverberation and bandwidth
extension. Code and audio examples can be found online at
https://uhh.de/inf-sp-sgmsemultitask
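
To make the distinction between additive and non-additive distortion models concrete, the following is a minimal time-domain sketch of the three corruption types named above. It is an illustration under stated assumptions, not the data pipeline used in this work: the function names, the SNR-based noise scaling, the truncated convolution, and the FFT-based brick-wall low-pass are all hypothetical choices made for clarity.

```python
import numpy as np


def add_noise(speech, noise, snr_db):
    """Additive distortion (denoising task): y = s + g * n.

    The gain g is chosen so that the mixture reaches the requested SNR in dB.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


def reverberate(speech, rir):
    """Convolutive distortion (dereverberation task): y = s * h.

    The output is truncated to the input length (an assumption made here
    only to keep clean and corrupted signals aligned).
    """
    return np.convolve(speech, rir)[: len(speech)]


def reduce_bandwidth(speech, factor):
    """Band-limiting distortion (bandwidth extension task).

    Zeroes every frequency bin above 1/factor of the original bandwidth,
    i.e. an idealized brick-wall low-pass (hypothetical simplification).
    """
    spectrum = np.fft.rfft(speech)
    cutoff = len(spectrum) // factor
    spectrum[cutoff:] = 0.0
    return np.fft.irfft(spectrum, n=len(speech))
```

Only the first function has the form "clean speech plus an independent signal". In the other two, the corrupted signal depends on the clean speech itself (through filtering or removal of frequency content), which is one way to read the term non-additive here.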
No Creative Commons license