Hybrid Demucs
Demucs is based on a U-Net convolutional architecture inspired by Wave-U-Net. The most recent version features hybrid spectrogram/waveform separation, along with compressed residual branches, local attention and singular value regularization. Check out our paper Hybrid Spectrogram and Waveform Source Separation for more details. As far as we know, Demucs is currently the only model supporting true end-to-end hybrid model training with shared information between the domains, as opposed to post-training model blending.
For detailed instructions on how to install, use and train Demucs, check out the Demucs repository.
Dual U-Net Hybrid architecture
To work in the spectrogram and waveform domains at the same time, Hybrid Demucs uses a dual U-Net structure. The temporal branch gradually reduces the number of time steps, while the spectral branch gradually reduces the number of frequency bins. After layer 5, the two representations have the same shape and can be summed. The 6th layer is shared between the two branches. The opposite happens in the decoder.
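To make the shape alignment concrete, here is a minimal sketch of the bookkeeping for both encoders. The specific numbers (a stride of 4 per layer, an STFT hop of 1024 samples, 2048 frequency bins, and a final spectral layer that collapses the remaining 8 bins into one) are assumptions chosen for illustration and are not stated on this page; please check the paper and the repository for the actual values.

```python
def temporal_shape(num_samples, stride=4, layers=5):
    """Time steps left after `layers` strided temporal encoder layers."""
    steps = num_samples
    for _ in range(layers):
        steps //= stride
    return steps


def spectral_shape(num_samples, hop=1024, freq_bins=2048,
                   stride=4, last_stride=8, layers=5):
    """(frequency bins, time frames) left after the spectral encoder."""
    frames = num_samples // hop  # the STFT already reduced the time axis
    bins = freq_bins
    for _ in range(layers - 1):
        bins //= stride          # each layer shrinks the frequency axis
    bins //= last_stride         # assumed: last layer collapses 8 bins to 1
    return bins, frames


# About 10 s of audio at an assumed 44.1 kHz, chosen as a multiple of the hop.
num_samples = 1024 * 430
t_steps = temporal_shape(num_samples)
f_bins, f_frames = spectral_shape(num_samples)
print(t_steps, (f_bins, f_frames))  # 430 (1, 430)
```

Under these assumptions the temporal branch has divided the time axis by 4^5 = 1024, which equals the STFT hop, while the spectral branch is left with a single frequency bin. Both tensors then have one value per channel and per time step, so they can be summed.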
The output of the spectral branch is inverted to a waveform in a differentiable manner and summed with the output of the temporal branch. The loss is thus always computed in the waveform domain and training is completely end-to-end. The model is free to combine both domains and exchange information between them.
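The following toy sketch illustrates how the two branch outputs are combined before the loss. It is not the Demucs implementation: a single-frame DFT written in pure Python stands in for the STFT/ISTFT pair, the L1 loss is an assumption, and the names `hybrid_output` and `l1_loss` are hypothetical. Since the inverse transform is linear, a deep learning framework can backpropagate through it, which is what keeps training end-to-end.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (stand-in for the STFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT (stand-in for the ISTFT), returning real samples."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def hybrid_output(spectral_out, temporal_out):
    """Invert the spectral branch output and add it to the temporal one."""
    return [a + b for a, b in zip(idft(spectral_out), temporal_out)]


def l1_loss(estimate, target):
    """Mean absolute error, computed on waveform samples only."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


target = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
# Pretend each branch recovered half of the target signal.
spectral_out = dft([0.5 * v for v in target])   # lives in the frequency domain
temporal_out = [0.5 * v for v in target]        # lives in the waveform domain
estimate = hybrid_output(spectral_out, temporal_out)
loss = l1_loss(estimate, target)
print(loss < 1e-9)  # True
```

Because the loss only ever sees the summed waveform, the model is not told how to split the work between the two branches.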
We provide hereafter a visual representation of the two-branch U-Net structure.
Fig. 2 Dual U-Net structure of Demucs. The input waveform is processed both through a temporal encoder, and through the STFT followed by a spectral encoder. The two representations are summed when their dimensions align. The decoder is built symmetrically. The output spectrogram goes through the ISTFT and is summed with the waveform outputs, giving the final model output. The ‘Z’ prefix is used for spectral layers, and the ‘T’ prefix for the temporal ones.
Compressed Residual branches
Hybrid Demucs also features new compressed residual branches that increase the expressivity of the model and allow it to handle long-range context. Please refer to the paper for a detailed description of the content of those branches.
Fig. 3 Representation of the compressed residual branches that are added to each encoder layer. For the 5th and 6th layers, a BiLSTM and a local attention layer are added.
Results
We provide hereafter a summary of the different metrics presented in the Hybrid Demucs paper. You can also compare Hybrid Demucs (v3), KUIELAB-MDX-Net, Spleeter, Open-Unmix, Demucs (v1), and Conv-Tasnet on one of my favorite songs on my SoundCloud playlist.
Comparison of accuracy
Overall SDR is the mean of the SDR for each of the 4 sources. MOS Quality is a rating from 1 to 5 of the naturalness and absence of artifacts given by human listeners (5 = no artifacts). MOS Contamination is a rating from 1 to 5, with 5 being zero contamination by other sources. We refer the reader to our paper for more details.
| Model | Domain | Extra data? | Overall SDR | MOS Quality | MOS Contamination |
|---|---|---|---|---|---|
| | waveform | no | 3.2 | - | - |
| | spectrogram | no | 5.3 | - | - |
| | spectrogram | no | 6.0 | - | - |
| | waveform | no | 5.7 | - | - |
| | waveform | no | 6.3 | 2.37 | 2.36 |
| | spectrogram | no | 6.7 | - | - |
| | hybrid | no | 7.5 | 2.86 | 2.55 |
| Hybrid Demucs (v3) | hybrid | no | 7.7 | 2.83 | 3.04 |
| | spectrogram | 804 songs | 6.0 | - | - |
| | spectrogram | 1.5k songs | 6.7 | - | - |
| | spectrogram | 25k songs | 5.9 | - | - |
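As a rough illustration of how the Overall SDR column is obtained, the sketch below computes a plain signal-to-distortion ratio for one source and then averages over four sources. This is a simplified definition and an assumption on our part: the benchmark numbers above come from the evaluation protocol described in the paper, which this sketch does not reproduce. The source names follow the usual four-stem setup (drums, bass, other, vocals) and the per-source scores are hypothetical.

```python
import math


def sdr(reference, estimate):
    """Plain signal-to-distortion ratio in dB for one source.

    Undefined when the estimate equals the reference (zero distortion).
    """
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10(signal / distortion)


def overall_sdr(per_source_sdr):
    """Mean of the per-source SDR values."""
    return sum(per_source_sdr.values()) / len(per_source_sdr)


# An estimate that is off by 10% in amplitude.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
example_sdr = sdr(reference, estimate)
print(round(example_sdr, 3))  # 20.0

# Hypothetical per-source scores, used only to show the averaging step.
scores = {"drums": 5.0, "bass": 6.0, "other": 7.0, "vocals": 8.0}
overall = overall_sdr(scores)
print(overall)  # 6.5
```

A higher SDR means less distortion relative to the reference signal, so a 10% amplitude error corresponds to 20 dB in this simplified setting.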
How to use Hybrid Demucs in Colab
Feel free to use the Colab version: https://colab.research.google.com/drive/1dC9nVxk3V_VPjUADsnFu8EiT-xnU1tGH?usp=sharing