Evaluation

Measuring the quality of a source separation method is challenging. The previous tutorial describes in detail the two main categories of evaluation: objective and subjective. This section summarizes each of them.

Objective

Objective methods measure the dissimilarity (or similarity) between ground-truth targets and the estimated signals. Source-to-Distortion Ratio (SDR), Source-to-Interference Ratio (SIR), and Source-to-Artifact Ratio (SAR), proposed in [VGFevotte06], are the most widely used objective evaluation methods. The MDX challenge uses SDR as an evaluation metric.

Source-to-Distortion Ratio (SDR)

The Source-to-Distortion Ratio (SDR) measures the ratio of the energy of a source to the energy of the distortion.

[VGFevotte06] assumed that an estimate of a source \(\hat{s}_i\) consists of four components,

\[ \hat{s}_i = s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}, \]

where \(s_{\text{target}}\) is the true source, and \(e_{\text{interf}}\), \(e_{\text{noise}}\), and \(e_{\text{artif}}\) are error terms for interference, noise, and added artifacts, respectively.

Each error term must be computed separately to obtain metrics such as SIR or SAR.

\[ \begin{aligned} s_{\text{distortion}} & = e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}} \\ & = \hat{s}_i - s_{\text{target}} \end{aligned} \]

However, SDR does not require individual terms because distortion is defined as the sum \(e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}\), which is identical to \(\hat{s}_i - s_{\text{target}}\).

\[ \begin{aligned} \text{SDR} & := 10 \log_{10} \left( \frac{\| s_{\text{target}} \|^2}{ \| e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}} \|^2} \right) \\ & = 10 \log_{10} \left( \frac{\| s_{\text{target}} \|^2}{ \| \hat{s}_i - s_{\text{target}} \|^2} \right) \end{aligned} \]

If a result is ideal (i.e., \(\hat{s}_i = s_{\text{target}} \)), the distortion energy is zero and SDR goes to positive infinity. If a result is far from the ground truth, the energy of the distortion grows, which lowers SDR. SDR is usually considered an overall measure of how good a source sounds. If a paper reports only one number for separation quality, it is usually SDR.
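The SDR formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular challenge or toolkit: the function name `sdr` is hypothetical, the signals are assumed to be equal-length arrays, and the explicit handling of the zero-energy cases is an assumption made here so that the ideal case returns positive infinity as described.

```python
import numpy as np


def sdr(reference, estimate):
    """SDR in dB: 10 * log10(||s_target||^2 / ||s_hat - s_target||^2).

    `reference` plays the role of s_target and `estimate` the role of s_hat.
    (Hypothetical helper; the edge-case handling is an assumption.)
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)

    signal_energy = np.sum(reference ** 2)
    distortion_energy = np.sum((estimate - reference) ** 2)

    if distortion_energy == 0.0:
        # Ideal separation: no distortion at all.
        return float("inf")
    if signal_energy == 0.0:
        # Silent reference with a non-silent estimate.
        return float("-inf")
    return float(10.0 * np.log10(signal_energy / distortion_energy))


rng = np.random.default_rng(0)
s_target = rng.standard_normal(44100)   # one second of synthetic "audio"

print(sdr(s_target, s_target))          # inf (ideal estimate)
print(sdr(s_target, 1.1 * s_target))    # approximately 20 dB
```

In the second call the distortion is \(0.1 \, s_{\text{target}}\), so the energy ratio is \(1 / 0.01 = 100\), which corresponds to 20 dB.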

Global vs Framewise Computation

The MDX challenge uses global SDR, which computes the metric once over the entire song. Some existing evaluation tools, such as museval [StoterLI18], instead compute the metric on shorter frames (or windows) and aggregate the per-frame values into a single score, for example by taking their mean or median.
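The difference between the two strategies can be illustrated with a short sketch. It is not the code of the MDX challenge or of museval: the function names, the non-overlapping framing, the mono 1-D signals, and the small constant `eps` that guards against division by zero are all assumptions made for this example.

```python
import numpy as np


def _sdr_db(reference, estimate, eps=1e-12):
    # SDR along the last axis; `eps` is an assumed stabilizer, not part of
    # the definition in the text.
    num = np.sum(reference ** 2, axis=-1)
    den = np.sum((estimate - reference) ** 2, axis=-1)
    return 10.0 * np.log10((num + eps) / (den + eps))


def global_sdr(reference, estimate):
    """One SDR value computed over the whole signal."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    return float(_sdr_db(reference, estimate))


def framewise_sdr(reference, estimate, frame_length):
    """One SDR value per non-overlapping frame (trailing samples dropped)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    n_frames = len(reference) // frame_length
    usable = n_frames * frame_length
    ref_frames = reference[:usable].reshape(n_frames, frame_length)
    est_frames = estimate[:usable].reshape(n_frames, frame_length)
    return _sdr_db(ref_frames, est_frames)


rng = np.random.default_rng(0)
s = rng.standard_normal(4000)

# Small error in the first half, large error in the second half.
gain = np.concatenate([np.full(2000, 1.01), np.full(2000, 2.0)])
s_hat = gain * s

frames = framewise_sdr(s, s_hat, frame_length=1000)
print(frames)                 # roughly [40, 40, 0, 0] dB
print(np.mean(frames))        # roughly 20 dB
print(global_sdr(s, s_hat))   # roughly 3 dB
```

The two strategies can disagree substantially when the error is unevenly distributed over time: the global value is dominated by the high-energy distortion in the second half, while the frame average weights every frame equally.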

Source-to-Artifact Ratio (SAR)

\[ \text{SAR} := 10 \log_{10} \left( \frac{\| s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} \|^2}{ \| e_{\text{artif}} \|^2} \right) \]

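The Source-to-Artifact Ratio (SAR) measures the ratio of the energy of the estimate without its artifacts (the target plus the interference and noise terms) to the energy of the artifacts. Artifacts are unwanted sounds introduced by the separation method itself (e.g., a neural network or another algorithm) rather than by the other sources.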

Source-to-Interference Ratio (SIR)

\[ \text{SIR} := 10 \log_{10} \left( \frac{\| s_{\text{target}} \|^2}{ \| e_{\text{interf}} \|^2} \right) \]

The Source-to-Interference Ratio (SIR) measures the ratio of the energy of a source to the energy of the interference. The more clearly the other sources can be heard in the estimated source, the lower the SIR. It is similar to the concept of “bleed” or “leakage”.
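To make the SIR and SAR definitions concrete, the sketch below splits the error \(\hat{s}_i - s_{\text{target}}\) into an interference term and an artifact term, and then evaluates both ratios. The decomposition is a deliberate simplification and should not be read as the procedure of [VGFevotte06]: it assumes that \(e_{\text{noise}} = 0\), that \(e_{\text{interf}}\) is the least-squares projection of the error onto the other reference sources, and that \(e_{\text{artif}}\) is whatever remains. All function names are hypothetical.

```python
import numpy as np


def decompose_error(estimate, target, other_sources):
    """Split (estimate - target) into (e_interf, e_artif).

    Assumptions of this sketch: e_noise is zero, and the interference is
    the part of the error that a linear combination of the other reference
    sources can explain (least-squares projection).
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Shape (n_others, n_samples), even when a single source is given.
    others = np.atleast_2d(np.asarray(other_sources, dtype=np.float64))

    error = estimate - target
    gains, *_ = np.linalg.lstsq(others.T, error, rcond=None)
    e_interf = others.T @ gains
    e_artif = error - e_interf
    return e_interf, e_artif


def energy_ratio_db(numerator, denominator):
    # No guard against a zero-energy denominator in this sketch.
    return float(10.0 * np.log10(np.sum(numerator ** 2) / np.sum(denominator ** 2)))


def sir_sar(estimate, target, other_sources):
    e_interf, e_artif = decompose_error(estimate, target, other_sources)
    sir = energy_ratio_db(target, e_interf)
    sar = energy_ratio_db(target + e_interf, e_artif)  # e_noise assumed zero
    return sir, sar


rng = np.random.default_rng(0)
vocals = rng.standard_normal(44100)    # target source
drums = rng.standard_normal(44100)     # another source in the mixture
glitches = rng.standard_normal(44100)  # stand-in for model artifacts

# Estimated vocals with some drum bleed and a small amount of artifacts.
vocals_hat = vocals + 0.1 * drums + 0.01 * glitches

sir, sar = sir_sar(vocals_hat, vocals, [drums])
print(sir)  # close to 20 dB
print(sar)  # close to 40 dB
```

Increasing the drum bleed (the `0.1` factor) lowers SIR, while increasing the artifact level (the `0.01` factor) lowers SAR, which matches the intuition given for each metric.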

Subjective

Although many researchers have used the SDR family for quantitative evaluation, SDR does not tell the whole story. As discussed in the previous tutorial, there are cases where SDR results differ considerably from human perception.

Subjective measures, where human evaluators listen to samples and score them, can be an alternative if the number of participants is sufficiently large. Although this usually requires a lot of time and money, and reliable results are hard to obtain, it provides an evaluation that reflects human perception.