Abstract

With the proliferation of video platforms on the internet, recording musical performances on mobile devices has become commonplace. However, these recordings are often distorted by noise and reverberation, which degrades the listening experience. To address this issue, we propose a music enhancement system based on the conformer architecture, which has demonstrated outstanding performance in speech enhancement tasks. We explore the attention mechanisms of the conformer and examine their performance to find the best configuration for the music enhancement task. Our experimental results show that the proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement on multi-track mixtures, which has not been examined in previous work.

Code: https://github.com/yoongi43/music_audio_enhancement_conformer

Real-world samples (trained on MUSDB18)

[Source: https://www.youtube.com/watch?v=IURmoBrxjEk](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1496fb67-2390-427a-b5c2-9750f3142a55/aespa_video2.mp4)

Source: https://www.youtube.com/watch?v=IURmoBrxjEk

Low-quality

Source: https://www.youtube.com/watch?v=IURmoBrxjEk

Source: https://www.youtube.com/watch?v=RMeX1Xl_6AE

Source: https://www.youtube.com/watch?v=wK3xaG1_sWY

TFC-CPq (ours)

Medley-Solos-DB samples

High-quality

Low-quality

Mel2Mel+Diffwave

est_ref_indep.wav

TFC-CPq (ours)

MUSDB18 samples

Medley-Solos-DB samples (trained on MUSDB18)