Target Sound Extraction

Imagine yourself in a bustling café, trying to hear your friend speak over music, keyboard clatter, and ambient noise. Your brain effortlessly filters these sounds and focuses on your friend's voice, aided by clues such as their appearance and direction. What if we could train a deep learning model to do the same? Our research uses deep learning to extract a specific target sound from a complex audio mixture, whatever other sounds the mixture contains.

We apply neural network techniques to sound extraction. Our goal is to develop robust models that isolate a target sound from a mixture of many sounds, even in challenging real-world environments with background noise and reverberation. In our recent work, we introduced a Transformer-based model designed specifically for extracting reverberant sounds.

Figure: Proposed Model Architecture

Our approach builds on the Dense Frequency-Time Attentive Network (DeFT-AN), an architecture originally developed for speech enhancement. DeFT-AN estimates a complex short-time Fourier transform (STFT) mask that separates clean speech from noisy, reverberant mixtures. To adapt DeFT-AN to target sound extraction, we modify the architecture so that an embedding vector of the target class label is fused partway through the sequence of DeFT-A blocks that make up DeFT-AN.
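
As a rough illustration of the two ideas in the paragraph above (fusing a class-label embedding between blocks, then applying a complex STFT mask), here is a minimal NumPy sketch. It is not the DeFT-AN implementation: the `block` function is a toy stand-in for a DeFT-A block, fusion by channel-wise multiplication is an assumed choice, and all names, sizes, and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
NUM_CLASSES = 4       # number of target sound classes
CHANNELS = 8          # feature channels inside the network
FRAMES, BINS = 5, 6   # STFT time frames x frequency bins


def block(x, weight):
    """Toy stand-in for a DeFT-A block: channel mixing followed by tanh."""
    return np.tanh(np.einsum("oc,ctf->otf", weight, x))


def fuse_label(features, label, embedding_table):
    """Fuse the target class embedding into a (C, T, F) feature map.

    Assumption: fusion is channel-wise multiplication; the actual model
    may combine the embedding with the features differently.
    """
    emb = embedding_table[label]            # shape (C,)
    return features * emb[:, None, None]    # broadcast over time and frequency


def apply_complex_mask(mixture_stft, mask_real, mask_imag):
    """Multiply the mixture STFT by a complex mask, bin by bin."""
    return (mask_real + 1j * mask_imag) * mixture_stft


def extract(mixture_stft, label, params):
    """Estimate the target STFT from a mixture STFT and a class label."""
    x = np.stack([mixture_stft.real, mixture_stft.imag])   # (2, T, F)
    x = block(x, params["w_in"])                           # (C, T, F)
    x = block(x, params["w_pre"])                          # blocks before fusion
    x = fuse_label(x, label, params["embedding"])          # fusion mid-network
    x = block(x, params["w_post"])                         # blocks after fusion
    mask = np.einsum("oc,ctf->otf", params["w_out"], x)    # (2, T, F)
    return apply_complex_mask(mixture_stft, mask[0], mask[1])


# Random, untrained parameters: the outputs are meaningless as audio and
# only demonstrate the data flow and array shapes.
params = {
    "w_in": rng.standard_normal((CHANNELS, 2)),
    "w_pre": rng.standard_normal((CHANNELS, CHANNELS)),
    "w_post": rng.standard_normal((CHANNELS, CHANNELS)),
    "w_out": rng.standard_normal((2, CHANNELS)),
    "embedding": rng.standard_normal((NUM_CLASSES, CHANNELS)),
}

mixture = rng.standard_normal((FRAMES, BINS)) + 1j * rng.standard_normal((FRAMES, BINS))

out_a = extract(mixture, label=0, params=params)   # condition on class 0
out_b = extract(mixture, label=1, params=params)   # condition on class 1
print(out_a.shape)
```

The same mixture yields different estimates for different labels, which is what lets a single network extract whichever class is requested. Because the mask is complex, it can adjust the phase as well as the magnitude of each time-frequency bin.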

 

The figure above shows our model architecture, and the figure and audio clips below show results of extracting reverberant target sounds. We continue to refine our Transformer-based models to meet the challenges of real-world sound extraction using multiple clues.

Figure: Demonstration

Audio clips for demonstrations (6 seconds each): 0_input_mic_0, 1_gt_mic_0, 2_output_mic_0, 0_multi_input_mic_0, 1_multi_gt_mic_0, 2_multi_output_mic_0
