Chuan Wen1, Nicolai Pedersen1, Jens Hjortkjær1,2
1Hearing Systems Section, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
2Danish Research Centre for Magnetic Resonance, Centre for Functional and Diagnostic Imaging and Research, Copenhagen University Hospital – Amager and Hvidovre, Copenhagen, Denmark

Acoustic separation of constituent speech streams from multi-talker mixtures is a challenging task for both humans and machines. Recent studies have shown that visual information from the speaker’s face can aid automatic speech separation systems. Audio-visual speech separation systems based on deep neural networks have achieved high speaker-independent performance with only a single microphone channel. However, large-scale datasets are required to obtain satisfactory results, incurring heavy computational overhead. To address this problem, we propose an audio-visual deep-learning-based speech separation framework that decouples the audio-visual fusion process from the separation model. Audio-visual features are first optimized independently using Correlational Neural Networks (CorrNet). Visual features extracted from the fusion stage are subsequently supplied to the separation network, constituting ‘visual hints’ at the target speech. For two-talker mixtures, our audio-visual separation framework achieves an average improvement of 8.09 dB in scale-invariant source-to-distortion ratio (SI-SDR). This performance rivals current state-of-the-art separation systems that rely on substantially larger networks and more training data.
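The SI-SDR metric reported above can be stated concretely. For a reference signal s and an estimate ŝ, the estimate is projected onto the reference to obtain a scaled target αs, and SI-SDR is the energy ratio (in dB) between that target and the residual. The sketch below is a minimal NumPy implementation of this standard definition (mean removal and projection are the conventional choices; the function name and signature are our own, not from the paper's code):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant source-to-distortion ratio in dB.

    Standard definition: project the (zero-mean) estimate onto the
    reference, treat the projection as the target, and compare target
    energy to residual energy.
    """
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant; an "SI-SDR improvement" is the difference between the SI-SDR of the separated output and that of the unprocessed mixture.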