University of Surrey


Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking

Liu, Q, Wang, W, Jackson, PJB, Barnard, M, Kittler, J and Chambers, J (2013) Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking. IEEE Transactions on Signal Processing, 61 (22). pp. 5520-5535.



In existing audio-visual blind source separation (AV-BSS) algorithms, the AV coherence is usually established through statistical modelling, using e.g. Gaussian mixture models (GMMs). These methods often operate in a low-dimensional feature space, rendering an effective global representation of the data. The local information, which is important in capturing the temporal structure of the data, however, has not been explicitly exploited. In this paper, we propose a new method for capturing such local information, based on audio-visual dictionary learning (AVDL). We address several challenges associated with AVDL, including cross-modality differences in size, dimension and sampling rate, as well as the issues of scalability and computational complexity. Following a commonly employed bootstrap coding-learning process, we have developed a new AVDL algorithm which features a bimodality-balanced and scalable matching criterion, a size- and dimension-adaptive dictionary, a fast search index for efficient coding, and cross-modality diverse sparsity. We also show how the proposed AVDL can be incorporated into a BSS algorithm. As an example, we consider binaural mixtures, mimicking aspects of human binaural hearing, and derive a new noise-robust AV-BSS algorithm by combining the proposed AVDL algorithm with Mandel's BSS method, a state-of-the-art audio-domain method using time-frequency masking. We have systematically evaluated the proposed AVDL and AV-BSS algorithms and show their advantages over the corresponding baseline methods, using both synthetic data and visual speech data from the multimodal LILiR Twotalk corpus.
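The separation stage of the paper rests on time-frequency (TF) masking: the mixture is transformed to the TF domain, each TF bin is weighted by the (probabilistic) degree to which it belongs to the target source, and the masked spectrogram is inverted. The sketch below is a minimal toy illustration of soft TF masking only — it is not the authors' AVDL or Mandel's EM-based mask estimation. It assumes an oracle ratio mask computed from known toy sources, purely to show the mask-and-invert mechanics.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)    # toy source 1 (not from the paper)
s2 = np.sin(2 * np.pi * 1800 * t)   # toy source 2
x = s1 + s2                         # instantaneous mixture (no convolution or noise here)

# Transform mixture and (oracle) sources to the TF domain.
_, _, X = stft(x, fs=fs, nperseg=256)
_, _, S1 = stft(s1, fs=fs, nperseg=256)
_, _, S2 = stft(s2, fs=fs, nperseg=256)

# Soft (ratio) mask: fraction of energy in each TF bin attributed to source 1.
# In the paper this probability is estimated, not computed from oracle sources.
mask1 = np.abs(S1) ** 2 / (np.abs(S1) ** 2 + np.abs(S2) ** 2 + 1e-12)

# Apply the mask to the mixture spectrogram and invert back to the time domain.
_, s1_hat = istft(mask1 * X, fs=fs, nperseg=256)
s1_hat = s1_hat[: len(s1)]
```

Because the two toy sources occupy disjoint frequency bins, the ratio mask recovers source 1 almost perfectly; with real convolutive, noisy speech mixtures the mask must be inferred, which is where the audio-visual model contributes.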

Item Type: Article
Authors :
Liu, Q
Wang, W
Jackson, PJB
Barnard, M
Kittler, J
Chambers, J
Date : 15 November 2013
DOI : 10.1109/TSP.2013.2277834
Depositing User : Symplectic Elements
Date Deposited : 28 Mar 2017 13:21
Last Modified : 31 Oct 2017 15:13





© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800