University of Surrey


Robust multi-speaker tracking via dictionary learning and identity modeling

Barnard, M, Wang, W, Kittler, J, Koniusz, P, Naqvi, SM and Chambers, J (2014) Robust multi-speaker tracking via dictionary learning and identity modeling IEEE Transactions on Multimedia, 16 (3). pp. 864-880.


We investigate the problem of visual tracking of multiple human speakers in an office environment. In particular, we propose novel solutions to the following challenges: (1) robust and computationally efficient modeling and classification of the changing appearance of the speakers under a variety of lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialization of the trackers, or re-initialization when the trackers have lost lock owing to, for example, limited camera views. First, we develop new algorithms for appearance modeling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries are used to generate likelihood functions based on Support Vector Machine (SVM) classification. The resulting likelihood function is then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) is proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model is proposed to track multiple speakers whilst dealing with occlusions. This model is updated online using Maximum a Posteriori (MAP) adaptation, where we control the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialization of the visual trackers, we exploit audio information, namely the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provides, a priori, the number of speakers and constrains the search space for the speakers' faces.
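The histogram encoding step can be illustrated with a minimal sketch of generic locality-constrained soft assignment: each patch descriptor votes only for its k nearest dictionary atoms, with weights that decay with squared distance. This is not the authors' approximate LcSA variant, whose details the abstract does not give. The function names, the parameters `k` and `beta`, and the average pooling are all assumptions made for illustration.

```python
import math


def soft_assign(descriptor, dictionary, k=2, beta=1.0):
    """Code one descriptor: only its k nearest atoms receive non-zero weight."""
    sq_dists = [
        sum((d - a) ** 2 for d, a in zip(descriptor, atom))
        for atom in dictionary
    ]
    # Indices of the k nearest atoms (the locality constraint).
    nearest = sorted(range(len(dictionary)), key=lambda i: sq_dists[i])[:k]
    # Subtracting the smallest distance avoids underflow and leaves the
    # normalised weights unchanged.
    d_min = sq_dists[nearest[0]]
    raw = {i: math.exp(-beta * (sq_dists[i] - d_min)) for i in nearest}
    total = sum(raw.values())
    code = [0.0] * len(dictionary)
    for i, w in raw.items():
        code[i] = w / total
    return code


def encode_histogram(descriptors, dictionary, k=2, beta=1.0):
    """Average-pool the per-descriptor codes into one histogram (assumed pooling)."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    hist = [0.0] * len(dictionary)
    for desc in descriptors:
        for i, c in enumerate(soft_assign(desc, dictionary, k, beta)):
            hist[i] += c
    return [h / len(descriptors) for h in hist]


# Toy 2-D dictionary with four atoms (hypothetical data).
toy_dictionary = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_code = soft_assign([0.1, 0.0], toy_dictionary, k=2)
toy_hist = encode_histogram([[0.1, 0.0], [4.9, 5.0]], toy_dictionary, k=2)
```

Because each descriptor touches only k atoms, the cost of accumulating the histogram grows with k rather than with the dictionary size, which is the kind of saving the abstract refers to.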
The proposed system is tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects. © 2014 IEEE.
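As a rough illustration of where a likelihood function enters the measurement step of a particle filter, the sketch below runs one predict, weight and resample cycle over 2-D positions. It is a generic bootstrap filter under stated assumptions, not the paper's implementation: the random-walk motion model, the multinomial resampling, the name `pf_step` and the toy Gaussian likelihood are all hypothetical. In the paper the likelihood is derived from SVM classification of patch histograms, which is replaced here by a simple stand-in.

```python
import math
import random


def pf_step(particles, weights, likelihood, motion_std=1.0, rng=random):
    """One predict / measure / resample cycle of a bootstrap particle filter."""
    n = len(particles)
    # Prediction: random-walk motion model (an assumption of this sketch).
    predicted = [
        (x + rng.gauss(0.0, motion_std), y + rng.gauss(0.0, motion_std))
        for x, y in particles
    ]
    # Measurement: scale each weight by the likelihood of its particle.
    new_weights = [w * likelihood(p) for w, p in zip(weights, predicted)]
    total = sum(new_weights)
    if total <= 0.0:
        # Every likelihood was zero: fall back to uniform weights.
        new_weights = [1.0 / n] * n
    else:
        new_weights = [w / total for w in new_weights]
    # State estimate: weighted mean of the predicted particles.
    estimate = (
        sum(w * p[0] for w, p in zip(new_weights, predicted)),
        sum(w * p[1] for w, p in zip(new_weights, predicted)),
    )
    # Multinomial resampling, after which the weights are uniform again.
    resampled = rng.choices(predicted, weights=new_weights, k=n)
    return resampled, [1.0 / n] * n, estimate


def toy_likelihood(p):
    """Stand-in for a classifier-based likelihood: peak at (5, 5)."""
    return math.exp(-((p[0] - 5.0) ** 2 + (p[1] - 5.0) ** 2) / 2.0)


rng = random.Random(0)
n_particles = 500
particles = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(n_particles)]
weights = [1.0 / n_particles] * n_particles
for _ in range(5):
    particles, weights, estimate = pf_step(
        particles, weights, toy_likelihood, motion_std=0.5, rng=rng
    )
```

Swapping `toy_likelihood` for a function that maps a classifier score at each candidate position to a non-negative value would give the overall structure described in the abstract, although how the authors perform that mapping is not stated here.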

Item Type: Article
Authors : Barnard, M, Wang, W, Kittler, J, Koniusz, P, Naqvi, SM and Chambers, J
Date : April 2014
Identification Number : 10.1109/TMM.2014.2301977
Depositing User : Symplectic Elements
Date Deposited : 28 Mar 2017 13:11
Last Modified : 31 Oct 2017 16:57


© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800