University of Surrey


Audio-visual tracking of multiple moving speakers.

Kilic, V. (2016) Audio-visual tracking of multiple moving speakers. Doctoral thesis, University of Surrey.

thesis_final_volkan.pdf - Version of Record (19MB)
Available under License Creative Commons Attribution Non-commercial Share Alike.

Author_Deposit_Agreement.pdf (165kB)
Available under License Creative Commons Attribution Non-commercial Share Alike.

Abstract

In this thesis, a novel approach is proposed for multi-speaker tracking that integrates audio and visual data in a particle filtering (PF) framework. The approach is then extended to adaptively estimate two critical parameters of the PF, namely the number of particles and the noise variance, based on the tracking error and the area occupied by the particles in the image. Both methods assume that the number of speakers is known and constant during tracking. To relax this assumption, random finite set (RFS) theory is used, since it can handle the problem of tracking a variable number of speakers. However, its computational complexity increases exponentially with the number of speakers, so the probability hypothesis density (PHD) filter, which is a first-order approximation of the RFS, is applied with a sequential Monte Carlo (SMC), i.e. particle filter, implementation, whose computational complexity increases linearly with the number of speakers. The SMC-PHD filter in visual tracking uses three types of particles (surviving, spawned and born particles) to model the state of the speakers and to estimate the number of speakers. We propose using audio data in the distribution of these particles to improve the visual SMC-PHD filter in terms of estimation accuracy and computational efficiency. The tracking accuracy of the proposed algorithm is further improved with a modified mean-shift algorithm, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. Quantitative evaluation requires both audio and video sequences, together with the calibration information of the cameras and the (circular) microphone arrays. To this end, the AV16.3 dataset is used to demonstrate the performance of the proposed methods in a variety of scenarios, such as occlusion and rapid movements of the speakers.
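
For readers unfamiliar with the particle filtering framework the abstract builds on, the following is a minimal sketch of one predict-update-resample cycle of a generic bootstrap particle filter. It is not the thesis's algorithm: the state layout (2D image positions), the random-walk motion model, the Gaussian likelihood, the function name `particle_filter_step`, and the parameters `motion_std` and `obs_std` are all assumptions made for illustration, and no audio cue or PHD recursion is modelled.

```python
import math
import random


def particle_filter_step(particles, observation, motion_std=5.0, obs_std=10.0, rng=random):
    """One cycle of a generic bootstrap particle filter (illustrative only).

    particles:   list of (x, y) positions (hypothetical state layout).
    observation: (x, y) measurement, e.g. a detected position in the image.
    Returns the resampled particles and the weighted-mean position estimate.
    """
    # Predict: propagate each particle with a random-walk motion model.
    predicted = [(x + rng.gauss(0.0, motion_std), y + rng.gauss(0.0, motion_std))
                 for x, y in particles]

    # Update: weight each particle by a Gaussian likelihood of the observation.
    ox, oy = observation
    weights = [math.exp(-((x - ox) ** 2 + (y - oy) ** 2) / (2.0 * obs_std ** 2))
               for x, y in predicted]
    total = sum(weights)
    n = len(predicted)
    if total == 0.0:
        # Every likelihood underflowed; fall back to uniform weights.
        weights = [1.0 / n] * n
    else:
        weights = [w / total for w in weights]

    # Estimate: weighted mean of the predicted particles.
    est_x = sum(w * x for w, (x, _) in zip(weights, predicted))
    est_y = sum(w * y for w, (_, y) in zip(weights, predicted))

    # Resample: systematic resampling, drawing n particles in proportion
    # to their weights using a single random offset.
    start = rng.random() / n
    resampled = []
    cumulative = weights[0]
    i = 0
    for k in range(n):
        u = start + k / n
        while u > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        resampled.append(predicted[i])

    return resampled, (est_x, est_y)
```

In this sketch the number of particles and `motion_std` are fixed by the caller; these are the two quantities the abstract describes estimating adaptively, and how that adaptation is done is not reproduced here.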

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors :
Author | Email | ORCID
Kilic, V. | vkilic2004@gmail.com | UNSPECIFIED
Date : 29 January 2016
Funders : Turkish government
Contributors :
Contribution | Name | Email | ORCID
Thesis supervisor | Wang, W. | w.wang@surrey.ac.uk | UNSPECIFIED
Thesis supervisor | Kittler, J. | j.kittler@surrey.ac.uk | UNSPECIFIED
Thesis supervisor | Barnard, M. | mark.barnard@surrey.ac.uk | UNSPECIFIED
Depositing User : Volkan Kilic
Date Deposited : 09 Feb 2016 10:36
Last Modified : 09 Feb 2016 10:36
URI: http://epubs.surrey.ac.uk/id/eprint/809761


© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800