University of Surrey

Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

Xu, Yong, Huang, Qiang, Wang, Wenwu, Foster, Peter, Sigtia, S., Jackson, Philip and Plumbley, Mark (2017) Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25 (6). pp. 1230-1241.

Files:
  XuHuangWangFSJP17-CC-BY.pdf - Version of Record (1MB)
  SRI_deposit_agreement.pdf - Licence (33kB); available under licence: see the attached licence file
  FINAL VERSION.PDF - Accepted Version Manuscript (3MB); restricted to repository staff only
  Taslp_yong_final_v1.pdf - Accepted Version Manuscript (2MB); restricted to repository staff only; available under licence: see the attached licence file

Abstract

Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper we make contributions to audio tagging in two respects: acoustic modeling and feature learning. We propose a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle this multi-label classification task. For acoustic modeling, a large set of contextual frames of the chunk is fed into the DNN to perform multi-label classification for the expected tags, since only chunk-level (or utterance-level) rather than frame-level labels are available. Dropout and background-noise-aware training are also adopted to improve the generalization capability of the DNNs. For unsupervised feature learning, we propose a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter bank (MFB) features. The new features, which are smoothed against background noise and made more compact by incorporating contextual information, further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system achieves a relative 6.7% EER reduction over the strong DNN baseline on the development set. Finally, the results show that our approach obtains state-of-the-art performance, with an EER of 0.15 on the evaluation set of the DCASE 2016 audio tagging task, whereas the first-prize system in the challenge achieved an EER of 0.17.
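
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two components it names: a deep denoising auto-encoder that learns compact features from multi-frame log-MFB input, and a shrinking DNN with dropout that maps those features to per-tag sigmoid scores for multi-label classification. All layer sizes, the corruption level, the context length, and the tag count are illustrative assumptions, not the exact configuration reported in the paper.

import torch
import torch.nn as nn

# Assumed dimensions for illustration only: 40 Mel bands, a 91-frame
# context window, and 7 output tags.
N_MELS, N_FRAMES, N_TAGS = 40, 91, 7
D_IN = N_MELS * N_FRAMES

class AsyDAE(nn.Module):
    """Asymmetric deep denoising auto-encoder: a noisy multi-frame input
    is encoded to a compact bottleneck code; here the decoder reconstructs
    only a single clean frame rather than the full input context."""
    def __init__(self, d_hidden=500, d_code=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(D_IN, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_code), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_code, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, N_MELS),
        )
    def forward(self, x_noisy):
        code = self.encoder(x_noisy)
        return self.decoder(code), code  # reconstruction + learned feature

class ShrinkingDNN(nn.Module):
    """Multi-label tagger: progressively narrower hidden layers with
    dropout, and an independent sigmoid score per tag."""
    def __init__(self, d_in=100, n_tags=N_TAGS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 500), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(500, 250), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(250, n_tags), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

# Toy training step: MSE reconstruction loss for the denoising
# auto-encoder, binary cross-entropy per tag for the chunk-level tagger.
dae, tagger = AsyDAE(), ShrinkingDNN()
mse, bce = nn.MSELoss(), nn.BCELoss()
x_clean = torch.randn(8, D_IN)                        # batch of MFB contexts
centre = x_clean.view(8, N_FRAMES, N_MELS)[:, N_FRAMES // 2, :]  # clean centre frame
tags = torch.randint(0, 2, (8, N_TAGS)).float()       # chunk-level tag labels
recon, code = dae(x_clean + 0.1 * torch.randn_like(x_clean))  # corrupt, then denoise
loss = mse(recon, centre) + bce(tagger(code.detach()), tags)
loss.backward()

Reconstructing a single frame from the multi-frame input is one way to read the "asymmetric" variant; a symmetric DAE would instead reconstruct the full input context. In practice the DAE would be trained first and the tagger then trained on its bottleneck features, matching the two-stage pipeline the abstract describes.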

Item Type: Article
Subjects : Electronic Engineering
Divisions : Faculty of Engineering and Physical Sciences > Electronic Engineering
Authors :
Name             Email                     ORCID
Xu, Yong         yong.xu@surrey.ac.uk      UNSPECIFIED
Huang, Qiang     q.huang@surrey.ac.uk      UNSPECIFIED
Wang, Wenwu      W.Wang@surrey.ac.uk       UNSPECIFIED
Foster, Peter    UNSPECIFIED               UNSPECIFIED
Sigtia, S        UNSPECIFIED               UNSPECIFIED
Jackson, Philip  P.Jackson@surrey.ac.uk    UNSPECIFIED
Plumbley, Mark   m.plumbley@surrey.ac.uk   0000-0002-9708-1075
Date : 8 June 2017
Identification Number : 10.1109/TASLP.2017.2690563
Copyright Disclaimer : © 2017 The Authors. This work is licensed under a Creative Commons Attribution 3.0 License.
Uncontrolled Keywords : Environmental audio tagging, deep neural networks, unsupervised feature learning, deep denoising autoencoder, DCASE 2016.
Depositing User : Symplectic Elements
Date Deposited : 08 Mar 2017 18:59
Last Modified : 31 Oct 2017 19:11
URI: http://epubs.surrey.ac.uk/id/eprint/813726
