University of Surrey

Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data

Kong, Qiuqiang, Xu, Yong, Sobieraj, Iwona, Wang, Wenwu and Plumbley, Mark D. (2019) Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27 (4), pp. 777-787.

This is the latest version of this item.

Text: Sound Event Detection.pdf - Accepted version manuscript. Download (2MB)

Abstract

Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip. Many supervised SED algorithms rely on strongly labelled data which contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied to a T-F representation, such as the log mel spectrogram of an audio clip, to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used for separating the sound events from the background scenes in the time-frequency domain. Then a classification mapping is applied to the T-F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping using a convolutional neural network and the classification mapping using global weighted rank pooling (GWRP). In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound event data. When mixed at 0 dB, the proposed method achieved F1 scores of 0.534, 0.398 and 0.167 in audio tagging, frame-wise SED and event-wise SED, outperforming the fully connected deep neural network baseline of 0.331, 0.237 and 0.120, respectively. In T-F segmentation, we achieved an F1 score of 0.218, where previous methods were not able to do T-F segmentation.
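
The classification mapping described in the abstract pools each class's T-F segmentation mask down to a single clip-level presence probability using global weighted rank pooling (GWRP): the mask's bins are sorted in descending order and averaged with geometrically decaying weights. The snippet below is a minimal NumPy sketch of that pooling step only; the function name, the decay value r = 0.995 and the mask shape are illustrative assumptions, not the authors' implementation.

import numpy as np

def global_weighted_rank_pooling(mask, r=0.995):
    """Collapse a 2-D T-F segmentation mask for one sound class into a
    clip-level presence score via GWRP: bins are ranked in descending
    order and averaged with weights r**0, r**1, r**2, ...
    (Illustrative sketch; r and shapes are assumed, not from the paper.)
    """
    values = np.sort(mask.ravel())[::-1]     # rank T-F bins, highest score first
    weights = r ** np.arange(values.size)    # geometric rank weights
    return float(np.dot(weights, values) / weights.sum())

# Toy usage with a random mask standing in for the segmentation CNN's output for one class.
rng = np.random.default_rng(0)
mask = rng.random((311, 64))                 # e.g. clip frames x mel bins (illustrative shape)
print(f"clip-level presence score: {global_weighted_rank_pooling(mask):.3f}")

Setting r = 1 reduces GWRP to average pooling (which assumes a sound event is active in every T-F bin), while r close to 0 approaches max pooling (active in a single bin); an intermediate r suits weakly labelled data, where the temporal and spectral extent of each event is unknown.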

Item Type: Article
Divisions : Faculty of Engineering and Physical Sciences > Electronic Engineering
Authors :
Kong, Qiuqiang <q.kong@surrey.ac.uk>
Xu, Yong <yong.xu@surrey.ac.uk>
Sobieraj, Iwona <iwona.sobieraj@surrey.ac.uk>
Wang, Wenwu <W.Wang@surrey.ac.uk>
Plumbley, Mark D. <m.plumbley@surrey.ac.uk>
Date : 1 February 2019
Funders : Engineering and Physical Sciences Research Council (EPSRC), European Union’s H2020 Framework Programme
DOI : 10.1109/TASLP.2019.2895254
Copyright Disclaimer : © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Uncontrolled Keywords : Sound event detection; Time-frequency segmentation; Weakly labelled data; Convolutional neural network
Depositing User : Clive Harris
Date Deposited : 18 Feb 2019 13:50
Last Modified : 06 Jul 2019 05:26
URI: http://epubs.surrey.ac.uk/id/eprint/850457
