University of Surrey

Test tubes in the lab Research in the ATI Dance Research

Audiovisual Emotion Recognition in an English Database

Haq, S, Jackson, PJB and Edge, J (2008) Audiovisual Emotion Recognition in an English Database In: One Day Meeting for Young Speech Researchers, 2008-07-16, Guildford, surrey, UK.

Full text not available from this repository.


Human communication is based on verbal and nonverbal information, e.g., facial expressions and intonation cue the speaker’s emotional state. Important speech features for emotion recognition are prosody (pitch, energy and duration) and voice quality (spectral energy, formants, MFCCs, jitter/shimmer). For facial expressions, features related to forehead, eye region, cheek and lip are important. Both audio and visual modalities provide relevant cues. Thus, audio and visual features were extracted and combined to evaluate emotion recognition on a British English corpus. The database of 120 utterances was recorded from an actor with 60 markers painted on his face, reading sentences in seven emotions (N=7): anger, disgust, fear, happiness, neutral, sadness and surprise. Recordings consisted of 15 phonetically-balanced TIMIT sentences per emotion, and video of the face captured by a 3dMD system. A total of 106 utterance-level audio features (prosodic and spectral) and 240 visual features (2D marker coordinates) were extracted. Experiments were performed with audio, visual and audiovisual features. The top 40 features were selected by sequential forward backward search using Bhattacharyya distance criterion. PCA and LDA transformations, calculated on the training data, were applied. Gaussian classifiers were trained with PCA and LDA features. Data was jack-knifed with 5 sets for training and 1 set for testing. Results were averaged over 6 tests. The emotion recognition accuracy was higher for visual features than audio features, for both PCA and LDA. Audiovisual results were close to those with visual features. Higher performance was achieved with LDA compared to PCA. The best recognition rate, 98%, was achieved for 6 LDA features (N-1) with audiovisual and visual features, whereas audio LDA scored 53%. Maximum PCA results for audio, visual and audiovisual features were 41%, 97% and 88% respectively. Future work involves experiments with more subjects and investigating the correlation between vocal and facial expressions of emotion.

Item Type: Conference or Workshop Item (Conference Poster)
Divisions : Surrey research (other units)
Authors :
Haq, S
Edge, J
Date : 16 July 2008
Depositing User : Symplectic Elements
Date Deposited : 17 May 2017 12:48
Last Modified : 10 Feb 2020 09:53

Actions (login required)

View Item View Item


Downloads per month over past year

Information about this web site

© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800