University of Surrey


Analysis, Modelling and Animation of Emotional Speech in 3D.

Nadtoka, Nataliya. (2011) Analysis, Modelling and Animation of Emotional Speech in 3D. Doctoral thesis, University of Surrey (United Kingdom).

Available under License Creative Commons Attribution Non-commercial Share Alike.



This thesis investigates the problem of producing perceptually realistic facial animation of expressions and speech. It spans several areas of work, from the capture and representation of facial dynamics through to the analysis and synthesis of expressive 3D animation sequences. For this purpose, a database of 3D facial scans was collected from 16 subjects, each performing 7 expressions: Ekman’s set of 6 cross-culturally recognised emotions and a neutral emotion. Several representations of facial expressions are compared, including the morphable model and its extension to tensor space. A multilinear tensor-based morphable model is a powerful tool because it permits independent control of identity and expression. However, its high computational cost and non-intuitive set of parameters motivated us to opt for a standard 3D morphable modelling approach. We propose a novel algorithm for mapping between motion capture data, projected to a spatially low-resolution (19 markers) 3D model space, and a spatially high-resolution (3300 vertices and colour texture) 3D morphable model space. This radial basis function based mapping preserves the temporal characteristics of the motion capture data and the level of detail of the high-resolution 3D scans. The single-subject model is extended to animate other subjects based on a single 3D scan or a photograph. An additional model is needed to represent the variation between individual expression styles.

The relation between audio and visual features is analysed using a 4D dataset of expressive speech. The dataset consists of 3D scans of a single subject, recorded at 60 Hz, with synchronised audio at 44.1 kHz. The speech corpus contains 235 phonetically balanced expressive English sentences, recorded in 6 emotions and neutral. Audio features consist of fundamental frequency (F0), duration, energy and Mel-frequency cepstral coefficients. The face was separated into overlapping facial regions, and the visual signal was then used to compute temporal visual features for each region. We concentrate on the upper face region because of its high expressive content and lesser contamination by articulation. Phoneme-, word- and sentence-level audio-visual analysis is performed within each emotional category and across all emotional categories. Although initial results show a promising connection between the dynamics of audio and visual features for some emotions, significant intra-class variation exists for the others. Results demonstrate that the dynamics and intensity of expressive content within and across sentences are highly influenced by their linguistic content. This work shows that the effect of temporal variation of expressive content is statistically significant and should be taken into account in visual speech synthesis. Further investigation is necessary with a more controlled setup. This thesis provides the foundation for further research towards understanding the connection between expressive content and visual dynamics during speech, and towards achieving perceptually realistic animation of a talking head.
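
The radial basis function based mapping mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract does not state the kernel, its width, or how training pairs are chosen, so the Gaussian kernel, the `sigma` parameter, the toy dimensions, and the function names `fit_rbf` and `apply_rbf` are all assumptions made for illustration. The idea shown is only that a few paired examples (a low-resolution marker vector and the matching high-resolution model parameters) define an interpolant, which is then applied frame by frame so that the timing of the motion capture sequence carries over unchanged.

```python
import math

def gaussian_kernel(r, sigma):
    # Assumed kernel; the abstract does not say which basis function is used.
    return math.exp(-(r * r) / (2.0 * sigma * sigma))

def distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def solve(matrix, rhs):
    # Solve matrix @ X = rhs by Gauss-Jordan elimination with partial pivoting.
    # rhs may have several columns (one per output dimension).
    n = len(matrix)
    aug = [list(row) + list(r) for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("kernel matrix is singular (duplicate centres?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_rbf(centres, targets, sigma=1.0):
    # centres: low-resolution vectors (hypothetical marker-space samples).
    # targets: matching high-resolution vectors (hypothetical model parameters).
    phi = [[gaussian_kernel(distance(a, b), sigma) for b in centres]
           for a in centres]
    return solve(phi, targets)

def apply_rbf(x, centres, weights, sigma=1.0):
    # Evaluate the interpolant at one low-resolution vector x.
    k = [gaussian_kernel(distance(x, c), sigma) for c in centres]
    dims = len(weights[0])
    return [sum(k[j] * weights[j][d] for j in range(len(centres)))
            for d in range(dims)]

# Toy data: 2-D "marker" vectors paired with 3-D "model" vectors.
centres = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
targets = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [-1.0, 0.5, 4.0]]
weights = fit_rbf(centres, targets, sigma=1.0)

# Map a short trajectory one frame at a time; the frame order and count
# of the input sequence are kept, so its timing is unchanged.
trajectory = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
mapped = [apply_rbf(frame, centres, weights, sigma=1.0) for frame in trajectory]
```

In a real setting the input vectors would be far larger (for example, 19 markers with three coordinates each) and a numerical library would replace the hand-written solver; the pure-Python version is used here only to keep the sketch self-contained.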

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors : Nadtoka, Nataliya.
Date : 2011
Additional Information : Thesis (Ph.D.)--University of Surrey (United Kingdom), 2011.
Depositing User : EPrints Services
Date Deposited : 06 May 2020 14:15
Last Modified : 06 May 2020 14:17


© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800