University of Surrey

Test tubes in the lab Research in the ATI Dance Research

Machine learning for human performance capture from multi-viewpoint video.

Trumble, Matthew (2019) Machine learning for human performance capture from multi-viewpoint video. Doctoral thesis, University of Surrey.

MTrumble_thesis_Final.pdf - Version of Record
Available under License Creative Commons Attribution Non-commercial Share Alike.

Download (23MB) | Preview


Performance capture is used extensively within the creative industries to efficiently produce high quality, realistic character animation in movies and video games. Existing commercial systems for performance capture are limited to working within constrained environments, requiring wearable visual markers or suits, and frequently specialised imaging devices (e.g. infra-red cameras) both of which limit deployment scenarios (e.g. indoor capture). This thesis explores novel methods to relax these constraints, applying machine learning techniques to estimate human pose using regular video cameras and without the requirement of visible markers on the performer. This unlocks the potential for co-production of principal footage and performance capture data, leading to production efficiencies. For example, using an array of static witness cameras deployed on-set, performance capture data for a video games character accompanying a major movie franchise might be captured at the same time the movie is shot. The need to call the actor for a second day of shooting in a specialised motion capture (mo-cap) facility is avoided, saving time and money, since performance capture was possible without corrupting the principal movie footage with markers or constraining set design. Furthermore, if such performance capture data is available in real-time, the director may immediately pre-visualize the look and feel of the final character animation enabling tighter capture iteration and improved creative direction. This further enhances the potential for production efficiencies. The core technical contributions of this thesis are novel software algorithms that leverage machine learning to fuse of data from multiple sensors – synchronised video cameras, and in some cases, inertial measurement units (IMUs) – in order to robustly estimate human body pose over time, doing so at real-time or near real-time rates. Firstly, a hardware-accelerated capture solution is developed for acquiring coarse volumetric occupancy data from multiple viewpoint video footage, in the form of a probabilistic visual hull (PVH). Using CUDA-based GPU acceleration the PVH may be estimated in real-time, and subsequently used to train machine learning algorithms to infer human skeletal pose from PVH data. Initially a variety of machine learning approaches for skeletal joint pose estimation are explored, contrasting classical and deep inference methods. By quantizing volumetric data into a two-dimensional (2D) spherical histogram representation it is shown that convolutional neural networks (CNN) architectures used traditionally for object recognition may be re-purposed for skeletal joint estimation given suitable a training methodology and data augmentation strategy. The generalization of such architectures to a fully volumetric (3D) CNN is explored, achieving state of the art performance at human pose estimation using an volumetric auto-encoder (hour-glass) architecture that emulates networks traditionally used for de-noising and super-resolution (up-scaling) of 2D data. A framework is developed that is capable of simultaneously estimating human pose from volumetric data, whilst also up-scaling that volumetric data to enable fine-grain estimation of surface detail given a deeply learned prior from previous performance. The method is shown to generalise well even when that prior is learned across different subjects, performing different movements even in different studio camera configurations. Performance can be further improved using a learned temporal model of data, and through the fusion of complementary sensor modalities – video and IMUs – to enhance the accuracy of human pose estimation inferred from a volumetric CNN. Although IMUs have been applied in the performance capture domain for many years, they are prone to drift limiting their use to short capture sequences. The novel fusion of IMU with video data enables improved global localization and so reduced error over time whilst simultaneously mitigating the issues of limb inter-occlusion that can frustrate video-only approaches.

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors :
Trumble, Matthew
Date : 31 January 2019
Funders : Engineering and Physical Sciences Research Council (EPSRC)
DOI : 10.15126/thesis.00850064
Contributors :
Depositing User : Matthew Trumble
Date Deposited : 07 Feb 2019 08:53
Last Modified : 07 Feb 2019 08:54

Actions (login required)

View Item View Item


Downloads per month over past year

Information about this web site

© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800