Deep Learning for Skeleton-Based Human Action Recognition
Supervisor: Prof. Thierry Dutoit
Human action recognition from videos has a wide range of applications, including video surveillance and security, human-computer interaction, robotics, and health care. Nowadays, 3D skeleton-based action recognition has drawn increasing attention thanks to the availability of low-cost motion capture devices and large-scale 3D skeleton datasets, as well as real-time skeleton estimation algorithms. In the first part of this thesis, we present a novel representation of motion capture sequences for 3D skeleton-based action recognition. The proposed approach represents 3D skeleton sequences as RGB image-like data and leverages recent convolutional neural networks (CNNs) to model the long-term temporal and spatial structural information for action recognition. Extensive experiments have shown the superiority of the proposed approach over state-of-the-art methods for 3D skeleton-based action recognition.
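The core idea of such a skeleton-to-image encoding can be sketched as follows: the x, y, and z coordinates of each joint at each frame are mapped to the R, G, and B channels of a pixel, and the resulting (frames × joints) pseudo-image is resized to a fixed CNN input size. This is a minimal illustrative sketch, not the exact representation used in the thesis; the normalization and resizing choices here are assumptions.

```python
import numpy as np

def skeleton_to_image(seq, size=(224, 224)):
    """Encode a skeleton sequence (T frames, J joints, 3 coords) as an
    RGB pseudo-image: x, y, z map to the R, G, B channels.
    Illustrative sketch only; normalization details vary in practice."""
    seq = np.asarray(seq, dtype=np.float32)      # (T, J, 3)
    lo = seq.min(axis=(0, 1), keepdims=True)     # per-channel minimum
    hi = seq.max(axis=(0, 1), keepdims=True)     # per-channel maximum
    img = (seq - lo) / (hi - lo + 1e-8)          # scale coords to [0, 1]
    img = (255 * img).astype(np.uint8)           # (T, J, 3) pseudo-image
    # resize to a fixed CNN input size with nearest-neighbor sampling
    rows = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
    return img[rows][:, cols]                    # (H, W, 3)

demo = np.random.rand(40, 25, 3)                 # e.g. 25 Kinect joints
print(skeleton_to_image(demo).shape)             # (224, 224, 3)
```

The resulting image can then be fed to any standard image-classification CNN, so the temporal axis (rows) and the skeletal structure (columns) are both visible to the convolutional filters.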
To extract skeleton sequences, devices first capture depth information using various technologies (stereo, time-of-flight, etc.), and 3D skeleton poses are then estimated from it with dedicated algorithms. In recent years, new research has proposed extracting skeleton sequences directly from RGB videos; the best of these methods extract 2D skeletons in real time with high accuracy.
In the second part of this thesis, we leverage these tools to extend our approach to RGB videos. We first extract 2D skeleton sequences from the videos and then, following approximately the same process as in the first part, use CNNs for human action recognition. Experiments showed that the proposed method outperforms several state-of-the-art methods on a large benchmark dataset.
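One practical difference with 2D skeletons from RGB video is that pixel coordinates vary with camera distance and framing, so a normalization step is typically needed before encoding. The sketch below shows one common scheme, root-centering each frame and rescaling by a reference bone length; the joint indices and the choice of reference bone are hypothetical, and this is a typical preprocessing step rather than the thesis pipeline itself.

```python
import numpy as np

def normalize_pose2d(seq, root=0, ref=(0, 1)):
    """Make 2D skeletons from RGB video comparable across frames:
    center each frame on a root joint and rescale by a reference
    bone length (joints `root` to `ref[1]` here; indices are
    hypothetical). Sketch of a typical preprocessing step."""
    seq = np.asarray(seq, dtype=np.float32).copy()   # (T, J, 2): x, y
    seq -= seq[:, root:root + 1, :]                  # root-center each frame
    bone = np.linalg.norm(seq[:, ref[1], :] - seq[:, ref[0], :], axis=1)
    seq /= bone[:, None, None] + 1e-8                # scale-invariant coords
    return seq

poses = np.random.rand(60, 18, 2) * 480              # pixel coordinates
norm = normalize_pose2d(poses)
print(norm.shape)                                    # (60, 18, 2)
```

After normalization, the 2D sequence can be encoded with the same image-like representation as the 3D case, with the third channel left to whatever signal is available (for instance, a pose-estimator confidence score).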
Another contribution of this thesis relates to the interpretability of deep learning models, which are still often likened to alchemy because of the limited understanding of their internal operations. Interpretability is crucial for understanding and trusting the decisions made by a machine learning model. In the third part of this thesis, we therefore propose to use CNN interpretation methods to understand the behavior of our classifier and to extract the most informative joints during the execution of a particular action. This method lets us see, from the CNN's point of view, which joints matter most, and understand why certain actions are confused by the proposed classifier.
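One simple way to estimate joint informativeness, shown here only as an illustrative stand-in for the CNN interpretation methods used in the thesis, is occlusion analysis: mask each joint in turn and measure how much the classifier's score for the predicted action drops. The `score_fn` below is a toy placeholder, not the thesis classifier.

```python
import numpy as np

def joint_importance(seq, score_fn):
    """Occlusion analysis: zero out each joint across the whole sequence
    and measure the drop in the classifier's score. `score_fn` is any
    callable mapping a (T, J, 3) sequence to a scalar class score."""
    base = score_fn(seq)
    drops = []
    for j in range(seq.shape[1]):
        masked = seq.copy()
        masked[:, j, :] = 0.0                    # occlude joint j
        drops.append(base - score_fn(masked))
    return np.array(drops)                       # higher = more informative

# toy score: total motion of joint 7 (a hypothetical "right hand" index)
toy = lambda s: float(np.abs(np.diff(s[:, 7, :], axis=0)).sum())
seq = np.random.rand(30, 25, 3)
imp = joint_importance(seq, toy)
print(int(imp.argmax()))                         # joint 7 dominates
```

With this toy score, masking joint 7 wipes out the score entirely while masking any other joint changes nothing, so the importance vector correctly singles it out; gradient-based CNN interpretation methods pursue the same goal more efficiently.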