Continuous Gesture Segmentation and Recognition using 3DCNN and Convolutional LSTM

Guangming Zhu, Liang Zhang, Peiyi Shen, Juan Song, Syed Afaq Ali Shah, Mohammed Bennamoun

Research output: Contribution to journal › Article

Abstract

Continuous gesture recognition aims to recognize ongoing gestures from continuous gesture sequences, and is more meaningful for practical scenarios where the start and end frames of each gesture instance are generally unknown. This paper presents an effective deep architecture for continuous gesture recognition. First, continuous gesture sequences are segmented into isolated gesture instances using the proposed temporally dilated Res3D network. A balanced squared hinge loss function is proposed to deal with the imbalance between boundary and non-boundary frames. Temporal dilation preserves the temporal information needed for dense boundary detection at fine granularity, and the large temporal receptive field makes the segmentation results more reasonable and effective. Then, a recognition network is constructed from a 3D convolutional neural network (3DCNN), a convolutional Long Short-Term Memory network (ConvLSTM), and a 2D convolutional neural network (2DCNN) for isolated gesture recognition. The "3DCNN-ConvLSTM-2DCNN" architecture is more effective at learning long-term, deep spatiotemporal features. The proposed segmentation and recognition networks achieve a Jaccard index of 0.7163 on the ChaLearn LAP ConGD dataset, 0.106 higher than that of the winner of the 2017 ChaLearn LAP Large-scale Continuous Gesture Recognition Challenge.
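Two technical points in the abstract benefit from a concrete illustration: the class-balanced squared hinge loss used for gesture-boundary detection, and the frame-level Jaccard index used for evaluation on ConGD. The sketch below is illustrative only — the inverse-frequency weighting scheme and the `(start, end)` interval representation are assumptions, not the paper's exact formulation.

```python
import numpy as np

def balanced_squared_hinge(scores, labels):
    """Squared hinge loss with inverse-frequency class weights.

    labels are in {-1, +1}; boundary frames (+1) are far rarer than
    non-boundary frames, so each class is weighted by the other
    class's relative frequency (an assumed balancing scheme)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    pos = labels > 0
    n = labels.size
    weights = np.where(pos, (~pos).sum() / n, pos.sum() / n)
    margins = np.maximum(0.0, 1.0 - labels * scores)
    return float(np.mean(weights * margins ** 2))

def jaccard_index(pred, gt):
    """Frame-level Jaccard index between a predicted and a ground-truth
    gesture interval, each given as (start, end) frame indices, inclusive."""
    pred_frames = set(range(pred[0], pred[1] + 1))
    gt_frames = set(range(gt[0], gt[1] + 1))
    union = pred_frames | gt_frames
    return len(pred_frames & gt_frames) / len(union) if union else 0.0

# A prediction covering frames 10-29 against a ground truth of 15-34:
# intersection is 15 frames (15-29), union is 25 frames (10-34).
print(jaccard_index((10, 29), (15, 34)))  # → 0.6
```

Roughly speaking, the dataset-level ConGD metric averages such per-gesture overlaps across all sequences, which is how a single figure like 0.7163 arises.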

Original language: English
Pages (from-to): 1011-1021
Journal: IEEE Transactions on Multimedia
DOI: 10.1109/TMM.2018.2869278
Publication status: Published - Apr 2019

Cite this

@article{90004a26c1654b37a193c35f89a381d1,
title = "Continuous Gesture Segmentation and Recognition using 3DCNN and Convolutional LSTM",
abstract = "Continuous gesture recognition aims at recognizing the ongoing gestures from continuous gesture sequences, and is more meaningful for the scenarios where the start and end frames of each gesture instance are generally unknown in practical applications. This paper presents an effective deep architecture for continuous gesture recognition. Firstly, continuous gesture sequences are segmented into isolated gesture instances using the proposed temporal dilated Res3D network. A balanced squared hinge loss function is proposed to deal with the imbalance between boundaries and non-boundaries. Temporal dilation can preserve the temporal information for the dense detection of the boundaries at fine granularity, and the large temporal receptive field makes the segmentation results more reasonable and effective. Then, the recognition network is constructed based on the 3D convolutional neural network (3DCNN), the convolutional Long-Short-Term-Memory network (ConvLSTM), and the 2D convolutional neural network (2DCNN) for isolated gesture recognition. The {"}3DCNN-ConvLSTM-2DCNN{"} architecture is more effective to learn long-term and deep spatiotemporal features. The proposed segmentation and recognition networks obtain the Jaccard index of 0.7163 on the Chalearn LAP ConGD dataset, which is 0.106 higher than the winner of 2017 ChaLearn LAP Large-scale Continuous Gesture Recognition Challenge.",
keywords = "3DCNN, Continuous Gesture Recognition, Convolutional LSTM, Dilation, Fasteners, Gesture recognition, Motion segmentation, Proposals, Spatiotemporal phenomena, Three-dimensional displays, Videos",
author = "Guangming Zhu and Liang Zhang and Peiyi Shen and Juan Song and Shah, {Syed Afaq Ali} and Mohammed Bennamoun",
year = "2019",
month = apr,
doi = "10.1109/TMM.2018.2869278",
language = "English",
pages = "1011--1021",
journal = "IEEE Transactions on Multimedia",
issn = "1520-9210",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",

}

Continuous Gesture Segmentation and Recognition using 3DCNN and Convolutional LSTM. / Zhu, Guangming; Zhang, Liang; Shen, Peiyi; Song, Juan; Shah, Syed Afaq Ali; Bennamoun, Mohammed.

In: IEEE Transactions on Multimedia, 04.2019, p. 1011-1021.

Research output: Contribution to journal › Article

TY - JOUR
T1 - Continuous Gesture Segmentation and Recognition using 3DCNN and Convolutional LSTM
AU - Zhu, Guangming
AU - Zhang, Liang
AU - Shen, Peiyi
AU - Song, Juan
AU - Shah, Syed Afaq Ali
AU - Bennamoun, Mohammed
PY - 2019/4
Y1 - 2019/4
N2 - Continuous gesture recognition aims at recognizing the ongoing gestures from continuous gesture sequences, and is more meaningful for the scenarios where the start and end frames of each gesture instance are generally unknown in practical applications. This paper presents an effective deep architecture for continuous gesture recognition. Firstly, continuous gesture sequences are segmented into isolated gesture instances using the proposed temporal dilated Res3D network. A balanced squared hinge loss function is proposed to deal with the imbalance between boundaries and non-boundaries. Temporal dilation can preserve the temporal information for the dense detection of the boundaries at fine granularity, and the large temporal receptive field makes the segmentation results more reasonable and effective. Then, the recognition network is constructed based on the 3D convolutional neural network (3DCNN), the convolutional Long-Short-Term-Memory network (ConvLSTM), and the 2D convolutional neural network (2DCNN) for isolated gesture recognition. The "3DCNN-ConvLSTM-2DCNN" architecture is more effective to learn long-term and deep spatiotemporal features. The proposed segmentation and recognition networks obtain the Jaccard index of 0.7163 on the Chalearn LAP ConGD dataset, which is 0.106 higher than the winner of 2017 ChaLearn LAP Large-scale Continuous Gesture Recognition Challenge.
AB - Continuous gesture recognition aims at recognizing the ongoing gestures from continuous gesture sequences, and is more meaningful for the scenarios where the start and end frames of each gesture instance are generally unknown in practical applications. This paper presents an effective deep architecture for continuous gesture recognition. Firstly, continuous gesture sequences are segmented into isolated gesture instances using the proposed temporal dilated Res3D network. A balanced squared hinge loss function is proposed to deal with the imbalance between boundaries and non-boundaries. Temporal dilation can preserve the temporal information for the dense detection of the boundaries at fine granularity, and the large temporal receptive field makes the segmentation results more reasonable and effective. Then, the recognition network is constructed based on the 3D convolutional neural network (3DCNN), the convolutional Long-Short-Term-Memory network (ConvLSTM), and the 2D convolutional neural network (2DCNN) for isolated gesture recognition. The "3DCNN-ConvLSTM-2DCNN" architecture is more effective to learn long-term and deep spatiotemporal features. The proposed segmentation and recognition networks obtain the Jaccard index of 0.7163 on the Chalearn LAP ConGD dataset, which is 0.106 higher than the winner of 2017 ChaLearn LAP Large-scale Continuous Gesture Recognition Challenge.
KW - 3DCNN
KW - Continuous Gesture Recognition
KW - Convolutional LSTM
KW - Dilation
KW - Fasteners
KW - Gesture recognition
KW - Motion segmentation
KW - Proposals
KW - Spatiotemporal phenomena
KW - Three-dimensional displays
KW - Videos
UR - http://www.scopus.com/inward/record.url?scp=85053134115&partnerID=8YFLogxK
U2 - 10.1109/TMM.2018.2869278
DO - 10.1109/TMM.2018.2869278
M3 - Article
SP - 1011
EP - 1021
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
SN - 1520-9210
ER -