A Multi-modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarities, intra-class variances, the effective combination of multimodal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multimodal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance, achieved by combining a spatial transformer network with a deep convolutional neural network to learn features that are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability, achieved by introducing Fisher encoding within the CNN architecture to learn features that have small inter-class similarities and large intra-class compactness; and 3) multimodal hierarchical fusion, achieved through the regularization of semantic segmentation to a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., image and pixel levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and on live video streams (acquired from Kinect), show that our framework produces superior object and scene classification results compared to state-of-the-art methods.
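
The first postulate pairs a spatial transformer with the CNN so that learned features tolerate translation, rotation, and scale changes. The sketch below is not the authors' implementation; it is a minimal spatial-transformer module assuming PyTorch, where the localization network, channel counts, and kernel sizes are all illustrative placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal spatial transformer: a small localization network predicts a
    2x3 affine matrix, which warps the input so that downstream CNN features
    become less sensitive to translation, rotation, and scale."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.loc_net = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),
        )
        # Start at the identity transform so early training is stable.
        self.loc_net[-1].weight.data.zero_()
        self.loc_net[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc_net(x).view(-1, 2, 3)   # per-image affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

The warped output would then feed the classification network; the paper applies this idea to multimodal RGB-D inputs.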
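The second postulate uses Fisher encoding inside the network to make features compact within a class and separable across classes. Below is a generic (improved) Fisher-vector encoder over local descriptors with a diagonal-covariance GMM, using NumPy and scikit-learn; it illustrates the encoding itself, not how the paper integrates it into the CNN architecture.

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (N x D) as a Fisher vector: gradients of the
    GMM log-likelihood w.r.t. means and variances, with power and L2
    normalization. `gmm` must be fit with covariance_type='diag'."""
    q = gmm.predict_proba(descriptors)               # N x K soft assignments
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    n = descriptors.shape[0]
    d = (descriptors[:, None, :] - mu[None]) / np.sqrt(var)[None]   # N x K x D
    g_mu = (q[:, :, None] * d).sum(0) / (n * np.sqrt(pi)[:, None])
    g_var = (q[:, :, None] * (d ** 2 - 1)).sum(0) / (n * np.sqrt(2 * pi)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)         # L2 normalization

# Hypothetical usage: fit a 16-component GMM on training descriptors, then
# encode each image's local CNN descriptors as one fixed-length vector.
# gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(train_descs)
# fv = fisher_vector(image_descs, gmm)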
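The third postulate fuses image-level and pixel-level class probabilities and resolves them with CRF inference. The toy sketch below only illustrates that fusion idea: the two probability maps are blended into unary potentials and the labeling is smoothed with a 4-connected Potts pairwise term via iterated conditional modes (ICM). The paper's actual CRF formulation and optimizer differ; alpha and beta here are illustrative weights.

import numpy as np

def fuse_and_label(p_pixel, p_image, alpha=0.7, beta=1.0, iters=5):
    """Blend pixel-level class probabilities (H x W x K) with an image-level
    prior (K,) into unary potentials, then smooth the labeling with a Potts
    pairwise term using a few ICM sweeps."""
    unary = -np.log(alpha * p_pixel + (1 - alpha) * p_image[None, None, :] + 1e-12)
    labels = unary.argmin(axis=2)
    h, w, k = unary.shape
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                cost = unary[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        # Potts penalty: pay beta for disagreeing with a neighbor.
                        cost += beta * (np.arange(k) != labels[ny, nx])
                labels[y, x] = cost.argmin()
    return labels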

Original language: English
Pages (from-to): 2051-2065
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 40
Issue number: 9
DOI: 10.1109/TPAMI.2017.2747134
Publication status: Published - 1 Sep 2018

Cite this

@article{ef724fd946e542dba991fa71e8dad602,
title = "A Multi-modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling",
abstract = "While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarities, intra-class variances, the effective combination of multimodal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multimodal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance, achieved by combining a spatial transformer network with a deep convolutional neural network to learn features that are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability, achieved by introducing Fisher encoding within the CNN architecture to learn features that have small inter-class similarities and large intra-class compactness; and 3) multimodal hierarchical fusion, achieved through the regularization of semantic segmentation to a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., image and pixel levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and on live video streams (acquired from Kinect), show that our framework produces superior object and scene classification results compared to state-of-the-art methods.",
keywords = "3D scene labeling, RGB-D object recognition, Semantic segmentation",
author = "Umar Asif and Mohammed Bennamoun and Ferdous Sohel",
year = "2018",
month = "9",
day = "1",
doi = "10.1109/TPAMI.2017.2747134",
language = "English",
volume = "40",
pages = "2051--2065",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
issn = "0162-8828",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
number = "9",

}

TY - JOUR

T1 - A Multi-modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling

AU - Asif, Umar

AU - Bennamoun, Mohammed

AU - Sohel, Ferdous

PY - 2018/9/1

Y1 - 2018/9/1

N2 - While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarities, intra-class variances, the effective combination of multimodal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multimodal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance, achieved by combining a spatial transformer network with a deep convolutional neural network to learn features that are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability, achieved by introducing Fisher encoding within the CNN architecture to learn features that have small inter-class similarities and large intra-class compactness; and 3) multimodal hierarchical fusion, achieved through the regularization of semantic segmentation to a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., image and pixel levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and on live video streams (acquired from Kinect), show that our framework produces superior object and scene classification results compared to state-of-the-art methods.

AB - While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarities, intra-class variances, the effective combination of multimodal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multimodal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance, achieved by combining a spatial transformer network with a deep convolutional neural network to learn features that are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability, achieved by introducing Fisher encoding within the CNN architecture to learn features that have small inter-class similarities and large intra-class compactness; and 3) multimodal hierarchical fusion, achieved through the regularization of semantic segmentation to a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., image and pixel levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and on live video streams (acquired from Kinect), show that our framework produces superior object and scene classification results compared to state-of-the-art methods.

KW - 3D scene labeling

KW - RGB-D object recognition

KW - Semantic segmentation

UR - http://www.scopus.com/inward/record.url?scp=85028695975&partnerID=8YFLogxK

U2 - 10.1109/TPAMI.2017.2747134

DO - 10.1109/TPAMI.2017.2747134

M3 - Article

VL - 40

SP - 2051

EP - 2065

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

SN - 0162-8828

IS - 9

ER -