Viewpoint invariant semantic object and scene categorization with RGB-D sensors

Hasan F.M. Zaki, Faisal Shafait, Ajmal Mian

Research output: Contribution to journal › Article


Abstract

Understanding the semantics of objects and scenes using multi-modal RGB-D sensors serves many robotics applications. Key challenges for accurate RGB-D image recognition are the scarcity of training data, variations due to viewpoint changes, and the heterogeneous nature of the data. We address these problems and propose a generic deep learning framework based on a pre-trained convolutional neural network as a feature extractor for both the colour and depth channels. We propose a rich multi-scale feature representation, referred to as the convolutional hypercube pyramid (HP-CNN), that encodes discriminative information from the convolutional tensors at different levels of detail. We also present a technique to fuse the proposed HP-CNN with the activations of fully connected neurons using an extreme learning machine classifier in a late fusion scheme, which leads to a highly discriminative and compact representation. To further improve performance, we devise HP-CNN-T, a view-invariant descriptor extracted from a multi-view 3D object pose (M3DOP) model. M3DOP is learned from over 140,000 RGB-D images that are synthetically generated by rendering CAD models from different viewpoints. Extensive evaluations on four RGB-D object and scene recognition datasets demonstrate that our HP-CNN and HP-CNN-T consistently outperform state-of-the-art methods on several recognition tasks by a significant margin.
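To give a concrete sense of the kind of descriptor the abstract describes, the following is a minimal, illustrative Python/NumPy sketch of a hypercube-pyramid style feature: activation tensors from several convolutional layers of a pre-trained CNN are max-pooled over a spatial pyramid and concatenated into a single fixed-length vector. The layer shapes, pyramid levels, and pooling choices are assumptions for illustration only, not the paper's exact configuration, and the paper's late-fusion step (combining this descriptor with fully connected activations via an extreme learning machine classifier) is not shown.

import numpy as np

def spatial_pyramid_pool(tensor, levels=(1, 2, 4)):
    """Max-pool a (H, W, C) activation tensor over a spatial pyramid and
    return the concatenated, L2-normalised cell responses."""
    h, w, _ = tensor.shape
    pooled = []
    for level in levels:
        # Partition the spatial grid into level x level cells and max-pool each cell.
        rows = np.array_split(np.arange(h), level)
        cols = np.array_split(np.arange(w), level)
        for r in rows:
            for c in cols:
                cell = tensor[np.ix_(r, c)]            # (h_cell, w_cell, C)
                pooled.append(cell.max(axis=(0, 1)))   # per-channel maxima, shape (C,)
    feat = np.concatenate(pooled)
    return feat / (np.linalg.norm(feat) + 1e-12)

def hypercube_pyramid_descriptor(conv_tensors):
    """Concatenate pyramid-pooled features from several convolutional layers;
    in the paper's setting this would be done for both the RGB and depth streams."""
    return np.concatenate([spatial_pyramid_pool(t) for t in conv_tensors])

if __name__ == "__main__":
    # Stand-ins for conv-layer activations of one image at three depths of a
    # pre-trained CNN (hypothetical, AlexNet-like shapes).
    rng = np.random.default_rng(0)
    fake_layers = [rng.standard_normal((27, 27, 96)),
                   rng.standard_normal((13, 13, 256)),
                   rng.standard_normal((6, 6, 256))]
    descriptor = hypercube_pyramid_descriptor(fake_layers)
    print(descriptor.shape)  # one fixed-length vector per image

The point of the sketch is that the descriptor length depends only on the number of layers, channels, and pyramid cells, so images of different spatial resolutions map to vectors of the same size.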

Original language: English
Pages (from-to): 1005-1022
Number of pages: 18
Journal: Autonomous Robots
Early online date: 5 Jul 2018
DOIs: 10.1007/s10514-018-9776-8
Publication status: Published - Apr 2019


Cite this

@article{17224564a4d749e9a4a57addc8c4f268,
title = "Viewpoint invariant semantic object and scene categorization with RGB-D sensors",
abstract = "Understanding the semantics of objects and scenes using multi-modal RGB-D sensors serves many robotics applications. Key challenges for accurate RGB-D image recognition are the scarcity of training data, variations due to viewpoint changes, and the heterogeneous nature of the data. We address these problems and propose a generic deep learning framework based on a pre-trained convolutional neural network as a feature extractor for both the colour and depth channels. We propose a rich multi-scale feature representation, referred to as the convolutional hypercube pyramid (HP-CNN), that encodes discriminative information from the convolutional tensors at different levels of detail. We also present a technique to fuse the proposed HP-CNN with the activations of fully connected neurons using an extreme learning machine classifier in a late fusion scheme, which leads to a highly discriminative and compact representation. To further improve performance, we devise HP-CNN-T, a view-invariant descriptor extracted from a multi-view 3D object pose (M3DOP) model. M3DOP is learned from over 140,000 RGB-D images that are synthetically generated by rendering CAD models from different viewpoints. Extensive evaluations on four RGB-D object and scene recognition datasets demonstrate that our HP-CNN and HP-CNN-T consistently outperform state-of-the-art methods on several recognition tasks by a significant margin.",
keywords = "Multi-modal deep learning, Object categorization, RGB-D image, Scene recognition",
author = "Zaki, {Hasan F.M.} and Faisal Shafait and Ajmal Mian",
year = "2019",
month = apr,
doi = "10.1007/s10514-018-9776-8",
url = "http://www.scopus.com/inward/record.url?scp=85049555799&partnerID=8YFLogxK",
language = "English",
pages = "1005--1022",
journal = "Autonomous Robots",
issn = "0929-5593",
publisher = "Springer",

}
