Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Research output: Chapter in Book/Conference paper › Conference paper


Abstract

Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying the Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state-of-the-art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE-L metrics.
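
The pipeline summarised in the abstract (hierarchical Fourier encoding of CNN features, detector-derived semantic attributes, a compact projection, and a two-layer GRU decoder) can be sketched in code. The following is a minimal illustration under assumed settings, not the authors' published implementation: the hierarchy depth, feature dimensions, number of retained frequency coefficients, and the use of torch.fft.rfft as the Short Fourier Transform are all placeholder choices.

```python
# Minimal sketch of the abstract's encoder-decoder pipeline.
# All dimensions and hyperparameters below are illustrative assumptions,
# not the configuration reported in the paper.
import torch
import torch.nn as nn


def hierarchical_fourier_encoding(feats, levels=3, k=4):
    """Hierarchically apply a short Fourier transform over CNN features.

    feats: (T, D) tensor of per-frame CNN features for the whole video.
    At level l the clip is split into 2**l equal segments; each segment
    is transformed along time, the first k frequency magnitudes are
    kept, and everything is concatenated into one fixed-length vector.
    """
    T, D = feats.shape
    pieces = []
    for level in range(levels):
        n_seg = 2 ** level
        for s in range(n_seg):
            seg = feats[s * T // n_seg:(s + 1) * T // n_seg]  # (t, D)
            mags = torch.fft.rfft(seg, dim=0).abs()[:k]       # low frequencies
            if mags.shape[0] < k:                             # pad short segments
                mags = torch.cat([mags, mags.new_zeros(k - mags.shape[0], D)])
            pieces.append(mags.flatten())
    return torch.cat(pieces)


class CaptionDecoder(nn.Module):
    """Two-layer GRU language model conditioned on a compact visual code."""

    def __init__(self, visual_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.project = nn.Linear(visual_dim, hidden_dim)   # compact projection
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_code, tokens):
        h0 = torch.tanh(self.project(visual_code))         # (B, hidden)
        h0 = h0.unsqueeze(0).repeat(2, 1, 1)               # init both GRU layers
        hidden_states, _ = self.gru(self.embed(tokens), h0)
        return self.out(hidden_states)                     # per-step word logits


# Toy usage: 30 frames of 2048-d CNN features plus an 80-d vector of
# detector-derived semantic attributes (both random placeholders here).
cnn_feats = torch.randn(30, 2048)
semantic_attrs = torch.rand(80)                            # e.g. class scores
visual_code = torch.cat([hierarchical_fourier_encoding(cnn_feats),
                         semantic_attrs]).unsqueeze(0)     # batch of 1
decoder = CaptionDecoder(visual_dim=visual_code.shape[1], vocab_size=10000)
logits = decoder(visual_code, torch.randint(0, 10000, (1, 12)))
print(logits.shape)                                        # torch.Size([1, 12, 10000])
```

Keeping only a few low-frequency magnitudes per segment yields a fixed-length visual code regardless of clip length, which is what lets a whole video condition the GRU's initial hidden state in this sketch.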
Original language: English
Title of host publication: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Place of publication: USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 12487-12496
Number of pages: 10
Publication status: Published - 2019
Event: IEEE Conference on Computer Vision and Pattern Recognition 2019 - Long Beach Convention & Entertainment Center, Long Beach, United States
Duration: 16 Jun 2019 – 20 Jun 2019

Conference

Conference: IEEE Conference on Computer Vision and Pattern Recognition 2019
Abbreviated title: CVPR 2019
Country: United States
City: Long Beach
Period: 16/06/19 – 20/06/19

Cite this

Aafaq, N., Akhtar, N., Liu, W., Gilani, S. Z., & Mian, A. (2019). Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 12487-12496). USA: IEEE, Institute of Electrical and Electronics Engineers.
@inproceedings{802e36c83ff84c0f8f0d5d6dd9428573,
title = "Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning",
abstract = "Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying the Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state-of-the-art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE-L metrics.",
author = "Nayyer Aafaq and Naveed Akhtar and Wei Liu and Gilani, {Syed Zulqarnain} and Ajmal Mian",
year = "2019",
language = "English",
pages = "12487--12496",
booktitle = "The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
address = "United States",

}
