Abstract
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state of the art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE-L metrics.
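
The hierarchical temporal encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy, substitutes a plain discrete Fourier transform (`np.fft.rfft`) for the paper's Short Fourier Transform, and the function name `hierarchical_fourier_encoding` and the parameters `levels` and `num_coeffs` are hypothetical choices made only for illustration.

```python
import numpy as np


def hierarchical_fourier_encoding(features, levels=3, num_coeffs=4):
    """Encode a (T, D) sequence of per-frame features as one fixed-length vector.

    Hypothetical sketch: at level k the time axis is split into 2**k contiguous
    segments, and for each segment the magnitudes of the first `num_coeffs`
    DFT coefficients are kept for every feature dimension.
    """
    features = np.asarray(features, dtype=np.float64)
    if features.ndim != 2 or features.shape[0] < 2 ** (levels - 1):
        raise ValueError("expected a (T, D) array with T >= 2 ** (levels - 1)")

    parts = []
    for level in range(levels):
        # Level 0 covers the whole video, level 1 its halves, level 2 its quarters.
        for segment in np.array_split(features, 2 ** level, axis=0):
            # Zero-pad short segments so at least `num_coeffs` coefficients exist.
            n = max(segment.shape[0], 2 * num_coeffs)
            spectrum = np.fft.rfft(segment, n=n, axis=0)
            # Keep the low-frequency magnitudes, shape (num_coeffs, D), flattened.
            parts.append(np.abs(spectrum[:num_coeffs]).ravel())
    return np.concatenate(parts)
```

With the default settings, a `(T, D)` feature array maps to a vector of length `7 * 4 * D`, independent of the number of frames `T`, which is what allows videos of different lengths to share one representation size in this sketch.
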
Original language | English |
---|---|
Title of host publication | The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |
Place of Publication | USA |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 12479-12488 |
Number of pages | 10 |
ISBN (Electronic) | 9781728132938 |
DOIs | |
Publication status | Published - Jun 2019 |
Event | IEEE Conference on Computer Vision and Pattern Recognition 2019, Long Beach Convention & Entertainment Center, Long Beach, United States. Duration: 16 Jun 2019 → 20 Jun 2019. http://cvpr2019.thecvf.com/ |
Publication series
Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
---|---|
Volume | 2019-June |
ISSN (Print) | 1063-6919 |
Conference
Conference | IEEE Conference on Computer Vision and Pattern Recognition 2019 |
---|---|
Abbreviated title | CVPR 2019 |
Country/Territory | United States |
City | Long Beach |
Period | 16/06/19 → 20/06/19 |
Internet address |
Fingerprint
Dive into the research topics of 'Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning'. Together they form a unique fingerprint.
Projects
2 Finished
- Defense against adversarial attacks on deep learning in computer vision
  Mian, A. (Investigator 01)
  ARC Australian Research Council
  1/01/19 → 31/03/24
  Project: Research
- View and shape invariant modeling of human actions for smart surveillance
  Mian, A. (Investigator 01)
  ARC Australian Research Council
  1/01/16 → 31/12/19
  Project: Research