Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning

Md Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, Hamid Laga, Mohammed Bennamoun

Research output: Chapter in Book/Conference paper › Conference paper › peer-review

10 Citations (Scopus)

Abstract

In a typical image captioning pipeline, a Convolutional Neural Network (CNN) is used as the image encoder and a Long Short-Term Memory (LSTM) network as the language decoder. LSTMs with an attention mechanism have shown remarkable performance on sequential data, including image captioning, and can retain long-range dependencies in sequential data. However, the computations of an LSTM are hard to parallelize because of its inherently sequential nature. To address this issue, recent works have shown the benefits of self-attention, which is highly parallelizable and does not require temporal dependencies. However, existing techniques apply attention in only one direction to compute the context of the words. We propose an attention mechanism called Bi-directional Self-Attention (Bi-SAN) for image captioning. It computes attention in both the forward and backward directions and achieves performance comparable to state-of-the-art methods.
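
The abstract does not spell out the exact attention formulation, so the following is only a minimal sketch of the idea: scaled dot-product self-attention over a sequence of word features, applied once with a forward (causal) mask and once with a backward (anti-causal) mask, with the two contexts fused by summation. The shapes, the dot-product attention function, and the summation fusion are illustrative assumptions, not the authors' published method.

```python
# Minimal sketch of bi-directional self-attention (assumptions noted above).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x, mask):
    """x: (T, d) word features; mask: (T, T), 1 where attention is allowed."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)              # (T, T) pairwise similarity
    scores = np.where(mask > 0, scores, -1e9)  # block disallowed positions
    return softmax(scores, axis=-1) @ x        # (T, d) attended context

T, d = 5, 8
x = np.random.randn(T, d).astype(np.float32)

forward_mask = np.tril(np.ones((T, T)))   # attend to itself and earlier words
backward_mask = np.triu(np.ones((T, T)))  # attend to itself and later words

ctx_fwd = masked_self_attention(x, forward_mask)
ctx_bwd = masked_self_attention(x, backward_mask)
bi_context = ctx_fwd + ctx_bwd            # assumed fusion of the two directions
print(bi_context.shape)                   # (5, 8)
```

Because each masked attention pass is a single matrix product over the whole sequence, both directions can be computed in parallel, unlike the step-by-step recurrence of an LSTM.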

Original language: English
Title of host publication: 2019 Digital Image Computing
Subtitle of host publication: Techniques and Applications, DICTA 2019
Publisher: IEEE, Institute of Electrical and Electronics Engineers
ISBN (Electronic): 9781728138572
Publication status: Published - 1 Dec 2019
Event: 2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019 - Perth, Australia
Duration: 2 Dec 2019 - 4 Dec 2019

Publication series

Name: 2019 Digital Image Computing: Techniques and Applications, DICTA 2019

Conference

Conference: 2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019
Country/Territory: Australia
City: Perth
Period: 2/12/19 - 4/12/19
