Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization

Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, Dan Xu

Research output: Chapter in Book/Conference paper › Conference paper › peer-review

4 Citations (Web of Science)

Abstract

Weakly supervised dense object localization (WSDOL) generally relies on Class Activation Mapping (CAM), which exploits the correlation between the class weights of the image classifier and the pixel-level features. Due to its limited ability to address intra-class variations, the image classifier cannot properly associate the pixel features, leading to inaccurate dense localization maps. In this paper, we propose to explicitly construct multi-modal class representations by leveraging Contrastive Language-Image Pre-training (CLIP) to guide dense localization. More specifically, we propose a unified transformer framework to learn two modalities of class-specific tokens, i.e., class-specific visual and textual tokens. The former captures semantics from the target visual data, while the latter exploits class-related language priors from CLIP, providing complementary information to better perceive intra-class diversity. In addition, we propose to enrich the multi-modal class-specific tokens with sample-specific contexts comprising visual context and image-language context. This enables more adaptive class representation learning, which further facilitates dense localization. Extensive experiments show the superiority of the proposed method for WSDOL on two multi-label datasets, i.e., PASCAL VOC and MS COCO, and one single-label dataset, i.e., OpenImages. Our dense localization maps also lead to state-of-the-art weakly supervised semantic segmentation (WSSS) results on PASCAL VOC and MS COCO. Code: https://github.com/xulianuwa/MMCST
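As background for the abstract's premise, a CAM for class k is obtained by weighting the backbone's pixel-level feature channels with that class's classifier weights. The sketch below illustrates this with random NumPy arrays; the shapes (512 channels, a 14×14 feature map, 20 classes) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def class_activation_map(features, class_weights):
    """Compute CAMs by correlating classifier weights with pixel features.

    features:      (C, H, W) pixel-level features from the backbone
    class_weights: (K, C) image-classifier weights, one row per class
    returns:       (K, H, W) per-class activation maps, min-max normalized
    """
    # CAM_k(h, w) = sum_c class_weights[k, c] * features[c, h, w]
    cams = np.einsum('kc,chw->khw', class_weights, features)
    # min-max normalize each class map to [0, 1] for visualization
    cams -= cams.min(axis=(1, 2), keepdims=True)
    cams /= cams.max(axis=(1, 2), keepdims=True) + 1e-8
    return cams

rng = np.random.default_rng(0)
F = rng.standard_normal((512, 14, 14))   # hypothetical feature map
W = rng.standard_normal((20, 512))       # hypothetical classifier weights
cams = class_activation_map(F, W)
print(cams.shape)  # (20, 14, 14)
```

Because a single weight vector per class must match all of that class's pixels, it struggles with intra-class variation; this is the limitation the multi-modal class-specific tokens are designed to address.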
Original language: English
Title of host publication: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 19596-19605
ISBN (Electronic): 9798350301298
DOIs
Publication status: Published - 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver Convention Center, Vancouver, Canada
Duration: 18 Jun 2023 - 22 Jun 2023

Conference

Conference: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Abbreviated title: CVPR 2023
Country/Territory: Canada
City: Vancouver
Period: 18/06/23 - 22/06/23
