Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization

Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, Dan Xu

Research output: Chapter in Book/Conference paper › Conference paper › peer-review


Weakly supervised dense object localization (WSDOL) generally relies on Class Activation Mapping (CAM), which exploits the correlation between the class weights of the image classifier and the pixel-level features. Due to its limited ability to address intra-class variations, the image classifier cannot properly associate the pixel features, leading to inaccurate dense localization maps. In this paper, we propose to explicitly construct multi-modal class representations by leveraging Contrastive Language-Image Pre-training (CLIP) to guide dense localization. More specifically, we propose a unified transformer framework to learn two modalities of class-specific tokens, i.e., class-specific visual and textual tokens. The former capture semantics from the target visual data while the latter exploit the class-related language priors from CLIP, providing complementary information to better perceive intra-class diversity. In addition, we propose to enrich the multi-modal class-specific tokens with sample-specific contexts comprising visual context and image-language context. This enables more adaptive class representation learning, which further facilitates dense localization. Extensive experiments show the superiority of the proposed method for WSDOL on two multi-label datasets, i.e., PASCAL VOC and MS COCO, and one single-label dataset, i.e., OpenImages. Our dense localization maps also lead to state-of-the-art weakly supervised semantic segmentation (WSSS) results on PASCAL VOC and MS COCO (https://github.com/xulianuwa/MMCST).
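For readers unfamiliar with the CAM baseline that the abstract refers to, the following is a minimal sketch of plain Class Activation Mapping, not the method proposed in the paper. It assumes a classifier consisting of global average pooling followed by a linear layer, so that projecting the class weights onto each pixel's feature vector gives a per-class localization map. The function name `class_activation_maps`, the array shapes, and the ReLU plus per-class max normalization are illustrative assumptions.

```python
import numpy as np


def class_activation_maps(features, class_weights):
    """Sketch of plain CAM (assumed setup, not the paper's method).

    features:      (D, H, W) pixel-level features from the backbone.
    class_weights: (C, D) weights of the linear image classifier.
    Returns:       (C, H, W) maps, clipped at zero and scaled to [0, 1].
    """
    # Correlate every class weight vector with every pixel feature vector.
    cams = np.einsum("cd,dhw->chw", class_weights, features)
    # Keep only positive evidence for each class.
    cams = np.maximum(cams, 0.0)
    # Normalize each class map by its own peak; avoid division by zero.
    peak = cams.reshape(cams.shape[0], -1).max(axis=1)
    peak = np.where(peak > 0, peak, 1.0)
    return cams / peak[:, None, None]


# Toy usage with random values (hypothetical shapes: 3 classes, 8 channels).
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))
weights = rng.standard_normal((3, 8))
cams = class_activation_maps(features, weights)
```

Because each class is represented here by a single weight vector, objects whose appearance differs strongly within a class may be only partially activated, which is the limitation the abstract attributes to CAM.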
Original language: English
Title of host publication: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE, Institute of Electrical and Electronics Engineers
ISBN (Electronic): 9798350301298
Publication status: Published - 2023
Event: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023 - Vancouver Convention Center, Vancouver, Canada
Duration: 18 Jun 2023 to 22 Jun 2023


Conference: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023
Abbreviated title: CVPR 2023
