LanCOPE: Language-Guided Category-Level Object Pose Estimation from a Single RGB Image

Abstract
Monocular RGB-based category-level object pose estimation is more practical and cost-effective for robotics. However, existing methods do not fully exploit the rich semantic and contextual information in multimodal data (e.g., language), which provides additional object attributes that can guide a model in extracting category features more reliably. We propose a language-guided category-level object pose estimation method (LanCOPE) that takes a single RGB image as input. Our method uses DINOv2 to recover depth from the RGB image and converts it into a point cloud to perceive the object's geometry. We then introduce language descriptions of the RGB image, the estimated point cloud, and the overall scene to better guide the point cloud encoder and image encoder in learning category features. We develop a cross-modal differential perception feature fusion network to fuse the multimodal features. This network employs a differential perception module to eliminate redundant information across modalities, highlighting significant semantic differences and similarities. It further uses a cross-attention mechanism to fuse the semantic information of the language and vision features, improving overall perception. Finally, we design a denoising network based on a skip fusion transformer to recover the object pose accurately. Extensive experiments on the REAL275 and Wild6D datasets show that LanCOPE achieves state-of-the-art performance. Our code is available at LanCOPE.
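The cross-attention step described in the abstract, where language features act as keys/values that enrich vision features, can be illustrated with a minimal NumPy sketch. This is an illustrative single-head attention under assumed dimensions and weight shapes, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(vision_feats, lang_feats, d_k=64, seed=0):
    """Fuse language semantics into vision features:
    queries come from vision tokens, keys/values from language tokens.
    Weight matrices are random here purely for illustration."""
    rng = np.random.default_rng(seed)
    d_v, d_l = vision_feats.shape[-1], lang_feats.shape[-1]
    W_q = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_k = rng.standard_normal((d_l, d_k)) / np.sqrt(d_l)
    W_v = rng.standard_normal((d_l, d_v)) / np.sqrt(d_l)
    Q = vision_feats @ W_q                   # (N_v, d_k)
    K = lang_feats @ W_k                     # (N_l, d_k)
    V = lang_feats @ W_v                     # (N_l, d_v)
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (N_v, N_l) attention weights
    return vision_feats + attn @ V           # residual fusion, (N_v, d_v)

# Toy example: 5 vision tokens (dim 32), 7 language tokens (dim 16)
fused = cross_attention(np.ones((5, 32)), np.ones((7, 16)))
print(fused.shape)  # (5, 32)
```

The residual connection keeps the original vision features intact while adding language-conditioned context, a common pattern in cross-modal fusion blocks.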
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 7555-7562 |
| Number of pages | 8 |
| Journal | IEEE Robotics and Automation Letters |
| Volume | 10 |
| Issue number | 7 |
| Early online date | 6 Jun 2025 |
| DOIs | |
| Publication status | Published - Jul 2025 |