TY - GEN
T1 - 3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
AU - Feng, Mingtao
AU - Hou, Haoran
AU - Zhang, Liang
AU - Wu, Zijie
AU - Guo, Yulan
AU - Mian, Ajmal
N1 - Funding Information:
This work was supported in part by the National Natural Science Foundation of China under Grant 62003253, Grant 61973106, Grant U2013203, Grant U21A20482 and Grant U20A20185. Professor Ajmal Mian is the recipient of an Australian Research Council Future Fellowship Award (project number FT210100268) funded by the Australian Government.
Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - In-depth understanding of a 3D scene involves not only locating and recognizing individual objects but also inferring the relationships and interactions among them. However, since 3D scenes contain partially scanned objects with physical connections, dense placement, varying sizes, and a wide variety of challenging relationships, existing methods perform poorly when training samples are limited. In this work, we find that the inherently hierarchical structures of physical space in 3D scenes aid in the automatic association of semantic and spatial arrangements, specifying clear patterns and leading to less ambiguous predictions. Thus, they effectively address the challenges posed by the rich variations within scene categories. To achieve this, we explicitly unify these structural cues of 3D physical spaces into deep neural networks to facilitate scene graph prediction. Specifically, we exploit an external knowledge base as a basis for accumulating both contextualized visual content and textual facts to form a 3D spatial multimodal knowledge graph. Moreover, we propose a knowledge-enabled scene graph prediction module that benefits from the 3D spatial knowledge to effectively regularize the semantic space of relationships. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art competitors. Our code is available at https://github.com/HHrEtvP/SMKA.
AB - In-depth understanding of a 3D scene involves not only locating and recognizing individual objects but also inferring the relationships and interactions among them. However, since 3D scenes contain partially scanned objects with physical connections, dense placement, varying sizes, and a wide variety of challenging relationships, existing methods perform poorly when training samples are limited. In this work, we find that the inherently hierarchical structures of physical space in 3D scenes aid in the automatic association of semantic and spatial arrangements, specifying clear patterns and leading to less ambiguous predictions. Thus, they effectively address the challenges posed by the rich variations within scene categories. To achieve this, we explicitly unify these structural cues of 3D physical spaces into deep neural networks to facilitate scene graph prediction. Specifically, we exploit an external knowledge base as a basis for accumulating both contextualized visual content and textual facts to form a 3D spatial multimodal knowledge graph. Moreover, we propose a knowledge-enabled scene graph prediction module that benefits from the 3D spatial knowledge to effectively regularize the semantic space of relationships. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art competitors. Our code is available at https://github.com/HHrEtvP/SMKA.
KW - Scene analysis and understanding
UR - http://www.scopus.com/inward/record.url?scp=85173980356&partnerID=8YFLogxK
U2 - 10.1109/CVPR52729.2023.00886
DO - 10.1109/CVPR52729.2023.00886
M3 - Conference paper
AN - SCOPUS:85173980356
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 9182
EP - 9191
BT - Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
PB - IEEE, Institute of Electrical and Electronics Engineers
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Y2 - 18 June 2023 through 22 June 2023
ER -