Towards Unified Representation of Multi-Modal Pre-training for 3D Understanding via Differentiable Rendering
April 21, 2024 · Declared Dead · arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Ben Fei, Yixuan Li, Weidong Yang, Lipeng Ma, Ying He
arXiv ID
2404.13619
Category
cs.MM: Multimedia
Citations
2
Venue
arXiv.org
Last Checked
3 months ago
Abstract
State-of-the-art 3D models, which excel in recognition tasks, typically depend on large-scale datasets and well-defined category sets. Recent advances in multi-modal pre-training have demonstrated potential in learning 3D representations by aligning features from 3D shapes with their 2D RGB or depth counterparts. However, these existing frameworks often rely solely on either RGB or depth images, limiting their effectiveness in harnessing a comprehensive range of multi-modal data for 3D applications. To tackle this challenge, we present DR-Point, a tri-modal pre-training framework that learns a unified representation of RGB images, depth images, and 3D point clouds by pre-training with object triplets garnered from each modality. To address the scarcity of such triplets, DR-Point employs differentiable rendering to obtain various depth images. This approach not only augments the supply of depth images but also enhances the accuracy of reconstructed point clouds, thereby promoting the representative learning of the Transformer backbone. Subsequently, using a limited number of synthetically generated triplets, DR-Point effectively learns a 3D representation space that aligns seamlessly with the RGB-Depth image space. Our extensive experiments demonstrate that DR-Point outperforms existing self-supervised learning methods in a wide range of downstream tasks, including 3D object classification, part segmentation, point cloud completion, semantic segmentation, and detection. Additionally, our ablation studies validate the effectiveness of DR-Point in enhancing point cloud understanding.
Similar Papers
In the same crypt: Multimedia
Old Age
R.I.P. 👻 Ghosted
Viewport-Adaptive Navigable 360-Degree Video Delivery
The Cartographer
A Comprehensive Survey on Cross-modal Retrieval
The Cartographer
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
R.I.P. 👻 Ghosted
A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding
R.I.P. 👻 Ghosted
Video Generation From Text
Died the same way: 👻 Ghosted
R.I.P. 👻 Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P. 👻 Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P. 👻 Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P. 👻 Ghosted