ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction (AAAI-25)
Paper | Video | Project Page
This is the official implementation for ViPOcc:
ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction
Yi Feng, Yu Han, Xijing Zhang, Tanghui Li, Yanting Zhang, Rui Fan
Inferring the 3D structure of a scene from a single image is an ill-posed and challenging problem in the field of vision-centric autonomous driving. Existing methods usually employ neural radiance fields to produce voxelized 3D occupancy, lacking instance-level semantic reasoning and temporal photometric consistency. In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models (VFMs) for fine-grained 3D occupancy prediction. Unlike previous works that solely employ volume rendering for RGB and depth image reconstruction, we introduce a metric depth estimation branch, in which an inverse depth alignment module is proposed to bridge the domain gap in depth distribution between VFM predictions and the ground truth. The recovered metric depth is then utilized in temporal photometric alignment and spatial geometric alignment to ensure accurate and consistent 3D occupancy prediction. Additionally, we propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling, which addresses the redundant and imbalanced sampling issue that persists in previous state-of-the-art methods. Extensive experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks on the KITTI-360 and KITTI Raw datasets.
- We propose ViPOcc, a single-view 3D Occupancy prediction framework that incorporates Visual Priors from VFMs. ViPOcc takes stereo image pairs, rectified fisheye images, and camera projection matrices as input, and generates 2D depth estimations and 3D occupancy predictions simultaneously.
- The Inverse Depth Alignment module effectively aligns the VFM's depth priors with real-world depth distributions while preserving their local differential properties.
- The proposed semantic-guided non-overlapping Gaussian mixture (SNOG) sampler effectively guides the framework to focus on crucial instances and avoid overlapping patches during ray sampling.
- A temporal alignment loss and a reconstruction consistency loss are proposed to ensure inter-frame photometric consistency and intra-frame geometric reconstruction consistency, respectively.
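The inverse depth alignment idea can be illustrated with a common recipe for aligning relative depth priors to metric ground truth: a least-squares scale-and-shift fit in inverse-depth space, which leaves the prior's local differential structure intact. This is a minimal sketch under that assumption; the function name and exact formulation are illustrative, not the paper's implementation:

```python
import numpy as np

def align_inverse_depth(pseudo_inv_depth, sparse_gt_depth, mask):
    """Fit scale s and shift t so that s * pseudo_inv_depth + t matches the
    ground-truth inverse depth on valid pixels (least squares), then invert
    back to metric depth. Illustrative sketch only."""
    x = pseudo_inv_depth[mask]
    y = 1.0 / sparse_gt_depth[mask]          # ground-truth inverse depth
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    aligned_inv = s * pseudo_inv_depth + t
    return 1.0 / np.clip(aligned_inv, 1e-6, None)  # back to metric depth
```

Because the fit is affine in inverse-depth space, relative ordering and local gradients of the VFM prediction are preserved while the global scale and shift are corrected.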
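The instance-aware, non-overlapping behavior of the SNOG sampler can be sketched as rejection sampling from a Gaussian mixture placed on instance centers (e.g., derived from segmentation masks): candidate patch centers are drawn near instances, and a candidate is rejected if its patch would overlap an already-accepted one. All names, parameters, and details below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sample_patch_centers(instance_centers, sigma, num_patches, patch_size,
                         img_hw, rng=None, max_tries=1000):
    """Draw patch centers from a Gaussian mixture around instance centers,
    rejecting patches that overlap already-accepted ones (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = img_hw
    half = patch_size / 2
    accepted = []
    tries = 0
    while len(accepted) < num_patches and tries < max_tries:
        tries += 1
        cy, cx = instance_centers[rng.integers(len(instance_centers))]
        y, x = rng.normal([cy, cx], sigma)       # sample near an instance
        y = np.clip(y, half, H - half)           # keep patch inside image
        x = np.clip(x, half, W - half)
        # two axis-aligned patches of side p overlap iff |dy| < p and |dx| < p
        if all(abs(y - ay) >= patch_size or abs(x - ax) >= patch_size
               for ay, ax in accepted):
            accepted.append((y, x))
    return np.array(accepted)
```

Sampling patch centers instead of independent pixels keeps rays spatially coherent for patch-based reconstruction losses, and the rejection step removes the redundant, overlapping samples mentioned above.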
Follow detailed instructions in Installation.
Follow detailed instructions in Datasets Preparation.
The official weights of our model ViPOcc can be downloaded here.
The official weights of BTS can be downloaded here. Rename the file to `BTS-KITTI-360.pt`.
Checkpoints should be placed in the `checkpoints/` directory.
We provide a shell script to generate 3D occupancy predictions, predicted depth maps, rendered depth maps, pseudo depth maps, and BEV maps:
```shell
python eval.py -cn demo \
  'model_conf.backbone.use_depth_branch=true' \
  'model_conf.draw_occ=true' \
  'model_conf.draw_rendered_depth=true' \
  'model_conf.draw_pred_depth=true' \
  'model_conf.draw_pseudo_depth=true' \
  'model_conf.draw_bev=true' \
  'model_conf.save_dir="visualization/vipocc"'
```
We provide evaluation scripts to (1) reproduce the ViPOcc evaluation results for 3D occupancy prediction and depth estimation, (2) evaluate the performance of Depth Anything V2, and (3) reproduce the BTS evaluation results.
```shell
# ============== ViPOcc Evaluation ==============
# eval occ
python eval.py -cn eval_lidar_occ

# eval rendered depth
python eval.py -cn eval_depth_kitti360

# eval predicted depth
python eval.py -cn eval_depth_kitti360 \
  "model_conf.eval_rendered_depth=false"

# ============== BTS Official Evaluation ==============
# KITTI-360, occupancy prediction
python eval.py -cn eval_lidar_occ \
  "checkpoint=\"checkpoints/BTS-KITTI-360.pt\"" \
  'model_conf.z_range=[20, 4]' \
  'model_conf.backbone.use_depth_branch=false'

# KITTI-360, depth estimation
python eval.py -cn eval_depth_kitti360 \
  "checkpoint=\"checkpoints/BTS-KITTI-360.pt\""

# ============== Depth Anything V2 Evaluation ==============
# KITTI-360, pseudo depth estimation (median scaling)
python eval.py -cn eval_pseudo_depth

# (no scaling)
python eval.py -cn eval_pseudo_depth \
  'model_conf.depth_scaling=null'
```
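Median scaling, the first Depth Anything V2 evaluation mode above, is the standard trick in monocular depth evaluation of rescaling each prediction by the ratio of ground-truth to predicted medians before computing metrics. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def median_scale(pred_depth, gt_depth, mask):
    """Rescale a predicted depth map so its median matches the ground truth
    over valid pixels (standard monocular-depth evaluation practice)."""
    ratio = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
    return pred_depth * ratio
```

The "no scaling" variant (`model_conf.depth_scaling=null`) skips this step and evaluates the raw predictions directly, so it reflects absolute metric accuracy.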
We provide training scripts for ViPOcc. It can be trained on a single NVIDIA RTX 4090 GPU.
```shell
python train.py -cn exp_kitti_360
```
Our code is based on BTS and KYN. We sincerely appreciate their amazing works.
Many thanks to these excellent open-source projects: MonoDepth2, PixelNerf, Depth Anything V2, Grounded-SAM.