ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction (AAAI-25)
Paper | Video | Project Page
This is the official implementation for ViPOcc:
ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction
Yi Feng, Yu Han, Xijing Zhang, Tanghui Li, Yanting Zhang, Rui Fan
Inferring the 3D structure of a scene from a single image is an ill-posed and challenging problem in the field of vision-centric autonomous driving. Existing methods usually employ neural radiance fields to produce voxelized 3D occupancy, lacking instance-level semantic reasoning and temporal photometric consistency. In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models (VFMs) for fine-grained 3D occupancy prediction. Unlike previous works that solely employ volume rendering for RGB and depth image reconstruction, we introduce a metric depth estimation branch, in which an inverse depth alignment module is proposed to bridge the domain gap in depth distribution between VFM predictions and the ground truth. The recovered metric depth is then utilized in temporal photometric alignment and spatial geometric alignment to ensure accurate and consistent 3D occupancy prediction. Additionally, we propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling, which addresses the redundant and imbalanced sampling issue that persists in previous state-of-the-art methods. Extensive experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks on the KITTI-360 and KITTI Raw datasets.
- We propose ViPOcc, a single-view 3D Occupancy prediction framework that incorporates Visual Priors from VFMs. ViPOcc takes stereo image pairs, rectified fisheye images, and camera projection matrices as input, and generates 2D depth estimations and 3D occupancy predictions simultaneously.
- The Inverse Depth Alignment module effectively aligns the VFM's depth priors with real-world depth distributions while preserving their local differential properties.
- The proposed semantic-guided non-overlapping Gaussian mixture (SNOG) sampler effectively guides the framework to focus on crucial instances and avoid overlapping patches during ray sampling.
- A temporal alignment loss and a reconstruction consistency loss are proposed to ensure inter-frame photometric consistency and intra-frame geometric reconstruction consistency, respectively.
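The inverse depth alignment idea can be illustrated with a common recipe for aligning relative depth priors to metric ground truth: a least-squares scale-and-shift fit in inverse-depth space, which leaves the prior's local differential structure intact. This is a minimal sketch under that assumption; the function name and exact formulation are illustrative, not the paper's implementation:

```python
import numpy as np

def align_inverse_depth(pseudo_inv_depth, sparse_gt_depth, mask):
    """Fit scale s and shift t so that s * pseudo_inv_depth + t matches the
    ground-truth inverse depth on valid pixels (least squares), then invert
    back to metric depth. Illustrative sketch only."""
    x = pseudo_inv_depth[mask]
    y = 1.0 / sparse_gt_depth[mask]          # ground-truth inverse depth
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    aligned_inv = s * pseudo_inv_depth + t
    return 1.0 / np.clip(aligned_inv, 1e-6, None)  # back to metric depth
```

Because the fit is affine in inverse-depth space, relative ordering and local gradients of the VFM prediction are preserved while the global scale and shift are corrected.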
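The instance-aware, non-overlapping behavior of the SNOG sampler can be sketched as rejection sampling from a Gaussian mixture placed on instance centers (e.g., derived from segmentation masks): candidate patch centers are drawn near instances, and a candidate is rejected if its patch would overlap an already-accepted one. All names, parameters, and details below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sample_patch_centers(instance_centers, sigma, num_patches, patch_size,
                         img_hw, rng=None, max_tries=1000):
    """Draw patch centers from a Gaussian mixture around instance centers,
    rejecting patches that overlap already-accepted ones (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = img_hw
    half = patch_size / 2
    accepted = []
    tries = 0
    while len(accepted) < num_patches and tries < max_tries:
        tries += 1
        cy, cx = instance_centers[rng.integers(len(instance_centers))]
        y, x = rng.normal([cy, cx], sigma)       # sample near an instance
        y = np.clip(y, half, H - half)           # keep patch inside image
        x = np.clip(x, half, W - half)
        # two axis-aligned patches of side p overlap iff |dy| < p and |dx| < p
        if all(abs(y - ay) >= patch_size or abs(x - ax) >= patch_size
               for ay, ax in accepted):
            accepted.append((y, x))
    return np.array(accepted)
```

Sampling patch centers instead of independent pixels keeps rays spatially coherent for patch-based reconstruction losses, and the rejection step removes the redundant, overlapping samples mentioned above.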
Follow detailed instructions in Installation.
Follow detailed instructions in Datasets Preparation.
The official weights of our model ViPOcc can be downloaded here.
The official weights of BTS can be downloaded here. Rename the file to `BTS-KITTI-360.pt`.
Checkpoints should be placed in the `checkpoints/` directory.
We provide a shell script to generate 3D occupancy predictions, predicted depth maps, rendered depth maps, pseudo depth maps, and BEV maps:
```shell
python eval.py -cn demo \
  'model_conf.backbone.use_depth_branch=true' \
  'model_conf.draw_occ=true' \
  'model_conf.draw_rendered_depth=true' \
  'model_conf.draw_pred_depth=true' \
  'model_conf.draw_pseudo_depth=true' \
  'model_conf.draw_bev=true' \
  'model_conf.save_dir="visualization/vipocc"'
```
We provide evaluation scripts to (1) reproduce the ViPOcc evaluation results for 3D occupancy prediction and depth estimation, (2) evaluate the performance of Depth Anything V2, and (3) reproduce the BTS evaluation results.
```shell
# ============== ViPOcc Evaluation ==============
# eval occ
python eval.py -cn eval_lidar_occ

# eval rendered depth
python eval.py -cn eval_depth_kitti360

# eval predicted depth
python eval.py -cn eval_depth_kitti360 \
  "model_conf.eval_rendered_depth=false"

# ============== BTS Official Evaluation ==============
# KITTI-360, occupancy prediction
python eval.py -cn eval_lidar_occ \
  "checkpoint=\"checkpoints/BTS-KITTI-360.pt\"" \
  'model_conf.z_range=[20, 4]' \
  'model_conf.backbone.use_depth_branch=false'

# KITTI-360, depth estimation
python eval.py -cn eval_depth_kitti360 \
  "checkpoint=\"checkpoints/BTS-KITTI-360.pt\""

# ============== Depth Anything V2 Evaluation ==============
# KITTI-360, pseudo depth estimation (median scaling)
python eval.py -cn eval_pseudo_depth

# (no scaling)
python eval.py -cn eval_pseudo_depth \
  'model_conf.depth_scaling=null'
```
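Median scaling, the first Depth Anything V2 evaluation mode above, is the standard trick in monocular depth evaluation of rescaling each prediction by the ratio of ground-truth to predicted medians before computing metrics. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def median_scale(pred_depth, gt_depth, mask):
    """Rescale a predicted depth map so its median matches the ground truth
    over valid pixels (standard monocular-depth evaluation practice)."""
    ratio = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
    return pred_depth * ratio
```

The "no scaling" variant (`model_conf.depth_scaling=null`) skips this step and evaluates the raw predictions directly, so it reflects absolute metric accuracy.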
We provide training scripts for ViPOcc. It can be trained on a single NVIDIA RTX 4090 GPU.
```shell
python train.py -cn exp_kitti_360
```
Our code is based on BTS and KYN. We sincerely appreciate their amazing works.
Many thanks to these excellent open-source projects: MonoDepth2, PixelNerf, Depth Anything V2, Grounded-SAM.