EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler

1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2School of Electronic Science and Engineering, Nanjing University; 3Department of Electrical Engineering and Computer Sciences, University of California, Berkeley;
Equal Contribution   ✉ Corresponding Author  


Highlights

  • We propose a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, adaptively enforcing strong constraints only when both curvature and depth cues indicate planar regions.

  • We propose a Semantic-aware Uncertainty Sampler (SUS) that adaptively selects and updates low-confidence Gaussians in overlapping regions between consecutive frames to mitigate redundancy in memory updates.

  • Our method achieves state-of-the-art (SOTA) performance on the EmbodiedOcc-ScanNet benchmark across various indoor occupancy prediction settings.

Motivation

Illustration of different Gaussian updates. Left: Previous approach with random position updates lacking geometric constraints. Right: Our approach with surface normal-guided updates that leverage planar priors for more accurate indoor scene representation.

Abstract

Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily dominated by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while remaining flexible in complex regions. To effectively improve geometric consistency across different views, SUS adaptively selects appropriate Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception.

Method Overview

Overview of EmbodiedOcc++. Given monocular RGB inputs, our EmbodiedOcc++ boosts indoor 3D occupancy prediction with plane regularization. To capture intricate geometric details, we design a Geometry-guided Refinement Module that constrains Gaussian updates based on surface priors. To ensure accurate geometric regularization, we incorporate an adaptive constraint fusion strategy that takes multimodal priors into account. For robust and efficient Gaussian refinement, we introduce a Semantic-aware Uncertainty Sampler that actively selects which Gaussians to use for memory updates. Our approach is tailored for embodied scene understanding, preserving planar structures and sharp boundaries during progressive indoor exploration.
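The Semantic-aware Uncertainty Sampler can be illustrated with a minimal sketch. The snippet below is an assumption about one plausible realization (the paper does not publish this exact formula here): per-Gaussian uncertainty is measured as the entropy of the predicted semantic distribution, and only the most uncertain Gaussians inside the overlap between consecutive frames are selected for memory updates. The function names `semantic_entropy` and `select_uncertain` are hypothetical.

```python
import numpy as np

def semantic_entropy(logits):
    # Per-Gaussian uncertainty as the entropy of the softmax over
    # semantic class logits (higher entropy = lower confidence).
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def select_uncertain(logits, in_overlap, top_k):
    # Among Gaussians lying in the overlap region between consecutive
    # frames, pick the top_k with the highest semantic entropy; confident
    # Gaussians are left untouched, reducing redundant memory updates.
    ent = semantic_entropy(logits)
    ent = np.where(in_overlap, ent, -np.inf)  # mask out non-overlap Gaussians
    return np.argsort(-ent)[:top_k]
```

A uniform class distribution (maximal entropy) is always ranked ahead of a confidently classified Gaussian, so repeated observations of the same region concentrate refinement where the current prediction is least reliable.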

Geometry-guided Refinement Module

Geometry-guided Refinement Module. Given surface normals estimated from the input view, GRM regularizes each Gaussian's position update to align with the local surface normal. An adaptive regularization weight, fused from curvature-based and depth-based constraints, enforces strong planar alignment on flat surfaces while relaxing the constraint in geometrically complex regions.
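A minimal sketch of the plane-regularized update, under stated assumptions: the raw position update is blended with its projection onto the surface normal, and the blending weight grows when both curvature and the local depth gradient indicate a planar region. The exponential fusion in `adaptive_weight` and the gain `k` are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def adaptive_weight(curvature, depth_grad, k=10.0):
    # Hypothetical constraint fusion: the plane constraint is strong
    # (w -> 1) only when BOTH cues indicate a flat, smooth surface.
    return np.exp(-k * curvature) * np.exp(-k * depth_grad)

def regularize_update(delta, normal, w):
    # Project the raw position update onto the surface normal and blend
    # with the unconstrained update by the adaptive weight w in [0, 1].
    normal = normal / np.linalg.norm(normal)
    along_normal = np.dot(delta, normal) * normal
    return w * along_normal + (1.0 - w) * delta
```

With `w = 1` the Gaussian can only move perpendicular to the plane (snapping it onto the surface), while `w = 0` recovers the original unconstrained update, matching the adaptive behavior described above.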

Visualization of our local occupancy prediction

The images show how our model effectively constructs occupancy estimates from single monocular views, preserving spatial relationships within the camera frustum.

Visualization of our embodied occupancy prediction

This figure demonstrates the online integration of sequential visual inputs, showing how our model progressively refines its spatial understanding as new observations become available. The temporal sequence illustrates the accumulation of occupancy information across multiple frames.

BibTeX

@article{wang2025embodiedocc++,
    title={EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler},
    author={Wang, Hao and Wei, Xiaobao and Zhang, Xiaoan and Li, Jianing and Bai, Chengyu and Li, Ying and Lu, Ming and Zheng, Wenzhao and Zhang, Shanghang},
    journal={arXiv preprint arXiv:2504.09540},
    year={2025}
}