OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

Accepted to ECCV 2024

1Georgia Institute of Technology, 2Google Research*
*Now at Google DeepMind
OmniNOCS dataset

OmniNOCS is a unified NOCS (Normalized Object Coordinate Space) dataset that spans multiple indoor and outdoor domains and covers 90+ object classes. It is the largest NOCS dataset to date.
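A NOCS map assigns each object pixel a 3D coordinate in the object's own normalized frame. As a rough sketch of the usual convention (an assumption here, not a description of the OmniNOCS pipeline): the object is centered and scaled so its tight bounding-box diagonal has length 1, placing all coordinates inside the unit cube [0, 1]^3.

```python
import numpy as np

def to_nocs(points):
    """Map object points (N, 3) into Normalized Object Coordinate Space.

    Convention assumed: center the object at the origin, scale so the
    tight bounding-box diagonal is 1, then shift into [0, 1]^3.
    """
    mins, maxs = points.min(axis=0), points.max(axis=0)
    center = (mins + maxs) / 2.0
    diag = np.linalg.norm(maxs - mins)  # bounding-box diagonal length
    return (points - center) / diag + 0.5

# Example: two corners of an object-aligned box.
pts = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 1.0]])
nocs = to_nocs(pts)  # all coordinates lie in [0, 1]
```

Under this convention, the NOCS bounding box always has unit diagonal, which is what makes coordinates comparable across object instances of the same class.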

Predictions on web images

We use OmniNOCS to train NOCSformer, a transformer-based model that can predict NOCS and 6DoF poses from 2D object boxes. NOCSformer does not contain any class-specific parameters and generalizes to in-the-wild images from internet collections.

Abstract

We propose OmniNOCS, a large-scale monocular dataset with 3D Normalized Object Coordinate Space (NOCS) maps, object masks, and 3D bounding box annotations for indoor and outdoor scenes. OmniNOCS has 20 times more object classes and 200 times more instances than existing NOCS datasets (NOCS-Real275, Wild6D).

We use OmniNOCS to train a novel, transformer-based monocular NOCS prediction model "NOCSformer" that can predict accurate NOCS, instance masks and poses from 2D object detections across diverse classes. It is the first NOCS model that can generalize to a broad range of classes when prompted with 2D boxes. We evaluate our model on the task of 3D oriented bounding box prediction, where it achieves comparable results to state-of-the-art 3D detection methods such as CubeRCNN. Unlike other 3D detection methods, our model also provides detailed and accurate 3D object shape and segmentation. We propose a novel benchmark for the task of NOCS prediction based on OmniNOCS, which we hope will serve as a useful baseline for future work in this area.

NOCSformer Model

NOCSformer model architecture

Our NOCSformer architecture uses a DINOv2 backbone to extract features from the input image, pools them using the input 2D object boxes, and feeds the per-object features to the NOCS and size heads. The self-attention-based NOCS head jointly predicts the NOCS map and instance mask for each RoI. Our learned PnP head for pose estimation uses the predicted NOCS and instance mask to predict the projected 3D centroid and 3D orientation of the object.
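The per-object pooling step can be illustrated with a minimal numpy sketch. This is an assumption-laden simplification, not the paper's implementation: it crops the backbone feature map with the (stride-scaled) 2D box and average-pools it into one feature vector per object.

```python
import numpy as np

def pool_roi_features(feat, box, stride=14):
    """Average-pool a feature map over a 2D box.

    feat:   (H, W, C) backbone feature map (e.g. from a ViT backbone,
            whose patch size determines `stride`; values here are illustrative).
    box:    (x0, y0, x1, y1) in input-image pixel coordinates.
    Returns a (C,) feature vector for the object.
    """
    x0, y0, x1, y1 = [int(round(v / stride)) for v in box]
    # Guard against boxes smaller than one feature cell.
    roi = feat[y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)]
    return roi.mean(axis=(0, 1))

# Example: a 16x16 feature map with 8 channels, one object box.
feat = np.ones((16, 16, 8))
vec = pool_roi_features(feat, (0, 0, 56, 56))  # shape (8,)
```

In practice models typically use RoIAlign-style interpolation rather than a hard crop, but the idea is the same: the 2D box prompt selects which image features each object head sees.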

NOCS predictions across OmniNOCS

NOCSformer generalizes to the wide range of object classes and domains in OmniNOCS, including indoor and outdoor scenes, as well as object-centric images.

The per-frame predictions from NOCSformer on the Objectron video sequences below show that our model is temporally consistent.

NOCS predictions on COCO objects

NOCSformer can generalize to in-the-wild objects in COCO images when trained on OmniNOCS.

Related Work

Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild introduced a large-scale cross-domain 3D object detection dataset that inspired our work.

Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation introduced the concept of NOCS and its applications in 6D pose estimation.
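To see why NOCS predictions enable 6D pose and size estimation: given NOCS coordinates and corresponding 3D points in the camera frame (e.g. back-projected depth), a closed-form similarity fit (the Umeyama alignment) recovers scale, rotation, and translation. A minimal numpy sketch of that classical solve, which is not the same as NOCSformer's learned PnP head:

```python
import numpy as np

def similarity_from_nocs(nocs, cam_pts):
    """Umeyama alignment: find s, R, t with cam_pts ~= s * R @ nocs + t.

    nocs:    (N, 3) predicted NOCS coordinates.
    cam_pts: (N, 3) corresponding points in the camera frame.
    """
    mu_n, mu_c = nocs.mean(axis=0), cam_pts.mean(axis=0)
    n, c = nocs - mu_n, cam_pts - mu_c
    # SVD of the cross-covariance gives the optimal rotation.
    U, S, Vt = np.linalg.svd(c.T @ n)
    d = np.sign(np.linalg.det(U @ Vt))  # reflection guard
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (n ** 2).sum()
    t = mu_c - s * R @ mu_n
    return s, R, t
```

The recovered scale also gives the metric object size, since NOCS coordinates are normalized to a unit cube.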