By Mosam Dabhi

From Pixels to Physical Intelligence

3D Data Generation at Internet Scale

Read Proposal · Explore Components

Overview

A self-bootstrapping pipeline for converting raw pixels into large-scale 3D and 4D spatial understanding

Modern AI won't achieve physical intelligence until it can extract rich, semantic spatial knowledge from the wild ocean of internet video—not just curated motion-capture datasets or expensive 3D scans. This thesis proposes a self-bootstrapping pipeline for converting raw pixels into large-scale 3D and 4D spatial understanding.

It begins with multi-view bootstrapping: from just two handheld videos and 2D keypoint annotations on only 1-2% of frames, it produces dense 2D and 3D keypoints, requiring no camera calibration and no 3D ground truth. This sets the stage for geometry-only supervision at scale.
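As a concrete illustration of the bootstrapping idea, the sketch below shows one plausible acceptance test for promoting a detector's 2D predictions to pseudo-labels: reproject the candidate 3D reconstruction and keep the frame only if it agrees with the 2D detections. The scaled-orthographic camera model, function names, and threshold here are illustrative assumptions, not the exact MBW procedure.

```python
# Illustrative quality gate for the bootstrap loop (assumed, not the exact MBW
# code): a frame's detector output becomes a pseudo-label only if the lifted 3D
# keypoints reproject close to it under a scaled-orthographic camera.
import numpy as np

def reprojection_error(X, R, t, s, kp2d):
    """Mean pixel error of 3D points X (N, 3) reprojected with rotation R (3, 3),
    2D offset t (2,), and scale s against detected 2D keypoints kp2d (N, 2)."""
    proj = s * (X @ R.T)[:, :2] + t          # scaled-orthographic projection
    return np.linalg.norm(proj - kp2d, axis=1).mean()

def accept_frame(X, R, t, s, kp2d, thresh_px=5.0):
    """Promote this frame's 2D detections to pseudo-labels if geometrically consistent."""
    return reprojection_error(X, R, t, s, kp2d) < thresh_px
```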

The 3D lifting foundation model is a category-agnostic transformer with zero-shot generalization. Trained on our bootstrapped data, it lifts 2D observations to 3D consistently across diverse object categories, from humans to vehicles.

For unsupervised 2D→3D lifting, label-free MLP-Mixers outperform attention-based architectures thanks to their data efficiency and inductive biases aligned with geometric constraints. This unlocks large-scale 3D generation without human annotation.
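For readers unfamiliar with the architecture, the snippet below is a minimal MLP-Mixer block of the kind this comparison refers to: one MLP mixes information across keypoints, another across feature channels, with no attention. The dimensions and class name are illustrative assumptions, not the published model.

```python
# Minimal MLP-Mixer block for keypoint features (an illustrative sketch, not the
# paper's exact architecture): token mixing operates across keypoints, channel
# mixing across feature dimensions, with residual connections and no attention.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, n_kp=17, d=64):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.token_mlp = nn.Sequential(       # mixes across the n_kp keypoints
            nn.Linear(n_kp, 2 * n_kp), nn.GELU(), nn.Linear(2 * n_kp, n_kp))
        self.channel_mlp = nn.Sequential(     # mixes across the d feature channels
            nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

    def forward(self, x):                     # x: (batch, n_kp, d)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))
```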

Finally, template-free 4D rigging enables motion retargeting and novel pose synthesis without predefined skeletal structures. This opens the door to internet-scale 4D understanding across unlimited object categories.

Key Components

A multi-stage pipeline for creating physical intelligence from raw pixels

1. Multi-view Bootstrapping: From handheld videos to 3D keypoints
2. 3D Lifting: Category-agnostic 3D transformer model
3. Label-free Mixers: Unsupervised 2D→3D lifting
4. 4D Rigging: Template-free articulated animation

Multi-view Bootstrapping in the Wild

Using just two handheld videos and minimal keypoint annotations to generate dense 3D data, without requiring camera calibration or template models.

Learn more

3D Lifting Foundation Model

A transformer-based model that can lift 2D observations into 3D across multiple object categories, with zero-shot generalization capabilities.

Learn more

Mixed, Not Attended

Why MLP-Mixers outperform transformers for unsupervised 3D lifting, providing robust performance without 3D annotations.

Learn more

RAT4D

Rig and Animate any object without Templates in 4D, enabling reanimation of articulated objects from casual captures.

Learn more

Interactive Exploration

Explore each component of the research in detail

Multi-view Bootstrapping in the Wild

Multi-view Bootstrapping in the Wild (MBW) generates high-fidelity 3D keypoints from just two handheld videos and minimal 2D annotations

Minimal Annotation

Only 1-2% of frames need to be annotated, drastically reducing labeling costs for 3D data collection.

No Calibration Required

Works with uncalibrated handheld cameras, enabling in-the-wild data collection with minimal equipment.

Multi-view NRSfM

Leverages multi-view non-rigid structure from motion to create accurate 3D reconstructions from sparse views.
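To make the NRSfM component concrete, the snippet below sketches the low-rank model it builds on: 2D tracks across frames are explained as camera motion applied to a small number of 3D basis shapes, so the stacked measurement matrix has bounded rank and can be factorized. This is a simplified, textbook-style illustration under an orthographic assumption, not the solver used in MBW, which must also resolve the remaining ambiguities.

```python
# Simplified low-rank factorization behind non-rigid structure from motion (an
# illustrative sketch, not the MBW solver): with K basis shapes, the centered
# 2F x N track matrix has rank at most 3K, so SVD truncation separates a motion
# factor from a shape-basis factor (up to a corrective transform).
import numpy as np

def factorize_tracks(W, K=2):
    """W: (2F, N) stacked 2D tracks of N keypoints over F frames."""
    W = W - W.mean(axis=1, keepdims=True)            # remove per-frame translation
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = 3 * K                                        # rank bound for K basis shapes
    motion = U[:, :r] * S[:r]                        # per-frame camera x coefficients
    basis = Vt[:r]                                   # flattened 3D shape basis
    return motion, basis
```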

3D Lifting Foundation Model

3D-LFM: A category-agnostic transformer model that lifts 2D keypoints into 3D across multiple object categories

Cross-category Learning

Trains on diverse object categories simultaneously, from humans and animals to vehicles and household objects.

Permutation Equivariance

Processes keypoints in any order, so diverse object categories can be handled without fixed rigging structures (see the sketch below).

Zero-shot Generalization

Transfers learned 3D priors to unseen object categories without additional training.
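The sketch below makes the permutation-equivariance property concrete: each 2D keypoint is a token, self-attention mixes tokens without positional encodings, and a per-token head regresses 3D coordinates, so the network accepts any number and ordering of keypoints. Layer sizes and class names are illustrative assumptions, not the published 3D-LFM architecture.

```python
# Illustrative permutation-equivariant 2D->3D lifter (an assumed sketch in the
# spirit of 3D-LFM, not the published model): tokens are keypoints and there are
# no positional encodings, so reordering the inputs simply reorders the outputs.
import torch
import torch.nn as nn

class Lifter(nn.Module):
    def __init__(self, d_model=128, nhead=4, depth=4):
        super().__init__()
        self.embed = nn.Linear(2, d_model)                    # one token per 2D keypoint
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, 3)                     # per-token 3D output

    def forward(self, kp2d):                                  # kp2d: (batch, N, 2), N may vary
        return self.head(self.encoder(self.embed(kp2d)))

lifter = Lifter()
print(lifter(torch.randn(1, 17, 2)).shape)                    # torch.Size([1, 17, 3])
```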

Label-free Mixers

MLP-Mixers outperform transformers for unsupervised 3D lifting due to inductive biases aligned with geometric constraints

Unsupervised Learning

Trains without 3D ground truth, using only 2D keypoint correspondences across multiple views.

Data Efficiency

Requires significantly less training data than attention-based architectures for comparable performance.

Geometric Inductive Bias

MLP-Mixer architecture naturally encodes structural relationships that align with 3D geometric constraints.
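The label-free training described above can be summarized by a multi-view consistency objective of the following form; the exact losses, camera handling, and scale normalization in the paper may differ, so treat this as an assumed sketch.

```python
# Assumed sketch of a label-free multi-view objective (the actual training
# losses may differ): lift 2D keypoints from view A to 3D, rotate the shape
# into view B, and penalize disagreement with view B's 2D keypoints. Only 2D
# correspondences supervise the network; no 3D ground truth is used.
import torch

def multiview_consistency_loss(lifter, kp2d_a, kp2d_b, R_ab):
    """kp2d_a, kp2d_b: (B, N, 2) matched keypoints in two views;
    R_ab: (B, 3, 3) relative rotation from view A to view B."""
    X_a = lifter(kp2d_a)                          # (B, N, 3) lifted from view A
    X_b = torch.bmm(X_a, R_ab.transpose(1, 2))    # rotate the shape into view B
    proj_b = X_b[..., :2]                         # orthographic projection in view B
    return torch.mean((proj_b - kp2d_b) ** 2)     # supervised by 2D alone
```

Any 2D→3D lifter, such as the Mixer block shown earlier, can be plugged in for `lifter`.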

RAT4D: Rig and Animate any object without Templates in 4D

Template-free 4D rigging enables motion retargeting and novel pose synthesis across arbitrary object categories

Template-free Rigging

Automatically discovers articulation structures without requiring category-specific templates or skeletons.

Motion Retargeting

Transfers motion between instances of the same or different categories while preserving semantic correspondence.

Novel Pose Synthesis

Generates physically plausible new poses by learning the manifold of possible articulations from examples.
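As a concrete picture of what a discovered rig enables, the sketch below uses linear blend skinning: once per-point skinning weights and per-part rigid transforms are recovered, posing the object is a weighted blend of part motions applied to the rest shape, and retargeting amounts to driving one instance's rig with transforms estimated from another sequence. The deformation model and names here are illustrative assumptions; RAT4D's actual formulation may differ.

```python
# Illustrative linear blend skinning step (an assumed stand-in for RAT4D's
# deformation model): blend per-part rigid transforms with soft skinning
# weights to pose the rest shape; swapping in transforms from another sequence
# retargets that motion onto this instance.
import numpy as np

def pose_points(rest_pts, weights, rotations, translations):
    """rest_pts: (N, 3) rest-pose points; weights: (N, P) soft part assignments;
    rotations: (P, 3, 3) and translations: (P, 3) per-part rigid transforms."""
    per_part = np.einsum('pij,nj->pni', rotations, rest_pts) + translations[:, None, :]
    return np.einsum('np,pni->ni', weights, per_part)   # (N, 3) posed points
```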

Publications

Research papers supporting this thesis proposal

Multi-view Bootstrapping in the Wild

We present a method for generating dense 3D keypoints from sparse 2D annotations using only two uncalibrated video sequences, requiring minimal human labeling effort.

3D-LFM: Category-Agnostic 3D Lifting Foundation Models

A transformer-based architecture for 3D lifting that generalizes across diverse object categories with zero-shot capabilities.

Mixed, Not Attended: Efficient 3D Lifting with MLP-Mixers

Demonstrates why MLP-Mixers outperform attention-based models for unsupervised 3D reconstruction tasks.

RAT4D: Rig Any Thing in 4D without Templates

A framework for template-free 4D rigging that enables articulated animation across arbitrary object categories.

Ready to Explore More?

Download the full thesis proposal for in-depth methodology, experiments, and future research directions.

Download PDF