3D Data Generation at Internet Scale
A self-bootstrapping pipeline for converting raw pixels into large-scale 3D and 4D spatial understanding
Modern AI won't achieve physical intelligence until it can extract rich, semantic spatial knowledge from the wild ocean of internet video—not just curated motion-capture datasets or expensive 3D scans. This thesis proposes a self-bootstrapping pipeline for converting raw pixels into large-scale 3D and 4D spatial understanding.
It begins with multi-view bootstrapping: using just two handheld videos and 2D keypoint annotations on roughly 1-2% of frames to produce dense 2D and 3D keypoint labels, with no camera calibration and no 3D ground truth required. This sets the stage for geometry-only supervision at scale.
The 3D Lifting Foundation Model is then a category-agnostic transformer with zero-shot generalization. Trained on our bootstrapped data, it lifts 2D keypoints into consistent 3D across diverse object categories, from humans to vehicles.
For unsupervised 3D generation, label-free mixers outperform attention-based architectures at 2D→3D lifting, owing to their data efficiency and to inductive biases aligned with geometric constraints. This unlocks large-scale 3D generation without human annotation.
Finally, template-free 4D rigging enables motion retargeting and novel pose synthesis without predefined skeletal structures. This opens the door to internet-scale 4D understanding across unlimited object categories.
A multi-stage pipeline for creating physical intelligence from raw pixels
Using just two handheld videos and minimal keypoint annotations to generate dense 3D data, without requiring camera calibration or template models.
A transformer-based model that lifts 2D observations into 3D across multiple object categories, with zero-shot generalization capabilities.
Why MLP-Mixers outperform transformers for unsupervised 3D lifting, providing robust performance without 3D annotations.
Rigging and animating any object in 4D without templates, enabling reanimation of articulated objects from casual captures.
Explore each component of the research in detail
MBW generates high-fidelity 3D keypoints from just two handheld videos and minimal 2D annotations
Only 1-2% of frames need to be annotated, drastically reducing labeling costs for 3D data collection.
Works with uncalibrated handheld cameras, enabling in-the-wild data collection with minimal equipment.
Leverages multi-view non-rigid structure from motion to create accurate 3D reconstructions from sparse views.
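To make the multi-view consistency idea concrete, here is a minimal, self-contained Python sketch of the acceptance test at the heart of such a bootstrapping loop: triangulate a keypoint from two views and keep the pseudo-label only if it reprojects consistently in both. The camera matrices here are synthetic stand-ins; MBW itself recovers geometry from uncalibrated footage, which this sketch does not attempt.

```python
# A minimal sketch of the multi-view consistency check that drives a
# bootstrapping loop: triangulate a detection seen in two views, then
# accept it as a pseudo-label only if it reprojects within a pixel
# threshold. Cameras are synthetic stand-ins, not recovered from video.
import numpy as np

def triangulate_dlt(x1, x2, P1, P2):
    """Linear (DLT) triangulation of one point from two 3x4 projections."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def reproject(X, P):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

# Synthetic two-camera setup: identity camera and a translated camera.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 4.0])
x1, x2 = reproject(X_true, P1), reproject(X_true, P2)

X_hat = triangulate_dlt(x1, x2, P1, P2)
err = max(np.linalg.norm(reproject(X_hat, P) - x)
          for P, x in [(P1, x1), (P2, x2)])
accept = err < 2.0  # keep the pseudo-label only if the views agree
print(X_hat, err, accept)
```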
3D-LFM: A category-agnostic transformer model that lifts 2D keypoints into 3D across multiple object categories
Trains on diverse object categories simultaneously, from humans and animals to vehicles and household objects.
Processes keypoints in any order, enabling handling of diverse object categories without fixed rigging structures.
Transfers learned 3D priors to unseen object categories without additional training.
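The design point that makes category-agnostic lifting possible is permutation equivariance: with one token per keypoint and no positional encoding, the same weights handle any keypoint count and any ordering. The PyTorch sketch below illustrates that idea; the layer sizes and the plain encoder-plus-head layout are illustrative assumptions, not the released 3D-LFM architecture.

```python
# A hedged sketch of category-agnostic 2D->3D lifting: one token per 2D
# keypoint, a transformer encoder with no positional encoding (so the
# model is equivariant to keypoint order and count), and a per-token
# head that regresses 3D. Sizes are illustrative, not 3D-LFM's.
import torch
import torch.nn as nn

class Lifter2Dto3D(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(2, d_model)          # (x, y) -> token
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)           # token -> (X, Y, Z)

    def forward(self, kp2d, pad_mask=None):
        # kp2d: (batch, n_keypoints, 2); n_keypoints may differ per
        # category, with padding masked out via pad_mask.
        tokens = self.encoder(self.embed(kp2d), src_key_padding_mask=pad_mask)
        return self.head(tokens)                    # (batch, n_keypoints, 3)

model = Lifter2Dto3D()
kp2d = torch.randn(2, 17, 2)                        # e.g. a 17-joint skeleton
print(model(kp2d).shape)                            # torch.Size([2, 17, 3])
```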
MLP-Mixers outperform transformers for unsupervised 3D lifting due to inductive biases aligned with geometric constraints
Trains without 3D ground truth, using only 2D keypoint correspondences across multiple views.
Requires significantly less training data than attention-based architectures for comparable performance.
MLP-Mixer architecture naturally encodes structural relationships that align with 3D geometric constraints.
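Below is a hedged PyTorch sketch of the mixer idea: token-mixing MLPs across a fixed set of keypoints plus channel-mixing MLPs, trained with a purely 2D, cross-view reprojection loss so no 3D ground truth ever enters training. The orthographic camera, the synthetic relative rotation, and the layer sizes are assumptions for illustration, not the paper's exact recipe.

```python
# A minimal label-free mixer for 2D->3D lifting: MLP-Mixer blocks over a
# fixed set of keypoint tokens, supervised only by cross-view 2D
# reprojection. Camera model and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, n_tokens, d_model, expansion=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.token_mlp = nn.Sequential(              # mixes across keypoints
            nn.Linear(n_tokens, expansion * n_tokens), nn.GELU(),
            nn.Linear(expansion * n_tokens, n_tokens))
        self.norm2 = nn.LayerNorm(d_model)
        self.channel_mlp = nn.Sequential(            # mixes across channels
            nn.Linear(d_model, expansion * d_model), nn.GELU(),
            nn.Linear(expansion * d_model, d_model))

    def forward(self, x):                            # x: (B, K, C)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

class MixerLifter(nn.Module):
    def __init__(self, n_keypoints, d_model=64, n_blocks=4):
        super().__init__()
        self.embed = nn.Linear(2, d_model)
        self.blocks = nn.Sequential(*[MixerBlock(n_keypoints, d_model)
                                      for _ in range(n_blocks)])
        self.head = nn.Linear(d_model, 3)

    def forward(self, kp2d):                         # (B, K, 2) -> (B, K, 3)
        return self.head(self.blocks(self.embed(kp2d)))

# 2D-only training signal: lift view-1 keypoints to 3D, rotate by the
# (here synthetic) relative rotation, reproject orthographically, and
# match view 2. No 3D ground truth enters the loss.
model = MixerLifter(n_keypoints=17)
R = torch.linalg.qr(torch.randn(3, 3)).Q             # random rotation proxy
kp2d_v1, kp2d_v2 = torch.randn(8, 17, 2), torch.randn(8, 17, 2)
X = model(kp2d_v1)                                   # (8, 17, 3)
loss = ((X @ R.T)[..., :2] - kp2d_v2).square().mean()
loss.backward()
```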
Template-free 4D rigging enables motion retargeting and novel pose synthesis across arbitrary object categories
Automatically discovers articulation structures without requiring category-specific templates or skeletons.
Transfers motion between instances of the same or different categories while preserving semantic correspondence.
Generates physically plausible new poses by learning the manifold of possible articulations from examples.
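As a concrete picture of what "rigging without a template" produces, the sketch below poses a shape with linear blend skinning over latent parts: soft skinning weights (learned in practice, random here) replace any predefined skeleton, and retargeting reduces to applying one instance's per-part transforms to another's rest shape. All tensors are synthetic placeholders.

```python
# A self-contained sketch of the deformation model template-free rigging
# typically learns: soft weights assign each point to B latent parts (no
# predefined skeleton), and per-part rigid transforms pose the shape via
# linear blend skinning. All tensors below are synthetic placeholders.
import torch

def blend_skinning(points, weights, R, t):
    """points: (N,3) rest shape; weights: (N,B) soft part assignment
    (rows sum to 1); R: (B,3,3) part rotations; t: (B,3) translations."""
    posed_per_part = torch.einsum('bij,nj->nbi', R, points) + t  # (N, B, 3)
    return (weights.unsqueeze(-1) * posed_per_part).sum(dim=1)   # (N, 3)

N, B = 1000, 4                              # surface points, latent parts
points = torch.randn(N, 3)                  # rest-pose shape
weights = torch.softmax(torch.randn(N, B), dim=1)  # learned in practice
R = torch.eye(3).expand(B, 3, 3).clone()    # per-part rotations (identity)
t = torch.zeros(B, 3)
t[0, 1] = 0.5                               # move one latent part upward
posed = blend_skinning(points, weights, R, t)
print(posed.shape)                          # torch.Size([1000, 3])
```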
Research papers supporting this thesis proposal
We present a method for generating dense 3D keypoints from sparse 2D annotations using only two uncalibrated video sequences, requiring minimal human labeling effort.
A transformer-based architecture for 3D lifting that generalizes across diverse object categories with zero-shot capabilities.
Demonstrates why MLP-Mixers outperform attention-based models for unsupervised 3D reconstruction tasks.
Download the full thesis proposal for in-depth methodology, experiments, and future research directions.