Modern self-driving vehicles rely on multi-camera rigs to see the world, but transforming those streams into a coherent 3D understanding remains a challenge. Rig3R, Wayve’s latest advance in geometric foundation models, is designed to power robust ego-motion and 3D structure estimation from embodied camera rigs.
Robustly estimating the structure of the world and ego-motion of the vehicle has been a multi-decade pursuit in computer vision. Classical feature-based and photogrammetry methods have long defined the state of the art in multi-view geometry. However, recent advances have shown that large-scale, transformer-based learning can push this frontier even further. Still, these methods fall short in the autonomous vehicle setting, where multiple cameras continuously capture the world through synchronized, structured rigs.
Rig3R is the first learning-based method to explicitly use multi-camera rig constraints for accurate and robust 3D reconstruction. It achieves state-of-the-art performance in complex, real-world driving scenarios.
The solution extends geometric foundation models with the advantage of rig awareness, leveraging rig information when available and inferring rig structure and calibration when it is not. This flexibility is essential for handling the diverse and evolving sensor setups found in embodied AI systems.

How Rig3R works
Rig3R is a machine learning model that takes images from multiple cameras and builds an accurate 3D understanding of the world. A large Vision Transformer (ViT-Large) processes each input image, breaking it into small patch tokens and tagging them with 2D sine-cosine positional embeddings. Each patch is enriched with a compact metadata tuple: a camera ID, a timestamp and a 6D raymap that encodes the camera’s pose relative to the rig. These fields provide spatial and temporal context, helping the model reason across multiple views and time.
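The precise embedding scheme is given in the Rig3R paper; as a rough illustration of the idea only, per-image metadata conditioning could look like the PyTorch sketch below. Every name and shape here (ImageMetadataEmbedding, the per-patch 6D raymap layout, the dimensions) is a hypothetical stand-in, not Rig3R’s actual implementation.

```python
# Hypothetical sketch of per-image metadata conditioning (not Rig3R's actual code).
# Patch tokens are augmented with a camera-ID embedding, a timestamp embedding and
# a per-patch 6D raymap embedding describing rays in the rig's coordinate frame.
import torch
import torch.nn as nn


class ImageMetadataEmbedding(nn.Module):
    def __init__(self, dim: int = 1024, num_cameras: int = 8):
        super().__init__()
        self.camera_embed = nn.Embedding(num_cameras, dim)  # which camera in the rig
        self.time_mlp = nn.Linear(1, dim)                    # capture timestamp
        self.raymap_mlp = nn.Linear(6, dim)                  # per-patch 6D ray (assumed: origin + direction)

    def forward(self, patch_tokens, camera_id, timestamp, raymap):
        # patch_tokens: (B, N, dim) tokens from the ViT patch embedding
        # camera_id:    (B,)   integer index of the capturing camera
        # timestamp:    (B, 1) capture time in seconds
        # raymap:       (B, N, 6) rig-relative ray parameters per patch
        meta = (
            self.camera_embed(camera_id).unsqueeze(1)  # (B, 1, dim)
            + self.time_mlp(timestamp).unsqueeze(1)    # (B, 1, dim)
            + self.raymap_mlp(raymap)                  # (B, N, dim)
        )
        # Add the metadata signal to every patch token of the image.
        return patch_tokens + meta
```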
A second ViT-Large decoder attends to all images jointly, across views and time, merging visual features, metadata and geometric cues into a shared latent space. During this fusion stage, Rig3R introduces a rig encoder to inject known rig constraints, enabling geometry-aware multi-view reasoning. The model is trained to leverage rig metadata when available but remains robust even when such information is missing. This fused representation forms the core of Rig3R’s multi-view 3D understanding.
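Conceptually, this fusion stage amounts to self-attention over one long token sequence built from every camera and timestep. The sketch below illustrates only that idea; JointFusionDecoder and its hyperparameters are illustrative assumptions, not Rig3R’s actual decoder, which additionally injects rig constraints through its rig encoder.

```python
# Hypothetical sketch of the joint cross-view fusion stage (not Rig3R's actual code).
# Tokens from every camera and timestep are concatenated into one sequence so that
# self-attention can relate patches across the whole rig and across time.
import torch
import torch.nn as nn


class JointFusionDecoder(nn.Module):
    def __init__(self, dim: int = 1024, depth: int = 4, heads: int = 16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens_per_view):
        # tokens_per_view: list of (B, N, dim) tensors, one per (camera, timestep) image,
        # already enriched with camera-ID / timestamp / raymap metadata.
        fused = torch.cat(tokens_per_view, dim=1)  # (B, V * N, dim): one global sequence
        return self.blocks(fused)                  # attention spans all views and all times
```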
Benchmarks and setup
Rig3R is evaluated on two multi-camera driving benchmarks: the Waymo Open validation set and WayveScenes101. Waymo provides lidar-based ground truth, while WayveScenes101 uses COLMAP reconstructions. In the version presented in the recent paper, Rig3R is trained only on the Waymo training split, making WayveScenes101 an out-of-distribution test that evaluates generalization to unseen camera rigs.
Both datasets use five-camera rigs capturing approximately 200 frames per scene at 10 fps. For evaluation, two 24-frame clips per scene are extracted, spaced approximately two seconds apart. Rig3R is benchmarked against feed-forward baselines, classical structure-from-motion and rig-aware methods.
The first objective, pose estimation, is assessed using relative rotation accuracy (RRA) and relative translation accuracy (RTA) at 5° and 15° thresholds, as well as mean average accuracy (mAA) up to 30°. The second objective, 3D pointmap reconstruction, is evaluated using accuracy, completeness and chamfer distance.
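These pose metrics follow standard definitions: relative rotation and translation errors are measured as angles per image pair and then thresholded. The NumPy sketch below shows one common way to compute them; the exact evaluation protocol, in particular the averaging used for mAA, follows the paper rather than this snippet.

```python
# Illustrative pairwise pose metrics (standard definitions; not the paper's exact script).
import numpy as np


def rotation_angle_deg(R_err: np.ndarray) -> float:
    # Geodesic angle of a rotation matrix, in degrees.
    cos = np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))


def pair_errors(R_pred, t_pred, R_gt, t_gt):
    # Rotation error: angle between predicted and ground-truth relative rotations.
    rot_err = rotation_angle_deg(R_pred.T @ R_gt)
    # Translation error: angle between translation directions (scale-agnostic).
    cos = np.clip(
        np.dot(t_pred, t_gt) / (np.linalg.norm(t_pred) * np.linalg.norm(t_gt) + 1e-8),
        -1.0, 1.0,
    )
    return rot_err, float(np.degrees(np.arccos(cos)))


def rra_rta_maa(rot_errs, trans_errs, thresholds=(5, 15), max_deg=30):
    rot_errs, trans_errs = np.asarray(rot_errs), np.asarray(trans_errs)
    rra = {t: float(np.mean(rot_errs < t)) for t in thresholds}   # RRA@5, RRA@15
    rta = {t: float(np.mean(trans_errs < t)) for t in thresholds}  # RTA@5, RTA@15
    # mAA: accuracy with both errors under the threshold, averaged over 1..30 degrees.
    maa = float(np.mean([
        np.mean((rot_errs < t) & (trans_errs < t)) for t in range(1, max_deg + 1)
    ]))
    return rra, rta, maa
```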
Results
The figure above compares pose estimation and pointmap reconstruction across four methods — Fast3R, DUSt3R-GA, Rig3R (unstructured) and Rig3R (with rig constraints) — on the same driving scene. The progression shows how Rig3R’s rig awareness improves both geometric coherence and reconstruction quality: while baseline methods often produce noisy or spatially inconsistent pointmaps and poses, Rig3R yields sharper, more consistent structures with rig poses that align sensibly across all views.
Why metadata matters
As more metadata is introduced, both pose estimation and pointmap reconstruction improve: predicted camera trajectories progressively align with the physical rig and reconstructed pointmaps become sharper and more consistent. This analysis illustrates how such structured priors, often readily available in embodied and robotic systems, can be effectively integrated into a learned geometric model to enhance both accuracy and generalization.
Rig3R achieves strong pose estimation and dense 3D reconstruction across diverse rig setups and conditions, including changes in baseline, field of view, lighting, speed and weather. It produces stable, low-drift trajectories and metrically consistent pointmaps, even in challenging scenes and in-the-wild driving videos — maintaining performance despite partial or missing metadata.
Many open problems remain in the quest to build a spatially intelligent foundation model. In addition to scaling up training, future improvements to Rig3R could include streaming representations, handling scene motion, multimodal inputs, multiple embodiments and multi-task outputs.
This is an edited version of an article that first appeared on Wayve’s blog on October 15, 2025.