Abstract¶
Visual Foundation Models (VFMs), such as DINO and CLIP, exhibit strong semantic understanding but only limited spatial reasoning, which restricts their applicability to embodied systems. Recent work incorporates 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across different tasks, raising the question: do these models truly acquire spatial awareness, or do they merely overfit to specific 3D objectives?
To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates how well model representations capture the relative positions of objects across different viewpoints. SpaRRTa can generate an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with freely accessible spatial annotations.
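To make the kind of data this setup implies concrete, below is a minimal sketch of how one sample in such a synthetic benchmark could be organized; every field name, scene name, and value is an illustrative assumption, not SpaRRTa's actual schema or file format.

from dataclasses import dataclass

# Hypothetical sketch of a synthetic spatial-relation sample; field names and
# values are illustrative assumptions, not SpaRRTa's actual data format.
@dataclass
class SpatialSample:
    image_path: str                         # rendered photorealistic image
    scene: str                              # e.g. an indoor room or a City scene
    camera_pose: list[float]                # flattened 4x4 camera extrinsics
    objects: dict[str, list[float]]         # object name -> 3D position (world frame)
    relations: list[tuple[str, str, str]]   # (subject, relation, reference object)

sample = SpatialSample(
    image_path="renders/city/000123.png",
    scene="City",
    camera_pose=[1.0, 0.0, 0.0, 0.0,
                 0.0, 1.0, 0.0, 0.0,
                 0.0, 0.0, 1.0, 0.0,
                 0.0, 0.0, 0.0, 1.0],
    objects={"chair": [1.2, 0.0, 3.5], "table": [0.4, 0.0, 3.1]},
    relations=[("chair", "right_of", "table")],
)

Because the scenes are generated rather than captured, annotations of this kind come for free with every rendered image, which is what makes arbitrarily large and fully controlled evaluation sets possible.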
Key Statistics¶
Key Findings¶
Main Results
- Spatial information is patch-level: spatial relations are primarily encoded at the patch level and are largely obscured by global pooling.
- 3D supervision enriches patch features: VGGT (3D-supervised) shows improvements only under selective probing, not under linear probing (a sketch of this distinction follows this list).
- Allocentric reasoning is challenging: all models struggle with perspective-taking tasks compared to their egocentric variants (see the geometric sketch below).
- Environment complexity matters: performance degrades significantly in cluttered environments such as City scenes.
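The probing distinction behind the first two findings can be made concrete with a small sketch. It assumes a frozen ViT-style backbone that exposes both a globally pooled embedding and a grid of patch tokens; `backbone`, `LinearProbe`, and `SelectiveProbe` are illustrative names rather than the benchmark's actual evaluation code, and the attention-weighted probe shown here is just one simple way of letting a classifier pick out individual patches.

import torch
import torch.nn as nn

# Sketch only: probing pooled vs. patch-level features of a frozen ViT-style
# backbone. The backbone interface and probe heads are illustrative assumptions.

def extract_features(backbone, images):
    """Assume the frozen backbone returns (global_emb [B, D], patch_tokens [B, N, D])."""
    with torch.no_grad():
        return backbone(images)

class LinearProbe(nn.Module):
    """Linear probe on the pooled embedding: patch-level spatial cues are
    largely averaged away before the classifier ever sees them."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.head = nn.Linear(dim, num_relations)

    def forward(self, global_emb, patch_tokens):
        return self.head(global_emb)

class SelectiveProbe(nn.Module):
    """Probe that learns per-patch relevance weights, so spatially informative
    patches can dominate the prediction instead of being pooled away."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.score = nn.Linear(dim, 1)           # per-patch relevance score
        self.head = nn.Linear(dim, num_relations)

    def forward(self, global_emb, patch_tokens):
        weights = self.score(patch_tokens).softmax(dim=1)   # [B, N, 1]
        pooled = (weights * patch_tokens).sum(dim=1)        # weighted patch pooling
        return self.head(pooled)

Read this way, a 3D-supervised backbone such as VGGT can look no better than a 2D-supervised one under the linear probe, and the gap only becomes visible once the probe is allowed to select among patch tokens.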
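The allocentric finding concerns perspective-taking: judging a spatial relation from a viewpoint other than the camera's. The toy example below illustrates the geometry under an assumed convention (+x is right, +z is forward in every frame); the object positions and the observer placed opposite the camera are made up purely for illustration.

import numpy as np

# Toy illustration of egocentric vs. allocentric "left of" judgments.
# Assumed convention: +x is right and +z is forward in every frame.

def is_left_of(target, reference, frame_rotation, frame_origin):
    """True if `target` lies to the left of `reference` (smaller x) once both
    points are expressed in the frame given by (frame_rotation, frame_origin)."""
    def to_frame(point):
        return frame_rotation.T @ (np.asarray(point) - np.asarray(frame_origin))
    return to_frame(target)[0] < to_frame(reference)[0]

chair = np.array([-1.0, 0.0, 4.0])
table = np.array([1.0, 0.0, 4.0])

# Egocentric: judged from the camera at the origin, looking down +z.
print(is_left_of(chair, table, np.eye(3), np.zeros(3)))          # True

# Allocentric: judged from an observer facing back toward the camera
# (rotated 180 degrees about the y-axis), so left and right swap.
observer_rotation = np.array([[-1.0, 0.0, 0.0],
                              [ 0.0, 1.0, 0.0],
                              [ 0.0, 0.0, -1.0]])
print(is_left_of(chair, table, observer_rotation, np.array([0.0, 0.0, 8.0])))  # False

The same pair of objects thus receives opposite labels in the two frames, which is exactly the kind of frame change that perspective-taking questions require a model's representation to handle.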
Environments¶
Evaluation Pipeline¶
Authors¶
Turhan Can Kargin, Wojciech Jasiński, Adam Pardyl, Bartosz Zieliński, Marcin Przewięźlikowski
Affiliations¶
Faculty of Mathematics and Computer Science, Jagiellonian University
Citation¶
If you find SpaRRTa useful in your research, please cite our paper:
@article{kargin2025sparrta,
  title={SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models},
  author={Kargin, Turhan Can and Jasiński, Wojciech and Pardyl, Adam and Zieliński, Bartosz and Przewięźlikowski, Marcin},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
Acknowledgments¶
This work was supported by the Polish National Science Center and conducted at the Faculty of Mathematics and Computer Science, Jagiellonian University.