
SpaRRTa

Spatial Relation Recognition Task
A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models

SpaRRTa Teaser

Abstract

Visual Foundation Models (VFMs), such as DINO and CLIP, exhibit strong semantic understanding but limited spatial reasoning, which restricts their applicability to embodied systems. Recent work incorporates 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across tasks, raising the question: do these models truly acquire spatial awareness, or do they merely overfit to specific 3D objectives?

To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates how well VFM representations encode the relative positions of objects across different viewpoints. SpaRRTa can generate an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with spatial annotations that are obtained for free from the engine.

🎮 Unreal Engine 5: Photorealistic synthetic scenes with full control over object placement, camera positions, and environmental conditions.
🔬 Spatial Reasoning: Evaluates abstract, human-like relational understanding beyond simple depth estimation or metric prediction.
👁️ Egocentric & Allocentric: Two task variants testing camera-centric and perspective-taking spatial reasoning abilities (a worked example follows this list).
📊 Comprehensive Benchmark: Evaluates 13+ VFMs across 5 diverse environments with multiple probing strategies.
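To make the distinction between the two task variants concrete, the sketch below derives a left/right label from 3D object positions, once from the camera's viewpoint (egocentric) and once from a reference object's own viewpoint (allocentric). It is a minimal illustration under assumed conventions (Z-up world, illustrative object positions, a simple dot-product relation rule), not the benchmark's actual annotation code.

import numpy as np

def left_right_label(target_xyz, anchor_xyz, viewpoint_xyz, forward):
    """Decide whether `target` is left or right of `anchor`, as seen from
    `viewpoint` looking along `forward` (all points in world coordinates).
    Hypothetical helper: the benchmark's real relation definitions may differ."""
    forward = forward / np.linalg.norm(forward)
    up = np.array([0.0, 0.0, 1.0])                 # assume a Z-up world
    right = np.cross(forward, up)                  # viewer's right-hand direction
    right = right / np.linalg.norm(right)
    offset = target_xyz - anchor_xyz               # anchor -> target displacement
    return "right" if offset @ right > 0 else "left"

# Egocentric: relation judged from the camera's own viewpoint.
camera_pos = np.array([0.0, -5.0, 1.6])
camera_forward = np.array([0.0, 1.0, 0.0])
mug, plant = np.array([1.0, 0.0, 0.8]), np.array([-1.0, 0.5, 0.0])
ego = left_right_label(mug, plant, camera_pos, camera_forward)

# Allocentric: the same pair judged from the reference object's perspective,
# here using the plant's assumed facing direction instead of the camera's.
plant_forward = np.array([0.0, -1.0, 0.0])         # plant "faces" the camera
allo = left_right_label(mug, plant, plant, plant_forward)
print(ego, allo)                                   # the two labels can disagree

The disagreement between the two printed labels is exactly the perspective-taking step that makes the allocentric variant harder.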

Key Statistics

5 Environments
13+ VFMs Evaluated
50K+ Images
3 Probing Methods

Key Findings

Main Results

  1. Spatial information is patch-level: Spatial relations are primarily encoded at the patch level and largely obscured by global pooling (see the probing sketch after this list)

  2. 3D supervision enriches patch features: VGGT (3D-supervised) shows improvements only with selective probing, not linear probing

  3. Allocentric reasoning is challenging: All models struggle with perspective-taking tasks compared to egocentric variants

  4. Environment complexity matters: Performance degrades significantly in cluttered environments like City scenes
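The difference between findings 1 and 2 is essentially what the probe gets to see. The sketch below contrasts a linear probe on globally pooled features with a selective probe that attends over patch tokens. It is a minimal, hypothetical PyTorch setup (random stand-in features, a generic attention-pooling head, 6 relation classes), not the paper's exact probing architecture.

import torch
import torch.nn as nn

class PooledLinearProbe(nn.Module):
    """Linear probe on the globally pooled feature; pooling can obscure
    spatial information that lives in individual patch tokens (finding 1)."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.head = nn.Linear(dim, num_relations)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        pooled = patch_tokens.mean(dim=1)          # [B, N, D] -> [B, D]
        return self.head(pooled)

class PatchAttentionProbe(nn.Module):
    """Selective probe: learns to attend to the patch tokens relevant to the
    queried relation before classifying (finding 2)."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_relations)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        selected, _ = self.attn(q, patch_tokens, patch_tokens)  # [B, 1, D]
        return self.head(selected.squeeze(1))

# Frozen-backbone usage: `patch_tokens` would come from a VFM such as DINO or
# CLIP with gradients disabled; here random features stand in for them.
patch_tokens = torch.randn(8, 196, 768)            # [batch, patches, dim]
labels = torch.randint(0, 6, (8,))                 # e.g. 6 relation classes
for probe in (PooledLinearProbe(768, 6), PatchAttentionProbe(768, 6)):
    loss = nn.functional.cross_entropy(probe(patch_tokens), labels)
    loss.backward()                                # only the probe is trained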

Environments


Evaluation Pipeline

Pipeline

The SpaRRTa evaluation pipeline: (1) Set Stage with diverse assets, (2) Set Camera position, (3) Render photorealistic image, (4) Extract ground truth, (5) Run VFM and probe, (6) Calculate accuracy.
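Read end to end, the six stages form a data-generation loop followed by a standard probing evaluation. The sketch below makes that structure explicit; every helper (set_stage, set_camera, render, extract_relations, run_vfm_and_probe) is a hypothetical stub, since the actual scene setup, rendering, and ground-truth extraction happen inside Unreal Engine 5.

import random

# Hypothetical stand-ins: in SpaRRTa these steps run inside Unreal Engine 5,
# and ground-truth relations are read directly from the engine's scene state.
def set_stage():                      return {"objects": ["mug", "plant"]}   # (1)
def set_camera(scene):                return {"pose": "random viewpoint"}    # (2)
def render(scene, camera):            return "rendered image"                # (3)
def extract_relations(scene, camera): return ["left"]                        # (4)

def run_vfm_and_probe(image):                                                # (5)
    # Stand-in for a frozen VFM plus a trained probe predicting relations.
    return [random.choice(["left", "right"])]

def evaluate(num_scenes: int) -> float:                                      # (6)
    correct = total = 0
    for _ in range(num_scenes):
        scene = set_stage()
        camera = set_camera(scene)
        image = render(scene, camera)
        ground_truth = extract_relations(scene, camera)
        predictions = run_vfm_and_probe(image)
        correct += sum(p == g for p, g in zip(predictions, ground_truth))
        total += len(ground_truth)
    return correct / total

print(f"relation accuracy: {evaluate(100):.2f}")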

Authors

Turhan Can Kargin
Jagiellonian University
Wojciech Jasiński
Jagiellonian University, AGH
Adam Pardyl
Jagiellonian University, IDEAS NCBR
Bartosz Zieliński
Jagiellonian University
Marcin Przewięźlikowski
Jagiellonian University

Affiliations

Citation

If you find SpaRRTa useful in your research, please cite our paper:

@article{kargin2025sparrta,
  title={SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models},
  author={Kargin, Turhan Can and Jasiński, Wojciech and Pardyl, Adam and Zieliński, Bartosz and Przewięźlikowski, Marcin},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

Acknowledgments

This work was supported by the Polish National Science Center and conducted at the Faculty of Mathematics and Computer Science, Jagiellonian University.