
Welcome to SpaRRTa

We're excited to announce SpaRRTa (Spatial Relation Recognition Task), a new synthetic benchmark for evaluating spatial intelligence in Visual Foundation Models.

What is SpaRRTa?

SpaRRTa is a benchmark designed to evaluate how Visual Foundation Models (VFMs) encode and represent spatial relations between objects. Unlike traditional 3D benchmarks that focus on explicit metric predictions like depth estimation, SpaRRTa targets abstract, human-like relational spatial reasoning.

Key Features

  • Photorealistic Synthetic Data: Built with Unreal Engine 5 for high-fidelity images
  • Diverse Environments: 5 distinct environments from sparse deserts to dense urban scenes
  • Two Task Variants: Egocentric (camera-view) and Allocentric (perspective-taking) tasks
  • Comprehensive Evaluation: Support for multiple probing strategies
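To make the probing-based evaluation concrete, here is a minimal sketch of one common strategy, linear probing: a linear classifier is trained on frozen VFM features to predict the spatial relation between an object pair. All names and numbers below (NUM_RELATIONS, FEATURE_DIM, the random stand-in features) are illustrative assumptions, not part of the SpaRRTa API; a real run would replace the random vectors with pooled embeddings from the model under test.

```python
import numpy as np

# Illustrative constants, not SpaRRTa-defined values.
NUM_RELATIONS = 4   # e.g. left-of, right-of, in-front-of, behind
FEATURE_DIM = 64    # dimensionality of the frozen VFM feature
NUM_SAMPLES = 1000

rng = np.random.default_rng(0)
# Stand-in for frozen VFM features of object pairs; in practice these
# would come from the vision backbone being evaluated.
features = rng.normal(size=(NUM_SAMPLES, FEATURE_DIM))
labels = rng.integers(0, NUM_RELATIONS, size=NUM_SAMPLES)

# Multinomial logistic-regression probe trained with plain gradient descent.
Y = np.eye(NUM_RELATIONS)[labels]          # one-hot targets
W = np.zeros((FEATURE_DIM, NUM_RELATIONS))
b = np.zeros(NUM_RELATIONS)

for _ in range(200):
    logits = features @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - Y) / NUM_SAMPLES
    W -= 0.5 * features.T @ grad
    b -= 0.5 * grad.sum(axis=0)

pred = (features @ W + b).argmax(axis=1)
acc = (pred == labels).mean()
print(f"probe accuracy: {acc:.3f}")
```

Because the backbone stays frozen, probe accuracy reflects how linearly accessible the relation information is in the features, which is the quantity benchmarks of this kind typically report.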

Why SpaRRTa?

Visual Foundation Models have demonstrated remarkable performance in semantic understanding, but their spatial reasoning capabilities remain understudied. SpaRRTa provides a systematic way to evaluate this capability, which is essential for embodied AI applications.

Getting Started

Check out our Getting Started Guide to begin using SpaRRTa in your research.

Citation

If you use SpaRRTa in your research, please cite our paper:

@article{kargin2025sparrta,
  title={SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models},
  author={Kargin, Turhan Can and Jasiński, Wojciech and Pardyl, Adam and Zieliński, Bartosz and Przewięźlikowski, Marcin},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

Stay tuned for more updates!