
Welcome to SpaRRTa

We're excited to announce SpaRRTa (Spatial Relation Recognition Task), a new synthetic benchmark for evaluating spatial intelligence in Visual Foundation Models.

What is SpaRRTa?

SpaRRTa is a benchmark designed to evaluate how Visual Foundation Models (VFMs) encode and represent spatial relations between objects. Unlike traditional 3D benchmarks that focus on explicit metric predictions like depth estimation, SpaRRTa targets abstract, human-like relational spatial reasoning.

Key Features

  • Photorealistic Synthetic Data: Built with Unreal Engine 5 for high-fidelity images
  • Diverse Environments: 5 distinct environments from sparse deserts to dense urban scenes
  • Two Task Variants: Egocentric (camera-view) and Allocentric (perspective-taking) tasks
  • Comprehensive Evaluation: Support for multiple probing strategies
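To make the probing-based evaluation concrete, here is a minimal sketch of one common strategy, linear probing: a linear classifier is trained on frozen VFM features to predict the spatial relation between an object pair. All names and numbers below (NUM_RELATIONS, FEATURE_DIM, the random stand-in features) are illustrative assumptions, not part of the SpaRRTa API; a real run would replace the random vectors with pooled embeddings from the model under test.

```python
import numpy as np

# Illustrative constants, not SpaRRTa-defined values.
NUM_RELATIONS = 4   # e.g. left-of, right-of, in-front-of, behind
FEATURE_DIM = 64    # dimensionality of the frozen VFM feature
NUM_SAMPLES = 1000

rng = np.random.default_rng(0)
# Stand-in for frozen VFM features of object pairs; in practice these
# would come from the vision backbone being evaluated.
features = rng.normal(size=(NUM_SAMPLES, FEATURE_DIM))
labels = rng.integers(0, NUM_RELATIONS, size=NUM_SAMPLES)

# Multinomial logistic-regression probe trained with plain gradient descent.
Y = np.eye(NUM_RELATIONS)[labels]          # one-hot targets
W = np.zeros((FEATURE_DIM, NUM_RELATIONS))
b = np.zeros(NUM_RELATIONS)

for _ in range(200):
    logits = features @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - Y) / NUM_SAMPLES
    W -= 0.5 * features.T @ grad
    b -= 0.5 * grad.sum(axis=0)

pred = (features @ W + b).argmax(axis=1)
acc = (pred == labels).mean()
print(f"probe accuracy: {acc:.3f}")
```

Because the backbone stays frozen, probe accuracy reflects how linearly accessible the relation information is in the features, which is the quantity benchmarks of this kind typically report.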

Why SpaRRTa?

Visual Foundation Models have demonstrated remarkable performance in semantic understanding, but their spatial reasoning capabilities remain understudied. SpaRRTa provides a systematic way to evaluate this capability, which is essential for embodied AI applications.

Getting Started

Check out our Getting Started Guide to begin using SpaRRTa in your research.

Citation

If you use SpaRRTa in your research, please cite our paper:

@article{kargin2025sparrta,
  title={SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models},
  author={Kargin, Turhan Can and Jasiński, Wojciech and Pardyl, Adam and Zieliński, Bartosz and Przewięźlikowski, Marcin},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

Stay tuned for more updates!