
Evaluation of Visual Foundation Models¶

SpaRRTa provides a systematic methodology for evaluating how Visual Foundation Models (VFMs) encode and represent spatial relations between objects.

The Spatial Relation Recognition Task¶

The core task is to determine the relative spatial relation between two objects in an image:

  • Source Object: The reference point for the spatial relation
  • Target Object: The object whose position is being queried
  • Viewpoint: The perspective from which the relation is evaluated

Figure: Left, egocentric task (camera viewpoint); right, allocentric task (human viewpoint).

Task Variants¶

Egocentric: The camera defines the viewpoint for spatial relations.

Example Query

"Where is the tree (target) relative to the car (source) from the camera's perspective?"

Characteristics:

  • Directly observable from input image
  • Simpler—no perspective transformation needed
  • Tests basic spatial layout understanding

Answer: The relation as seen from the camera (Front/Back/Left/Right)

Allocentric: A third object (a human) defines the viewpoint for spatial relations.

Example Query

"Where is the tree (target) relative to the car (source) from the human's perspective?"

Characteristics:

  • Requires implicit perspective transformation
  • More challenging—must reason from another viewpoint
  • Tests abstract spatial reasoning capability

Answer: The relation as would be seen from the human's position

Classification Labels¶

The task is formulated as a 4-way classification:

| Label | Description |
| --- | --- |
| Front | Target is in front of the source (from the viewpoint) |
| Back | Target is behind the source (from the viewpoint) |
| Left | Target is to the left of the source (from the viewpoint) |
| Right | Target is to the right of the source (from the viewpoint) |
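
To make the label definition concrete, here is a minimal sketch of how the ground-truth relation can be derived from 2D ground-plane positions and a viewpoint heading. The function name, coordinate convention, and dominant-axis rule are illustrative assumptions, not SpaRRTa's released code; the same routine covers both variants by passing the camera's heading (egocentric) or the human's heading (allocentric).

import numpy as np

def spatial_relation(source, target, viewpoint_heading):
    d = np.asarray(target, float) - np.asarray(source, float)   # vector from source to target
    fwd = np.asarray(viewpoint_heading, float)
    fwd = fwd / np.linalg.norm(fwd)             # viewpoint forward direction (unit vector)
    left = np.array([-fwd[1], fwd[0]])          # 90 degrees counter-clockwise of forward
    along, across = d @ fwd, d @ left
    if abs(along) >= abs(across):               # dominant axis decides the 4-way label
        return "Front" if along > 0 else "Back"
    return "Left" if across > 0 else "Right"

print(spatial_relation(source=(0, 0), target=(2, 1), viewpoint_heading=(1, 0)))  # Front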

Evaluated Models¶

We evaluate a diverse suite of VFMs spanning different learning paradigms:

Joint-Embedding Architectures (JEA)¶

| Model | Backbone | Pre-training | Dataset |
| --- | --- | --- | --- |
| DINO | ViT-B/16 | Contrastive / Distillation | ImageNet-1k |
| DINO-v2 | ViT-B/14 | DINO + iBOT | LVD-142M |
| DINO-v2 (+reg) | ViT-B/14, ViT-L/14 | DINO-v2 w/ Register Tokens | LVD-142M |
| DINOv3 | ViT-B/16 | DINO + iBOT | LVD-1689M |

Masked Image Modeling (MIM)¶

| Model | Backbone | Pre-training | Dataset |
| --- | --- | --- | --- |
| MAE | ViT-B/16 | Pixel Reconstruction | ImageNet-1k |
| MaskFeat | ViT-B/16 | HOG Feature Prediction | ImageNet-1k |
| SPA | ViT-B/16 | Masked Volumetric Neural Rendering | ScanNet, Hypersim, S3DIS |
| CroCo | ViT-B/16 | Cross-View Completion | Habitat |
| CroCo v2 | ViT-B/16 | Cross-View Completion | ARKitScenes, MegaDepth, ... |

Supervised & Weakly Supervised¶

| Model | Backbone | Pre-training | Dataset |
| --- | --- | --- | --- |
| VGGT | ViT-L/14 | Multi-Task 3D Regression | Co3D, MegaDepth, etc. |
| DeiT | ViT-B/16 | Classification + Distillation | ImageNet-1k |
| CLIP | ViT-B/16 | Image-Text Contrastive | Web Image-Text (WIT) |

Probing Methodology¶

We evaluate frozen VFM representations using lightweight probing heads:

Figure: The three probing strategies: Linear Probing with GAP, AbMILP, and Efficient Probing.
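
All three probing heads consume frozen patch tokens from the VFM. A minimal extraction sketch, assuming a timm ViT backbone; the model name and the single CLS prefix token are assumptions (register-token variants prepend more tokens, and each VFM has its own loading code):

import torch
import timm

vfm = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
for p in vfm.parameters():
    p.requires_grad_(False)                    # keep the backbone frozen

image = torch.randn(1, 3, 224, 224)            # placeholder input batch
with torch.no_grad():
    tokens = vfm.forward_features(image)       # [1, 1 + N, D] for a CLS-token ViT
features = tokens[0, 1:]                       # drop the CLS token -> [N, D] patch features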

Probing Strategies¶

Linear Probing: Global Average Pooling + Linear Classifier

features = vfm.extract_patches(image)  # [N, D]
global_feat = features.mean(dim=0)     # [D]
prediction = linear_layer(global_feat)  # [4]

Pros: Simple baseline, standard evaluation protocol

Cons: Treats all patches equally, loses local spatial information
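
A self-contained PyTorch sketch of this head; the feature dimension and dropout value are illustrative:

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, dim=768, num_classes=4, dropout=0.2):
        super().__init__()
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(dim, num_classes))

    def forward(self, patch_features):            # [B, N, D] frozen patch tokens
        pooled = patch_features.mean(dim=1)       # global average pooling -> [B, D]
        return self.head(pooled)                  # [B, 4] logits (Front/Back/Left/Right)

logits = LinearProbe()(torch.randn(8, 196, 768))  # -> [8, 4]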

AbMILP: Attention-Based Multiple Instance Learning Pooling

features = vfm.extract_patches(image)      # [N, D]
attention = attention_mlp(features)        # [N, 1]
attention = softmax(attention, dim=0)      # [N, 1]
weighted_feat = (attention * features).sum(dim=0)  # [D]
prediction = linear_layer(weighted_feat)   # [4]

Pros: Learns to focus on relevant patches

Cons: Single attention map may not capture multiple objects
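
A self-contained sketch of an AbMILP-style head; the hidden width of the attention MLP is an assumption:

import torch
import torch.nn as nn

class AbMILPProbe(nn.Module):
    def __init__(self, dim=768, hidden=256, num_classes=4):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_features):                      # [B, N, D]
        weights = self.attn(patch_features).softmax(dim=1)  # [B, N, 1], sums to 1 over patches
        pooled = (weights * patch_features).sum(dim=1)      # attention-weighted average -> [B, D]
        return self.head(pooled)                            # [B, 4]

logits = AbMILPProbe()(torch.randn(8, 196, 768))            # -> [8, 4]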

Efficient Probing: Multi-Query Cross-Attention

features = vfm.extract_patches(image)      # [N, D]
queries = learnable_queries                # [Q, D']
attended = cross_attention(queries, features)  # [Q, D']
prediction = linear_layer(attended.flatten())  # [4]

Pros: Multiple queries can specialize to different objects/regions

Cons: More parameters, may overfit on small datasets
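
A self-contained sketch of a multi-query cross-attention head; the use of nn.MultiheadAttention and the number of attention heads are assumptions about the exact architecture:

import torch
import torch.nn as nn

class EfficientProbe(nn.Module):
    def __init__(self, dim=768, num_queries=4, num_heads=8, num_classes=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)     # [Q, D] learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(num_queries * dim, num_classes)

    def forward(self, patch_features):                                        # [B, N, D]
        q = self.queries.unsqueeze(0).expand(patch_features.size(0), -1, -1)  # [B, Q, D]
        attended, _ = self.cross_attn(q, patch_features, patch_features)      # queries attend to patches
        return self.head(attended.flatten(1))                                 # concatenate queries -> [B, 4]

logits = EfficientProbe()(torch.randn(8, 196, 768))                           # -> [8, 4]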

Hyperparameters¶

| Parameter | Linear | AbMILP | Efficient |
| --- | --- | --- | --- |
| Optimizer | AdamW | AdamW | AdamW |
| Learning Rate | 1e-2, 1e-3, 1e-4 | 1e-2, 1e-3, 1e-4 | 1e-2, 1e-3, 1e-4 |
| Weight Decay | 0.001 | 0.001 | 0.001 |
| Dropout | 0.2, 0.4, 0.6 | 0.2, 0.4, 0.6 | 0.2, 0.4, 0.6 |
| Batch Size | 256 | 256 | 256 |
| Epochs | 1000 | 500 | 500 |
| Warmup Steps | 200 | 100 | 100 |
| Queries (Q) | - | - | 4 |
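
A sketch of how one cell of this grid might be trained; the linear-warmup scheduler and the stand-in head are assumptions about implementation details not stated above:

import itertools
import torch
import torch.nn as nn

for lr, dropout in itertools.product([1e-2, 1e-3, 1e-4], [0.2, 0.4, 0.6]):
    probe = nn.Sequential(nn.Dropout(dropout), nn.Linear(768, 4))       # stand-in probing head
    opt = torch.optim.AdamW(probe.parameters(), lr=lr, weight_decay=1e-3)
    warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01, total_iters=200)
    # ... train with batch size 256 for the prescribed epochs, stepping `warmup`
    #     during the first warmup steps, and keep the best validation checkpoint.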

Evaluation Protocol¶

Data Splits¶

For each environment and object triple:

  • Training: 80%
  • Validation: 10% (hyperparameter selection)
  • Test: 10% (final evaluation)

Metrics¶

  • Accuracy: Primary metric (4-way classification)
  • Mean Rank: Each model's rank (by accuracy) averaged across environments/tasks (see the sketch after this list)
  • Per-Environment Accuracy: Fine-grained analysis
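
Mean rank is computed by ranking the models by accuracy within each environment/task and averaging each model's rank; lower is better. A small sketch with placeholder model names and accuracies:

import numpy as np

acc = {"model_a": [0.80, 0.70, 0.75],       # per-environment accuracies (placeholder values)
       "model_b": [0.78, 0.74, 0.73],
       "model_c": [0.65, 0.68, 0.70]}
table = np.array(list(acc.values()))                        # [num_models, num_envs]
ranks = np.argsort(np.argsort(-table, axis=0), axis=0) + 1  # 1 = best in each environment
mean_rank = {m: r for m, r in zip(acc, ranks.mean(axis=1))}
print(mean_rank)                                            # lower mean rank = better model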

Reproducibility¶

  • Random Seeds: 2 seeds per experiment
  • Object Triples: 3 distinct triples per environment
  • Checkpoint Selection: Best checkpoint chosen on the validation set

Key Insights¶

Performance Hierarchy¶

graph LR
    A[Linear Probing] -->|"outperformed by"| B[AbMILP]
    B -->|"outperformed by"| C[Efficient Probing]

    style A fill:#ff6b6b,color:#fff
    style B fill:#feca57,color:#000
    style C fill:#1dd1a1,color:#fff

Main Finding

Spatial information is primarily encoded at the patch level and is largely obscured by global pooling. Selective probing mechanisms (AbMILP, Efficient Probing) consistently outperform linear probing.

Model Rankings¶

Figure: Impact of probing strategy on spatial accuracy across all VFMs (scatter comparison).

Top Performers:

  1. VGGT (with Efficient Probing) - Best overall spatial reasoning
  2. DINO-v2 (+reg) ViT-L - Strong across all probing methods
  3. DINOv3 - Excellent with Efficient Probing
  4. MAE - Surprisingly strong performance

Underperformers:

  • CLIP - Limited spatial awareness
  • DeiT - Semantic features don't transfer to spatial tasks

Task Difficulty¶

Figure: Egocentric (easier) vs. allocentric (harder) performance comparison across all VFMs.

Allocentric Challenge

All models show significant performance drops on allocentric tasks compared to egocentric. This indicates that perspective-taking remains a fundamental challenge for current VFMs.