Unreal Scene Generation¶
SpaRRTa leverages Unreal Engine 5 to generate photorealistic synthetic images with precise control over object placement, camera positions, and environmental conditions. This enables the creation of a rigorous benchmark with mathematically precise ground-truth labels.
Why Synthetic Data?¶
Real-world imagery requires manual annotation of spatial relations, which is costly and introduces label noise near class boundaries. Because the engine exposes exact object and camera poses, every ground-truth label is derived from precise geometry rather than human judgment, and scene factors such as placement, viewpoint, and lighting can be varied systematically.
Evaluation Environments¶
SpaRRTa includes five diverse high-fidelity environments to ensure robust evaluation across different visual domains:
Environment Details¶
Electric Dreams Environment (Forest)
A sparse forest landscape with complex foliage, uneven terrain, and natural rock formations. This environment tests spatial reasoning in organic, unstructured settings.
- Source: Electric Dreams (Epic Games)
- Characteristics: Complex foliage, uneven terrain, natural lighting
- Objects: Bear, Fox, Tent, Rocks, Trees
Arid Landscape (Desert)
A vast, arid landscape characterized by open terrain, sand dunes, and high-contrast lighting. This environment is sparse and texture-homogeneous.
- Characteristics: Open terrain, high contrast lighting, minimal occlusion
- Objects: Camel, Barrel, Cactus, Rocks
Eastern European Village (Winter Town)
A snow-covered setting modeled on a small Eastern European town, with cold lighting and village buildings.
- Characteristics: Cold lighting, snow textures, village architecture
- Objects: Husky, Deer, Snowman
Valley Infrastructure (Bridge)
A valley scene centered around a large bridge infrastructure with mixed natural and man-made elements.
- Characteristics: Infrastructure elements, valley terrain, mixed complexity
- Objects: Bicycle, Trash Can, Vehicle
Modern Metropolis (City)
A large-scale, modern American metropolis featuring high-rise architecture, paved roads, and complex urban geometry.
- Source: City Sample (Epic Games)
- Characteristics: Dense urban geometry, complex occlusion, varied lighting
- Objects: Motorcycle, Traffic Cone, Fire Hydrant
Asset Library¶
Asset Selection Criteria¶
Our asset selection follows specific criteria to ensure valid spatial reasoning evaluation:
- ImageNet Alignment: Objects align with common ImageNet super-categories to ensure vision foundation models (VFMs) can recognize them
- Isotropic Sources: Source objects (rocks, trees, cones) are rotationally symmetric to minimize orientation ambiguity
- Environmental Coherence: Objects naturally fit their respective environments (e.g., camels in desert)
- Visual Distinctiveness: Objects are clearly distinguishable from backgrounds and each other
| Category | Assets |
|---|---|
| Animals | Bear, Fox, Camel, Husky, Deer |
| Vehicles | Car, Taxi, Motorcycle, Bicycle |
| Nature | Trees, Rocks, Cactus |
| Objects | Tent, Barrel, Trash Can, Traffic Cone, Fire Hydrant, Snowman |
| Humans | Human agent (viewpoint for allocentric tasks) |
Data Generation Pipeline¶
Pipeline Steps¶
```mermaid
flowchart LR
    A[Set Stage] --> B[Set Camera]
    B --> C[Render View]
    C --> D[Get Ground Truth]
    D --> E[Run Model]
    E --> F[Calculate Results]
    style A fill:#7c4dff,color:#fff
    style B fill:#7c4dff,color:#fff
    style C fill:#536dfe,color:#fff
    style D fill:#536dfe,color:#fff
    style E fill:#3f1dcb,color:#fff
    style F fill:#3f1dcb,color:#fff
```
1. Set Stage¶
The evaluator establishes the scene configuration:
- Select environment (Forest, Desert, Winter Town, Bridge, City)
- Choose source, target, and viewpoint objects from the asset library
- Randomly sample object positions from a Gaussian distribution
- Apply physics-aware terrain adaptation via raycasting
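A minimal sketch of the selection step is shown below; the `TRIPLES` mapping is a hypothetical structure built from the Environment-Asset Relations table at the end of this page, and position sampling with terrain adaptation is expanded in the Object Placement Algorithm section.

```python
import random

# Hypothetical mapping of environments to (source, target) triples,
# drawn from the Environment-Asset Relations table (two environments shown).
TRIPLES = {
    "Forest": [("Tree", "Rock"), ("Bear", "Tent"), ("Fox", "Rock")],
    "Desert": [("Truck", "Rock"), ("Camel", "Cactus"), ("Camel", "Barrel")],
    # ... remaining environments omitted
}

def set_stage(rng=random):
    """Pick an environment and one of its (source, target) triples."""
    environment = rng.choice(sorted(TRIPLES))
    source, target = rng.choice(TRIPLES[environment])
    return environment, source, target
```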
2. Set Camera¶
Configure the viewpoint for image capture:
- Sample camera position within a defined area surrounding scene center
- Orient camera toward placed objects
- Validate visibility constraints (objects within field of view)
- Ensure proper scene composition (no extreme clustering or distance)
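A minimal sketch of this step, assuming 2D positions and a yaw-only camera; `CAMERA_RADIUS` is an illustrative parameter, and `FOV_DEG` anticipates the Camera Configuration section below.

```python
import math
import random

FOV_DEG = 53.13         # horizontal FOV; see Camera Configuration below
CAMERA_RADIUS = 1500.0  # illustrative sampling radius (UE units are cm)

def set_camera(scene_center, object_positions, rng=random):
    """Sample a camera pose; return None so the caller can resample."""
    # Sample a position on an annulus around the scene center
    theta = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(0.5 * CAMERA_RADIUS, CAMERA_RADIUS)
    cam = (scene_center[0] + dist * math.cos(theta),
           scene_center[1] + dist * math.sin(theta))

    # Orient the camera toward the centroid of the placed objects (yaw only)
    cx = sum(x for x, _ in object_positions) / len(object_positions)
    cy = sum(y for _, y in object_positions) / len(object_positions)
    yaw = math.degrees(math.atan2(cy - cam[1], cx - cam[0]))

    # Reject poses where any object falls outside the horizontal FOV
    for x, y in object_positions:
        bearing = math.degrees(math.atan2(y - cam[1], x - cam[0]))
        if abs((bearing - yaw + 180.0) % 360.0 - 180.0) > FOV_DEG / 2:
            return None
    return cam, yaw
```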
3. Render View¶
Generate high-fidelity imagery using Unreal Engine 5:
- Ray-traced RGB image with dynamic global illumination
- Ground-truth segmentation masks for validation
- Resolution: 224×224 (standard VFM input size)
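Both outputs can be captured through UnrealCV's request interface; the sketch below assumes a running UE5 instance with the UnrealCV plugin listening on its default port.

```python
from unrealcv import Client

# Connect to the UnrealCV server embedded in the running UE5 instance
# (default host/port; adjust to your setup)
client = Client(('localhost', 9000))
client.connect()

# Capture the ray-traced RGB view and the segmentation mask for camera 0
client.request('vget /camera/0/lit lit.png')
client.request('vget /camera/0/object_mask mask.png')

client.disconnect()
```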
4. Get Ground Truth¶
Extract spatial relation labels:
- Calculate angular relationship between source and target objects
- Apply viewpoint transformation (camera for ego, human for allo)
- Filter ambiguous configurations (objects near decision boundaries)
- Assign discrete label: Front, Back, Left, or Right
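A minimal 2D sketch of label extraction, assuming a yaw-only viewpoint; the function names are illustrative.

```python
import math

def relative_angle(source_xy, target_xy, viewpoint_yaw_deg):
    """Direction from source to target in the viewpoint's frame, in [0, 360)."""
    dx = target_xy[0] - source_xy[0]
    dy = target_xy[1] - source_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))
    return (bearing - viewpoint_yaw_deg) % 360.0

def angular_label(angle_deg):
    """Quadrant label. Left/right orientation depends on the engine's
    coordinate handedness (UE5 is left-handed), so swap if angles run clockwise."""
    if angle_deg < 45.0 or angle_deg >= 315.0:
        return "Front"
    if angle_deg < 135.0:
        return "Left"
    if angle_deg < 225.0:
        return "Back"
    return "Right"
```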
Geometric Ambiguity Control¶
A key challenge in spatial classification is defining precise boundaries between classes. SpaRRTa implements strict rejection sampling to eliminate label noise:
Exclusion Zones¶
Ambiguity zones are defined as conical regions centered on the diagonal directions:
- 45°, 135°, 225°, 315° relative to the viewpoint's forward vector
- Any sample where the target falls within these zones is automatically rejected
- This guarantees unambiguous ground-truth labels
Rejection Sampling
The pipeline automatically discards configurations where the target object lies within ±22.5° of a diagonal boundary, ensuring all retained samples have mathematically precise labels.
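A sketch of the rejection test, reusing `relative_angle` from the previous sketch:

```python
DIAGONALS = (45.0, 135.0, 225.0, 315.0)
HALF_ZONE = 22.5  # degrees on each side of a diagonal

def in_exclusion_zone(angle_deg):
    """True if the relative angle falls inside a rejected conical zone."""
    return any(
        abs((angle_deg - d + 180.0) % 360.0 - 180.0) < HALF_ZONE
        for d in DIAGONALS
    )

# Example: 50° lies inside the 45° zone and is rejected; 10° is kept as Front.
assert in_exclusion_zone(50.0) and not in_exclusion_zone(10.0)
```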
Technical Implementation¶
Rendering Stack¶
| Component | Details |
|---|---|
| Engine | Unreal Engine 5.5 |
| Lighting | Lumen (dynamic global illumination) |
| Geometry | Nanite (virtualized geometry) |
| API | Python Editor API + UnrealCV |
| Hardware | 2× NVIDIA RTX 2080 Ti (11GB VRAM) |
Camera Configuration¶
```python
import math

# Standardized camera settings
SENSOR_WIDTH = 50        # mm
FOCAL_LENGTH = 50        # mm
RESOLUTION = (224, 224)  # pixels

# Horizontal field of view: 2 * atan(w / 2f) = 2 * atan(0.5) ≈ 53.13°
FOV_DEG = 2 * math.degrees(math.atan(SENSOR_WIDTH / (2 * FOCAL_LENGTH)))
```
Object Placement Algorithm¶
```python
def place_objects(environment, objects, max_attempts=50):
    """
    Place objects with physics-aware terrain adaptation.
    Engine-side helpers (sample_center_point, sample_gaussian,
    raycast_terrain, check_aabb_overlap, spawn_object) are assumed
    to wrap the UE5 Python Editor API.
    """
    center = sample_center_point(environment)
    placed_objects = []
    for obj in objects:
        for _ in range(max_attempts):
            # Sample a candidate position around the scene center
            position = center + sample_gaussian(mean=0, std=MAX_DISTANCE)
            # Raycast to snap the object to ground level
            ground_z = raycast_terrain(position.x, position.y)
            position.z = ground_z + obj.bounding_box.height / 2
            # Reject and resample on bounding-box collision
            if check_aabb_overlap(obj, position, placed_objects):
                continue
            spawn_object(obj, position)
            placed_objects.append((obj, position))
            break
    return placed_objects
```
Dataset Statistics¶
| Environment | Ego Images | Allo Images | Total |
|---|---|---|---|
| Forest | 5,000 | 10,000 | 15,000 |
| Desert | 5,000 | 10,000 | 15,000 |
| Winter Town | 5,000 | 10,000 | 15,000 |
| Bridge | 5,000 | 10,000 | 15,000 |
| City | 5,000 | 10,000 | 15,000 |
| Total | 25,000 | 50,000 | 75,000 |
Dataset Size Rationale
- Egocentric: 5,000 images per environment are sufficient for generalization
- Allocentric: 10,000 images per environment are needed due to the added complexity of learning perspective transformations
Environment-Asset Relations¶
Each environment contains 3 unique object triples used for evaluation. The table below shows the complete mapping of environments to their source objects, target objects, and viewpoint configurations:
| Triple ID | Source Object | Target Object | Viewpoint |
|---|---|---|---|
| Bridge-1 | Truck | Tree | Camera / Human 1 |
| Bridge-2 | Bike | Trash Bin | Camera / Human 2 |
| Bridge-3 | Vespa | Trash Bin | Camera / Human 3 |
| City-1 | Vespa | Cone | Camera / Human 1 |
| City-2 | Taxi | Fire Hydrant | Camera / Human 2 |
| City-3 | Bike | Cone | Camera / Human 3 |
| Desert-1 | Truck | Rock | Camera / Human 1 |
| Desert-2 | Camel | Cactus | Camera / Human 2 |
| Desert-3 | Camel | Barrel | Camera / Human 3 |
| Forest-1 | Tree | Rock | Camera / Human 1 |
| Forest-2 | Bear | Tent | Camera / Human 2 |
| Forest-3 | Fox | Rock | Camera / Human 3 |
| Winter-1 | Truck | Tree | Camera / Human 1 |
| Winter-2 | Husky | Snowman | Camera / Human 2 |
| Winter-3 | Deer | Tree | Camera / Human 3 |
Viewpoint Configuration
- Camera: Used for egocentric (SpaRRTa-ego) task evaluation
- Human 1 / 2 / 3: Different human models used for allocentric (SpaRRTa-allo) task evaluation, each with unique poses and positions