SpatialViz-Bench
A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
News
- [2025.05.28]: We released SpatialViz-Bench, the first benchmark to evaluate spatial visualization for MLLMs.
- [2026.01.05]: EASI (Holistic Evaluation of Multimodal LLMs on Spatial Intelligence) integrated SpatialViz-Bench into their open-source evaluation platform. EASI on GitHub
- [2026.01.26]: SpatialViz-Bench has been accepted as a poster at ICLR 2026. 🎉
- [2026.02.02]: We released the code for generating data.
What is Spatial Visualization?
Spatial visualization is the ability to imagine and mentally manipulate visual images. Many real-world problems require going beyond processing visible information to mentally constructing and manipulating unseen structures, inferring hidden states and dynamics from partial cues; this is a setting where current MLLMs still struggle. Crucially, a model with strong spatial visualization skills can serve as an efficient internal world model, supporting fast, lightweight "what-if" simulations for predicting action outcomes, which is far cheaper than explicitly rendering future states with large diffusion-based video generation models.
We establish a cognitive framework that decomposes spatial visualization tasks into two phases:
- Observe visible cues (spatial perception).
- Infer unseen structure by alternating mental transformation (spatial visualization) and temporary storage (spatial memorization).
Challenges in Evaluating Spatial Visualization
Mis-categorization
Spatial visualization tasks are often buried under broader categories like mathematical or logical reasoning, appearing as multimodal puzzles or 3D geometry problems. This categorization obscures the evaluation of spatial visualization as a distinct capability and focuses on "solving" a problem rather than driving research toward core spatial abilities.
Contamination risk
Most examples are drawn from publicly available sources such as online IQ tests, administrative exams, and math contests, which risks overlap between training and evaluation data and undermines reliability. The modern paradigm of pretraining on vast, scraped internet data fundamentally challenges evaluation validity, a problem exacerbated by proprietary training datasets that make auditing for contamination impossible.
Sparse & heterogeneous formats
The scarcity of items per subskill amplifies the impact of random error, reducing the precision of evaluations. Additionally, the heterogeneous formats of the tasks make it difficult to distinguish true reasoning failures from misinterpretations, complicating the assessment of spatial reasoning skills.
Persistent difficulty
Even with potential pretraining exposure, performance remains poor. State-of-the-art systems score just 27.64 on 3D Geometry in MM-IQ and 26.00 on Descriptive Geometry in MathVision, reflecting the inherent difficulty of these tasks.
Therefore, we need a standardized, dynamically updatable benchmark designed explicitly around spatial visualization.
SpatialViz-Bench
SpatialViz-Bench is the first benchmark designed to formally evaluate the spatial visualization capabilities of MLLMs. It is organized around 4 core sub-abilities—mental rotation, mental folding, visual penetration, and mental animation—with 3 assessment tasks designed for each, totaling 12 tasks. Each task includes 2 to 3 difficulty levels, with each level containing 40 or 50 test cases, comprising 1,180 question-answer pairs in total, mostly with image-based options to focus on visual reasoning.
We build SpatialViz-Bench via programmatic generation (11/12 tasks) and expert design (1/12). For programmatic tasks, we integrate Python + FreeCAD to control difficulty through explicit cognitive-load parameters (e.g., number of transformation steps), while employing randomness to enhance diversity and generate distractor options with explanations for deep diagnostics.
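The generation idea above can be sketched in a few lines of Python. This is an illustrative mock-up, not the benchmark's actual pipeline: the function name, the rotation-sequence encoding, and the distractor scheme are all assumptions; only the principles come from the text (an explicit cognitive-load parameter controlling difficulty, seeded randomness for diversity, and distractors paired with explanations for diagnostics).

```python
import random

def generate_rotation_item(num_steps: int, seed: int) -> dict:
    """Hypothetical sketch of one programmatically generated item.
    Difficulty is controlled by an explicit cognitive-load parameter
    (the number of transformation steps)."""
    rng = random.Random(seed)  # seeded randomness -> reproducible diversity
    axes = ["x", "y", "z"]
    # Ground-truth transformation: a sequence of axis rotations.
    answer = [(rng.choice(axes), rng.choice([90, 180, 270]))
              for _ in range(num_steps)]
    # Distractors perturb one step of the correct sequence; each carries
    # an explanation to support fine-grained error diagnostics.
    distractors = []
    for _ in range(3):
        wrong = list(answer)
        step = rng.randrange(num_steps)
        axis, angle = wrong[step]
        wrong[step] = (axis, (angle + 90) % 360 or 90)  # always differs
        distractors.append({
            "sequence": wrong,
            "explanation": f"step {step + 1} uses the wrong rotation angle",
        })
    return {"difficulty": num_steps, "answer": answer,
            "distractors": distractors}

item = generate_rotation_item(num_steps=3, seed=0)
```

In the real pipeline, the sampled transformation would additionally be rendered into images via FreeCAD; this sketch stops at the symbolic item.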
Mechanical System is manually designed because physically consistent procedural generation is difficult. Drawing on representative public simulations as references, domain experts create each problem from scratch to probe dynamic motion propagation via visual simulation rather than caption recall or theoretical derivation.
Advantages
Cognitive Framework for Task Design
A hierarchical framework based on cognitive principles guides the creation of new tasks, ensuring that the tasks are well-structured and aligned with core spatial abilities.
Unified Input Format for Consistency
A standardized input format and templates are used to reduce confounding factors, enabling fine-grained error analysis and helping to diagnose specific areas where models struggle.
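A minimal sketch of what a unified item format might look like, assuming a hypothetical schema (the field names and the template below are illustrative, not the benchmark's actual format): every item carries the same fields and is rendered through one prompt template, so differences in model performance reflect reasoning rather than prompt interpretation.

```python
# Hypothetical unified item schema; actual field names may differ.
item = {
    "task": "mental_rotation",            # one of the 12 tasks
    "level": 1,                           # difficulty level
    "image": "images/rotation_001.png",   # illustrative path
    "question": "Which option shows the object after the rotations?",
    "options": ["A", "B", "C", "D"],      # image-based options by label
    "answer": "B",
    "explanation": "placeholder explanation for diagnostics",
}

def format_prompt(item: dict) -> str:
    """Render every item with the same template so fine-grained error
    analysis compares models on reasoning, not on prompt parsing."""
    opts = ", ".join(item["options"])
    return (f"{item['question']}\n"
            f"Options: {opts}\n"
            f"Answer with a single letter.")
```

Keeping one template across all 12 tasks is what lets a wrong answer be attributed to a perception or transformation failure rather than to format confusion.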
Dynamic Test Bank for Evaluation Validity
Programmatic generation, together with the vast pool of public simulations available for expert-driven question design, supports a dynamically updated test bank that proactively mitigates data contamination.
Examples
We present examples of varying difficulty levels for all tasks, with each sample containing an image, question, options, answer, and explanation.
Leaderboard
Overall accuracy on SpatialViz-Bench (from paper Table 2). Use "Metric" to switch between Direct / CoT / Best.
| # | Model | Source | Direct | CoT | Best |
|---|-------|--------|--------|-----|------|
Notes: some closed-source models are reported with a single overall entry in Table 2; missing cells show "-".
Main Results
- Tasks in SpatialViz-Bench are vision-dependent and reasoning-intensive.
- Core 3D visualization tasks reveal common model failures.
- Difficulty collapse is visible only in top-tier models.
- CoT benefits high-performing closed-source MLLMs but often degrades open-source counterparts.
Error Analysis
- Perceptual and Spatial Transformation Errors Dominate Failures
- Model Scaling Fails to Resolve Core Spatial Deficits
- Deficiencies Found in Both Perception and Visualization
- Pre-training Biases Drive Non-Simulative Problem Solving
Citation
If you use SpatialViz-Bench in your research, please cite:
@misc{wang2026spatialvizbenchcognitivelygroundedbenchmarkdiagnosing,
title={SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs},
author={Siting Wang and Minnan Pei and Luoyang Sun and Cheng Deng and Kun Shao and Zheng Tian and Haifeng Zhang and Jun Wang},
year={2026},
eprint={2507.07610},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.07610},
}