SpatialViz-Bench
A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
News
- [2025.05.28]: We released SpatialViz-Bench, the first benchmark to evaluate spatial visualization for MLLMs.
- [2026.01.05]: EASI (Holistic Evaluation of Multimodal LLMs on Spatial Intelligence) integrated SpatialViz-Bench into their open-source evaluation platform. EASI on GitHub
- [2026.01.26]: SpatialViz-Bench has been accepted as a poster at ICLR 2026. 🎉
- [2026.02.02]: We released the code for generating data.
What is Spatial Visualization?
Spatial visualization is the ability to imagine and mentally manipulate visual images. Many real-world problems require going beyond processing visible information to mentally constructing and manipulating unseen structures, inferring hidden states and dynamics from partial cues; this is a setting where current MLLMs still struggle. Crucially, a model with strong spatial visualization skills can serve as an efficient internal world model, supporting fast, lightweight "what-if" simulations for predicting action outcomes, which is far cheaper than explicitly rendering future states with large diffusion-based video generation models.
We establish a cognitive framework that decomposes spatial visualization tasks into two phases:
- Observe visible cues (spatial perception).
- Infer unseen structure by alternating mental transformation (spatial visualization) and temporary storage (spatial memorization).
Challenges in Evaluating Spatial Visualization
Mis-categorization
Spatial visualization tasks are often buried under broader categories like mathematical or logical reasoning, appearing as multimodal puzzles or 3D geometry problems. This categorization obscures the evaluation of spatial visualization as a distinct capability and focuses on "solving" a problem rather than driving research toward core spatial abilities.
Contamination risk
Most examples are drawn from publicly available sources such as online IQ tests, administrative exams, and math contests, which risks overlap between training and evaluation data and undermines reliability. The modern paradigm of pretraining on vast, scraped internet data fundamentally challenges evaluation validity, a problem exacerbated by proprietary training datasets that make auditing for contamination impossible.
Sparse & heterogeneous formats
The scarcity of items per subskill amplifies the impact of random error, reducing the precision of evaluations. Additionally, the heterogeneous formats of the tasks make it difficult to distinguish true reasoning failures from misinterpretations, complicating the assessment of spatial reasoning skills.
Persistent difficulty
Even with potential pretraining exposure, performance remains poor. State-of-the-art systems score just 27.64 on 3D Geometry in MM-IQ and 26.00 on Descriptive Geometry in MathVision, reflecting the inherent difficulty of these tasks.
Therefore, we need a standardized, dynamically updatable benchmark designed explicitly around spatial visualization.
SpatialViz-Bench
SpatialViz-Bench is the first benchmark designed to formally evaluate the spatial visualization capabilities of MLLMs. It is organized around 4 core sub-abilities—mental rotation, mental folding, visual penetration, and mental animation—with 3 assessment tasks designed for each, totaling 12 tasks. Each task includes 2 to 3 difficulty levels, with each level containing 40 or 50 test cases, comprising 1,180 question-answer pairs in total, mostly with image-based options to focus on visual reasoning.
We build SpatialViz-Bench via programmatic generation (11/12 tasks) and expert design (1/12). For programmatic tasks, we integrate Python + FreeCAD to control difficulty through explicit cognitive-load parameters (e.g., number of transformation steps), while employing randomness to enhance diversity and generate distractor options with explanations for deep diagnostics.
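The generation idea above can be sketched in a few lines of Python. This is an illustrative mock-up, not the benchmark's actual pipeline: the function name, the rotation-sequence encoding, and the distractor scheme are all assumptions; only the principles come from the text (an explicit cognitive-load parameter controlling difficulty, seeded randomness for diversity, and distractors paired with explanations for diagnostics).

```python
import random

def generate_rotation_item(num_steps: int, seed: int) -> dict:
    """Hypothetical sketch of one programmatically generated item.
    Difficulty is controlled by an explicit cognitive-load parameter
    (the number of transformation steps)."""
    rng = random.Random(seed)  # seeded randomness -> reproducible diversity
    axes = ["x", "y", "z"]
    # Ground-truth transformation: a sequence of axis rotations.
    answer = [(rng.choice(axes), rng.choice([90, 180, 270]))
              for _ in range(num_steps)]
    # Distractors perturb one step of the correct sequence; each carries
    # an explanation to support fine-grained error diagnostics.
    distractors = []
    for _ in range(3):
        wrong = list(answer)
        step = rng.randrange(num_steps)
        axis, angle = wrong[step]
        wrong[step] = (axis, (angle + 90) % 360 or 90)  # always differs
        distractors.append({
            "sequence": wrong,
            "explanation": f"step {step + 1} uses the wrong rotation angle",
        })
    return {"difficulty": num_steps, "answer": answer,
            "distractors": distractors}

item = generate_rotation_item(num_steps=3, seed=0)
```

In the real pipeline, the sampled transformation would additionally be rendered into images via FreeCAD; this sketch stops at the symbolic item.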
Mechanical System is manually designed because physically consistent procedural generation is difficult. Drawing on representative public simulations as references, domain experts create each problem from scratch to probe dynamic motion propagation via visual simulation rather than caption recall or theoretical derivation.
Advantages
Cognitive Framework for Task Design
A hierarchical framework based on cognitive principles guides the creation of new tasks, ensuring that the tasks are well-structured and aligned with core spatial abilities.
Unified Input Format for Consistency
A standardized input format and templates are used to reduce confounding factors, enabling fine-grained error analysis and helping to diagnose specific areas where models struggle.
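A minimal sketch of what a unified item format might look like, assuming a hypothetical schema (the field names and the template below are illustrative, not the benchmark's actual format): every item carries the same fields and is rendered through one prompt template, so differences in model performance reflect reasoning rather than prompt interpretation.

```python
# Hypothetical unified item schema; actual field names may differ.
item = {
    "task": "mental_rotation",            # one of the 12 tasks
    "level": 1,                           # difficulty level
    "image": "images/rotation_001.png",   # illustrative path
    "question": "Which option shows the object after the rotations?",
    "options": ["A", "B", "C", "D"],      # image-based options by label
    "answer": "B",
    "explanation": "placeholder explanation for diagnostics",
}

def format_prompt(item: dict) -> str:
    """Render every item with the same template so fine-grained error
    analysis compares models on reasoning, not on prompt parsing."""
    opts = ", ".join(item["options"])
    return (f"{item['question']}\n"
            f"Options: {opts}\n"
            f"Answer with a single letter.")
```

Keeping one template across all 12 tasks is what lets a wrong answer be attributed to a perception or transformation failure rather than to format confusion.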
Dynamic Test Bank for Evaluation Validity
Programmatic generation, together with the vast pool of public simulations available for expert-driven question design, supports a dynamically updated test bank that proactively mitigates data contamination.
Examples
We present examples of varying difficulty levels for all tasks, with each sample containing an image, question, options, answer, and explanation.
Leaderboard
Overall accuracy on SpatialViz-Bench (from paper Table 2). Use "Metric" to switch between Direct / CoT / Best.
| # | Model | Source | Direct | CoT | Best |
|---|-------|--------|--------|-----|------|
Notes: some closed-source models are reported with a single overall entry in Table 2; missing cells show "-".
Main Results
- Tasks in SpatialViz-Bench are vision-dependent and reasoning-intensive.
- Core 3D visualization tasks reveal common model failures.
- Difficulty collapse is visible only in top-tier models.
- CoT benefits high-performing closed-source MLLMs but often degrades open-source counterparts.
Error Analysis
- Perceptual and Spatial Transformation Errors Dominate Failures
- Model Scaling Fails to Resolve Core Spatial Deficits
- Deficiencies Found in Both Perception and Visualization
- Pre-training Biases Drive Non-Simulative Problem Solving
Citation
If you use SpatialViz-Bench in your research, please cite:
@misc{wang2026spatialvizbenchcognitivelygroundedbenchmarkdiagnosing,
title={SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs},
author={Siting Wang and Minnan Pei and Luoyang Sun and Cheng Deng and Kun Shao and Zheng Tian and Haifeng Zhang and Jun Wang},
year={2026},
eprint={2507.07610},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.07610},
}