Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

1University of Cambridge 2Pioneer Center for AI, University of Copenhagen 3University of California San Diego 4Institute of Automation, CAS 5Peking University 6Hong Kong University of Science and Technology
*Equal Contribution †Corresponding Author

Abstract

Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning through video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation, which demands sequential discrete planning with low visual change, and Tangram Puzzle, which demands continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without distribution-specific fine-tuning. (2) Visual Context: The model effectively uses visual context, such as agent icons and tangram shapes, as explicit control, enabling it to maintain high visual consistency and to adapt its planning robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning: increasing the generated video length (the visual inference budget) improves zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool but a scalable, generalizable paradigm for visual reasoning.

Teaser Overview Image

Video generation models act as visual reasoners, empowered by (1) enriched visual context for improved geometric control and (2) visual test-time scaling, which allocates a larger inference-frame budget and enables stronger performance on long-horizon, complex sequential planning tasks; together they demonstrate robust generalization across diverse scenarios.
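To make the paradigm concrete, here is a minimal sketch of what the maze inference loop could look like, assuming two hypothetical adapters, generate_video (wrapping the video model) and locate_agent (a per-frame agent detector); the names and signatures are illustrative, not the released implementation.

# Illustrative sketch only: generate_video and locate_agent are hypothetical
# adapters around a video generation model and a per-frame agent detector;
# they are not the project's released API.

from typing import Callable, List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col) grid cell occupied by the agent icon


def frames_to_path(frames: Sequence, locate_agent: Callable[[object], Cell]) -> List[Cell]:
    """Collapse per-frame agent positions into a discrete maze path by
    dropping consecutive duplicates (several frames map to one grid step)."""
    path: List[Cell] = []
    for frame in frames:
        cell = locate_agent(frame)
        if not path or cell != path[-1]:
            path.append(cell)
    return path


def solve_maze(
    generate_video: Callable[..., Sequence],   # (image, prompt, num_frames) -> frames
    locate_agent: Callable[[object], Cell],
    initial_frame,
    instruction: str,
    num_frames: int = 81,                      # the training-time frame budget
) -> List[Cell]:
    """Treat generated frames as intermediate reasoning steps: condition on the
    initial maze image plus a text instruction, then decode the plan."""
    frames = generate_video(image=initial_frame, prompt=instruction, num_frames=num_frames)
    return frames_to_path(frames, locate_agent)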

Visual Test-Time Scaling

(1) Number of Inference Frames vs. Task Performance

Test Time Scaling - Total Frames

We find that increasing the total frame count (e.g., from 81 to 101 or 121 frames) improves navigation performance on both spatially (maze size) and temporally (path length) OOD tasks. OOD performance increases steadily as we scale the inference budget from 61 to 121 frames. However, we observe a ceiling effect: when scaling to 141 frames on temporally OOD cases, performance drops relative to 121 frames, though it remains above the 81-frame training baseline. We attribute this drop to the architectural limits of the video generation model's positional embeddings, which struggle to extrapolate when the frame count deviates significantly from the training distribution.
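As a rough illustration of how such a sweep over the visual inference budget could be scripted, the snippet below reuses the hypothetical solve_maze adapter from the earlier sketch and reports exact-match accuracy per frame count; the frame counts mirror the plot, while the evaluation loop itself is an assumption.

# Sketch of a test-time scaling sweep over the total frame budget.
# Reuses the hypothetical solve_maze adapter above; the evaluation-set
# format and the metric bookkeeping are assumptions for illustration.

def sweep_frame_budget(generate_video, locate_agent, eval_set,
                       frame_budgets=(61, 81, 101, 121, 141)):
    """Return exact-match accuracy for each total frame count.

    eval_set: sequence of (initial_frame, instruction, gold_path) triples.
    """
    results = {}
    for num_frames in frame_budgets:
        correct, total = 0, 0
        for initial_frame, instruction, gold_path in eval_set:
            pred_path = solve_maze(generate_video, locate_agent,
                                   initial_frame, instruction,
                                   num_frames=num_frames)
            correct += int(pred_path == list(gold_path))
            total += 1
        results[num_frames] = correct / max(total, 1)
    return results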

(2) Scaling Factor per Step vs. Task Performance

Test Time Scaling - Scaling Factors

To rigorously probe whether these gains stem from finer-grained reasoning or simply from longer video duration, we introduce a control variable κ (scaling factor), defined as the number of frames allocated per discrete step in the maze solution. We test κ ∈ {5, 7, 9, 11} across spatially and temporally OOD settings. As shown in the figure above, we observe a clear positive correlation: assigning more frames per step (κ = 7, 9, 11) significantly improves performance on spatially ID and OOD settings compared to a coarser allocation (κ = 5). Notably, in temporally OOD settings, performance peaks at κ = 9 before degrading at κ = 11. This degradation aligns with the positional-embedding limitation noted above, as κ = 11 pushes the total video length to roughly 200 frames. Crucially, the drop does not appear in the spatially OOD setting, where the total path length remains in-distribution, confirming that the degradation is a matter of sequence-length capacity rather than of logical planning.
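A minimal sketch of how the per-step budget could translate into a requested frame count, under the assumed accounting of κ frames per solution step plus one conditioning frame; the paper's exact frame accounting is not restated here, so treat the formula as an assumption.

# Sketch of allocating the inference budget via the per-step scaling factor
# kappa. The accounting below (kappa frames per solution step, plus one
# initial conditioning frame) is an assumption used only for illustration.

def frame_budget(num_solution_steps: int, kappa: int) -> int:
    """Total frames to request when giving the model `kappa` frames of
    'visual thinking' per discrete maze step."""
    return kappa * num_solution_steps + 1


if __name__ == "__main__":
    # e.g. an 18-step temporally OOD path: kappa=11 already pushes the request
    # to ~200 frames, well beyond the 81-frame training length.
    for kappa in (5, 7, 9, 11):
        print(kappa, frame_budget(18, kappa))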

Qualitative Examples of Video Reasoning

Maze Navigation: 4x4, 5x5, 6x6, 6x6 (Variant), 6x6 OOD, 7x7 OOD, 7x7 OOD (Variant), 8x8 OOD
Tangram Puzzle: Fade (142), Fade (172), Rotation (142), Rotation (172), Translation (142), Translation (172)

Quantitative Results

Maze Navigation Quantitative Results

EM = Exact Match, PR = Progress Rate. Proprietary models are evaluated zero-shot; open-sourced models are fine-tuned. Cyan marks the visual reasoning system.

Model | Input → Output | In Dist. 3x3–6x6 | OOD Size 7x7 | OOD Size 8x8 | OOD Length 5x5 (Long) | OOD Length 6x6 (Long) | OOD Both 7x7 (Long) | OOD Both 8x8 (Long)
(each cell reports EM / PR)

Proprietary Models
GPT-5.1 | Text+Image → Text | 10.6 / 10.7 | 6.32 / 6.72 | 6.00 / 6.00 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0
GPT-5.2 | Text+Image → Text | 12.5 / 12.5 | 8.40 / 8.40 | 8.40 / 8.40 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0

Open-Sourced Models (All Fine-Tuned)
Qwen3-VL-8B | Text+Image → Text | 58.3 / 68.6 | 20.0 / 37.3 | 19.2 / 34.3 | 0 / 13.3 | 0 / 13.2 | 0 / 11.3 | 0 / 8.9
  - w/ coordinates | Text+Image → Text | 72.0 / 77.3 | 33.2 / 45.0 | 22.0 / 30.5 | 0 / 17.1 | 0 / 13.4 | 0 / 8.1 | 0 / 5.9
VPRL-7B* | Image → Image | 73.5 / 78.6 | 14.0 / 25.2 | 4.00 / 6.20 | 0 / 11.0 | 2.00 / 16.7 | 0 / 4.10 | 0 / 0.70
Wan2.2-TI2V-5B | Text+Image → Video | 96.0 / 99.0 | 90.0 / 92.3 | 80.0 / 83.6 | 44.0 / 55.2 | 42.0 / 51.6 | 40.0 / 51.1 | 32.0 / 47.1
  - Unseen Visual Icons | Text+Image → Video | 95.5 / 98.2 | 92.0 / 92.6 | 78.0 / 81.6 | 36.0 / 46.3 | 42.0 / 52.0 | 38.0 / 47.9 | 32.0 / 42.3
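The EM and PR numbers above could be computed along the following lines, assuming EM checks whether the decoded path reproduces the gold path exactly and PR credits the longest correct prefix; the paper's precise PR definition may differ, so this is a sketch rather than the official scorer.

# Assumed definitions for illustration: EM = exact path match,
# PR = fraction of the gold path completed before the first deviation.

from typing import Sequence, Tuple

Cell = Tuple[int, int]


def exact_match(pred: Sequence[Cell], gold: Sequence[Cell]) -> float:
    """1.0 iff the decoded path reproduces the gold path step for step."""
    return float(list(pred) == list(gold))


def progress_rate(pred: Sequence[Cell], gold: Sequence[Cell]) -> float:
    """Length of the matching prefix, normalised by the gold path length."""
    matched = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        matched += 1
    return matched / max(len(gold), 1)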

Tangram Quantitative Results

GC = Goal Completion, BA = Boundary Adherence. Models are tested on puzzle patterns seen during training (learnability) and on unseen patterns (generalizability).

Model | Input → Output | Seen (Learnability): Strict GC / Progress GC / BA | Unseen (Generalizability): Strict GC / Progress GC / BA

Fade-In
Qwen-Image-Edit-20B | Text+Image → Image | 31.0 / 82.3 / 99.8 | 32.0 / 81.3 / 99.7
Wan2.2-TI2V-5B | Text+Image → Video | 0.80 / 49.4 / 98.1 | 0.80 / 48.9 / 98.0

Rotation
Qwen3-VL-8B | Text+Image → Text | 14.4 / 69.7 / 89.5 | 1.6 / 52.1 / 80.8
Nano Banana | Text+Image → Image | - / - / - | 9.80 / 43.4 / 64.7
Qwen-Image-Edit-20B | Text+Image → Image | 45.2 / 87.5 / 99.7 | 43.2 / 85.7 / 99.6
Wan2.2-TI2V-5B | Text+Image → Video | 22.4 / 76.8 / 98.1 | 22.4 / 74.5 / 98.0

Translation
Qwen3-VL-8B | Text+Image → Text | 28.0 / 75.7 / 91.4 | 1.60 / 58.9 / 82.4
Nano Banana | Text+Image → Image | - / - / - | 3.90 / 51.3 / 74.5
Qwen-Image-Edit-20B | Text+Image → Image | 85.7 / 97.7 / 99.9 | 76.0 / 95.4 / 99.7
Wan2.2-TI2V-5B | Text+Image → Video | 68.0 / 94.7 / 97.0 | 60.8 / 92.0 / 97.0
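The tangram metrics could be instantiated roughly as below, assuming per-piece boolean judgements (correctly placed within some pose tolerance, and staying inside the puzzle boundary); the paper's precise, likely mask-level, definitions are not reproduced here.

# Assumed per-piece scoring for illustration: strict GC requires every piece
# to be correctly placed, progress GC gives partial credit per piece, and BA
# measures how many pieces stay inside the puzzle boundary.

from typing import Sequence


def strict_goal_completion(piece_correct: Sequence[bool]) -> float:
    """1.0 only if every piece ends up in its target pose."""
    return float(all(piece_correct))


def progress_goal_completion(piece_correct: Sequence[bool]) -> float:
    """Fraction of pieces correctly placed."""
    return sum(piece_correct) / max(len(piece_correct), 1)


def boundary_adherence(piece_inside: Sequence[bool]) -> float:
    """Fraction of pieces that remain inside the puzzle boundary."""
    return sum(piece_inside) / max(len(piece_inside), 1)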

More Qualitative Examples

BibTeX

@article{li2026thinkingframes,
  title={Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning},
  author={Chengzu Li and Zanyi Wang and Jiaang Li and Yi Xu and Han Zhou and Huanyu Zhang and Ruichuan An and Dengyang Jiang and Zhaochong An and Ivan Vulić and Serge Belongie and Anna Korhonen},
  journal={arXiv preprint arXiv:2601.21037},
  year={2026}
}