Microsoft Releases Phi-4-reasoning-vision-15B, an Open-Weight Multimodal Reasoning Model
Microsoft has released Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight multimodal reasoning model that pushes the efficiency frontier: it is competitive with models requiring roughly 10x more compute while running on modest hardware.
The model can handle image captioning, visual question answering, document reading, homework help, and understanding computer and mobile screens. It particularly excels at math and science reasoning.
"We have competitive performance to much slower models that require ten times or more compute-time and tokens," Microsoft noted, "and better accuracy than similarly fast models, particularly when it comes to math and science reasoning."
The key insight: training efficiency. Phi-4-reasoning-vision-15B was trained on just 200 billion tokens of multimodal data—compared to over 1 trillion tokens used by competitors like Qwen 2.5 VL, Kimi-VL, and Gemma 3.
The team chose a mid-fusion architecture (projecting vision tokens into a pretrained LLM's embedding space) over early fusion for practical efficiency. For vision encoding, they went with SigLIP-2's dynamic-resolution variant, which outperformed multi-crop approaches on high-resolution benchmarks.
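The projection step at the heart of a mid-fusion design can be illustrated with a minimal NumPy sketch. The dimensions and the single linear projector below are hypothetical, chosen for illustration; the article does not disclose Phi-4-reasoning-vision-15B's actual projector design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
VISION_DIM = 1152   # width of a SigLIP-style encoder's output tokens
LLM_DIM = 5120      # hidden size of the pretrained LLM

# A learned linear projector mapping vision tokens into the LLM's
# embedding space (in practice this is trained, often as a small MLP).
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02

def project_vision_tokens(vision_tokens: np.ndarray) -> np.ndarray:
    """Map (num_tokens, VISION_DIM) encoder outputs to (num_tokens, LLM_DIM)."""
    return vision_tokens @ W

# Example: 196 vision tokens from the encoder, projected into the LLM's
# space and prepended to 32 text-token embeddings for joint decoding.
vision_tokens = rng.standard_normal((196, VISION_DIM))
text_embeddings = rng.standard_normal((32, LLM_DIM))

projected = project_vision_tokens(vision_tokens)
fused = np.concatenate([projected, text_embeddings], axis=0)
print(fused.shape)  # (228, 5120)
```

The appeal of this approach is that the LLM's weights start from a strong pretrained checkpoint and only the projector (and optionally the encoder) must learn the alignment, which is part of why a comparatively small multimodal token budget can suffice.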
Data quality mattered more than quantity. The team manually reviewed samples, re-generated responses using GPT-4o and o4-mini for incorrect answers, and repurposed high-quality images as seeds for new training data.
The model is available on Microsoft Foundry, HuggingFace, and GitHub.
Sources
- microsoft.com — Microsoft Research Blog
- huggingface.co — HuggingFace
- github.com — GitHub