Date: 2026-03-05

Microsoft has released a new multimodal reasoning model, Phi-4-reasoning-vision-15B. The model combines two existing components using a mid-fusion approach and can analyze images, scientific graphs, and screen interfaces. Despite its relatively small size, it outperforms comparable models on mathematical and scientific benchmarks.

The model builds on two existing components: SigLIP-2, a vision encoder, and Phi-4 Reasoning, a reasoning model that Microsoft released as open source last year. SigLIP-2 converts images into numerical representations that neural networks can process.

The two components are combined using a technique called mid-fusion. Unlike models in which every layer supports multimodal processing, only some of the layers in Phi-4-reasoning-vision-15B do so. This reduces hardware requirements at the expense of some output quality.
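
One way to picture mid-fusion is the toy sketch below. It is not Microsoft's implementation: the function names, the arithmetic inside each "layer", and the projector are all assumptions made for illustration. It only shows the data flow, in which image-only layers run first and a smaller stack of shared layers then processes image and text tokens together.

```python
# Toy illustration only: tiny pure-Python "layers", not the real model.
from typing import List

Vector = List[float]


def vision_encoder(patches: List[Vector]) -> List[Vector]:
    """Stand-in for an image encoder such as SigLIP-2 (image-only processing)."""
    return [[2.0 * x for x in patch] for patch in patches]


def project(visual: List[Vector], width: int) -> List[Vector]:
    """Hypothetical projector: pad or truncate features to the text width."""
    return [(v + [0.0] * width)[:width] for v in visual]


def shared_layer(tokens: List[Vector]) -> List[Vector]:
    """Stand-in for a layer that sees image and text tokens together.

    Each token is shifted by the mean of all tokens, so information
    mixes across both modalities.
    """
    width = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(width)]
    return [[x + m for x, m in zip(t, mean)] for t in tokens]


def mid_fusion_forward(
    patches: List[Vector],
    text_embeddings: List[Vector],
    num_shared_layers: int = 2,
) -> List[Vector]:
    """Encode the image separately, then fuse it with text in shared layers."""
    width = len(text_embeddings[0])
    visual_tokens = project(vision_encoder(patches), width)
    sequence = visual_tokens + text_embeddings  # fusion point
    for _ in range(num_shared_layers):
        sequence = shared_layer(sequence)
    return sequence
```

In this sketch only the shared layers ever see both modalities, which is the property the article attributes to mid-fusion.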

Notably, the reasoning functionality can be switched on and off via prompts. Users who want to reduce the infrastructure load further can disable reasoning.

Training on open-source data and corrected captions

For training, Microsoft primarily used open-source data comprising images and text descriptions. The company went through a multi-step process to improve quality. High-quality datasets were set aside. Images with incorrect captions were given new descriptions, generated with GPT-4o and o4-mini. In addition, Microsoft added internally generated training data, data from targeted acquisitions, and examples of behavior the model should avoid.
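
The caption-correction step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Microsoft's pipeline: the `Sample` type, the `caption_ok` review flag, and the `recaption` callable (a stand-in for a call to a captioning model such as GPT-4o) are all hypothetical, and the sketch assumes that "set aside" means high-quality samples are kept unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    image_id: str
    caption: str
    caption_ok: bool  # hypothetical flag set by an earlier quality review


def clean_captions(
    samples: Iterable[Sample],
    recaption: Callable[[str], str],
) -> List[Sample]:
    """Keep good samples as they are and regenerate captions for the rest."""
    cleaned: List[Sample] = []
    for sample in samples:
        if sample.caption_ok:
            # High-quality sample: kept unchanged (assumed meaning of "set aside").
            cleaned.append(sample)
        else:
            # Incorrect caption: replace it with a newly generated description.
            new_caption = recaption(sample.image_id)
            cleaned.append(Sample(sample.image_id, new_caption, True))
    return cleaned
```

In practice `recaption` would wrap a model call. Passing it in as a parameter keeps the sketch independent of any particular API.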

On MathVista_Mini, a benchmark for multimodal mathematics, Phi-4-reasoning-vision-15B scored 17 percent higher than Google’s gemma-3-12b-it. The model also achieved higher scores on more than half a dozen other evaluations.

Deployable for AI agents and visual analysis

Developers can use the model to build AI agents that interact with applications via the user interface. Phi-4-reasoning-vision-15B can deduce the functions of interface elements, such as buttons and menus, from screenshots. In addition, the model is suitable for analyzing complex visual files.

Microsoft has made the code available via Hugging Face, GitHub, and Azure.