
Microsoft launches Phi models optimized for multimodal processing

Microsoft is expanding its Phi line of open-source language models with two new models optimized for multimodal processing and hardware efficiency.

This is according to a report from SiliconANGLE. The first addition, Phi-4-mini, is text-only. The second, Phi-4-multimodal, is an enhanced version of Phi-4-mini that can also handle visual and audio input. Microsoft claims that both models significantly outperform comparable alternatives on certain tasks.

Phi-4-mini contains 3.8 billion parameters, making it compact enough to run on mobile devices. Like most large language models (LLMs), it is built on the transformer neural network architecture.

A standard transformer model analyzes the text both before and after a word to understand its meaning. According to Microsoft, Phi-4-mini is based on a variant of this architecture called a decoder-only transformer, which takes a different approach: it analyzes only the text preceding a word to determine its meaning. This reduces hardware usage and speeds up processing.
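To make the distinction concrete, the sketch below shows causal masking, the mechanism that restricts a decoder-only model to the tokens that come before the current position. It is a generic illustration with made-up tensor sizes and function names, not Microsoft's implementation:

```python
import torch

def causal_attention_weights(q, k):
    """Attention weights where each token attends only to itself and
    earlier tokens, as in a decoder-only transformer."""
    seq_len, head_dim = q.shape
    scores = q @ k.T / head_dim ** 0.5                     # (seq_len, seq_len)
    # Mask out future positions: token i must not see tokens j > i.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1)

q = torch.randn(4, 8)   # 4 tokens, 8-dimensional queries (illustrative sizes)
k = torch.randn(4, 8)
print(causal_attention_weights(q, k))  # upper triangle is zero: no look-ahead
```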

Phi-4-mini also uses a second performance optimization technique called grouped query attention (GQA), which lowers the hardware usage of the algorithm’s attention mechanism. The attention mechanism is the component that helps a language model determine which data points are most relevant to a given processing task.
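In GQA, several query heads share a single key/value head instead of each having its own, which shrinks the tensors (and the cache) the attention mechanism has to keep in memory. A minimal sketch of the idea, with invented shapes and not tied to Phi-4's actual code:

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal grouped-query attention sketch (illustrative only).

    q:    (num_q_heads,  seq_len, head_dim)
    k, v: (num_kv_heads, seq_len, head_dim), with num_kv_heads < num_q_heads
    """
    num_q_heads, _, head_dim = q.shape
    group_size = num_q_heads // k.shape[0]
    # Repeat each K/V head so it serves a whole group of query heads.
    k = k.repeat_interleave(group_size, dim=0)
    v = v.repeat_interleave(group_size, dim=0)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 query heads share 2 key/value heads (group size 4).
q = torch.randn(8, 16, 64)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([8, 16, 64])
```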

Strong in complex reasoning

Phi-4-mini can generate text, translate existing documents and perform actions in external applications. According to Microsoft, it particularly excels at mathematical and programming tasks that require complex reasoning. The company concluded from a series of internal benchmark tests that Phi-4-mini performs such tasks with significantly better accuracy than several language models of similar size.

Microsoft’s second new model, Phi-4-multimodal, is an improved version of Phi-4-mini with 5.6 billion parameters. This model can process not only text but also images, audio, and video. Microsoft trained it using a new technique called Mixture of LoRAs.

Adapting an AI model to a new task usually requires adjusting the internal weights that determine how it processes data. This process can be expensive and time-consuming, so researchers often use an alternative approach called LoRA (Low-Rank Adaptation). LoRA lets a model learn a new task by adding a small set of new, task-optimized weights while leaving the original weights untouched.
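In code, the idea looks roughly like the following: the pretrained weights are frozen and only a pair of small low-rank matrices is trained on top of them. This is a minimal sketch with a generic linear layer; the class name and rank value are illustrative, not Phi-4 specifics:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: the original weights stay frozen and
    a small low-rank update (B @ A) is learned for the new task."""

    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        self.base.bias.requires_grad_(False)
        # Only these two small matrices are trained for the new task.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        # B starts at zero, so the adapter initially changes nothing.
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank correction.
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(512, 512, rank=8)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```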

Microsoft’s Mixture of LoRAs method applies the same concept to multimodal processing. To create Phi-4-multimodal, the company extended Phi-4-mini with LoRA-style components optimized for processing audio and visual data. According to Microsoft, this technique avoids some of the drawbacks of other approaches to building multimodal models.
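Microsoft does not spell out the implementation in this report, but the general idea can be sketched as a frozen, shared backbone layer with one small low-rank adapter per modality, selected according to the type of input. Everything below (class name, adapter layout, modality keys) is a hypothetical illustration rather than Microsoft's actual code:

```python
import torch
import torch.nn as nn

class MixtureOfLoRAs(nn.Module):
    """Rough sketch of the mixture-of-LoRAs idea: a frozen base layer
    shared by all modalities plus a separate low-rank adapter per
    modality, applied depending on the input type (hypothetical)."""

    def __init__(self, dim, rank=8, modalities=("text", "vision", "audio")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)  # shared language backbone stays frozen
        self.adapters = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, rank, bias=False),
                             nn.Linear(rank, dim, bias=False))
            for m in modalities
        })

    def forward(self, x, modality):
        # Route the input through the adapter matching its modality.
        return self.base(x) + self.adapters[modality](x)

layer = MixtureOfLoRAs(dim=512)
text_out = layer(torch.randn(1, 512), modality="text")
image_out = layer(torch.randn(1, 512), modality="vision")
print(text_out.shape, image_out.shape)
```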

Tested on more than six benchmarks

Microsoft tested Phi-4-multimodal’s capabilities on more than six visual data processing benchmarks. The model achieved an average score of 72, just one point lower than OpenAI’s GPT-4. Google LLC’s Gemini 2.0 Flash, an advanced large language model launched in December, scored 74.3.

Phi-4-multimodal performed even better in a series of benchmark tests using both visual and audio input. According to Microsoft, the model outperformed Gemini 2.0 Flash “by a wide margin.” Phi-4-multimodal also outperformed InternOmni, an open-source LLM specifically designed for multimodal processing that has a higher parameter count.

Microsoft will make Phi-4-multimodal and Phi-4-mini available on Hugging Face under an MIT license, which allows commercial use.