
Microsoft launches Phi models optimized for multimodal processing

Microsoft is expanding its Phi line of open-source language models with two new models optimized for multimodal processing and hardware efficiency.

This is according to a report from SiliconANGLE. The first addition, Phi-4-mini, is text-only. The second, Phi-4-multimodal, is an enhanced version of Phi-4-mini that can also handle visual and audio input. Microsoft claims that both models significantly outperform comparable alternatives on certain tasks.

Phi-4-mini contains 3.8 billion parameters, making it compact enough to run on mobile devices. Like most large language models (LLMs), it is built on the transformer neural network architecture.

A standard transformer model analyzes the text both before and after a word to understand its meaning. According to Microsoft, Phi-4-mini is based on a variant of this architecture called a decoder-only transformer, which takes a different approach: it analyzes only the text preceding a word to determine its meaning. This reduces hardware usage and speeds up processing.
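To make the distinction concrete, the sketch below shows causal masking, the mechanism that restricts a decoder-only model to the tokens that come before the current position. It is a generic illustration with made-up tensor sizes and function names, not Microsoft's implementation:

```python
import torch

def causal_attention_weights(q, k):
    """Attention weights where each token attends only to itself and
    earlier tokens, as in a decoder-only transformer."""
    seq_len, head_dim = q.shape
    scores = q @ k.T / head_dim ** 0.5                     # (seq_len, seq_len)
    # Mask out future positions: token i must not see tokens j > i.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1)

q = torch.randn(4, 8)   # 4 tokens, 8-dimensional queries (illustrative sizes)
k = torch.randn(4, 8)
print(causal_attention_weights(q, k))  # upper triangle is zero: no look-ahead
```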

Phi-4-mini also uses a second performance optimization technique called grouped query attention (GQA), which lowers the hardware usage of the algorithm’s attention mechanism. The attention mechanism is the component that helps a language model determine which data points are most relevant to a given processing task.
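In GQA, several query heads share a single key/value head instead of each having its own, which shrinks the tensors (and the cache) the attention mechanism has to keep in memory. A minimal sketch of the idea, with invented shapes and not tied to Phi-4's actual code:

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal grouped-query attention sketch (illustrative only).

    q:    (num_q_heads,  seq_len, head_dim)
    k, v: (num_kv_heads, seq_len, head_dim), with num_kv_heads < num_q_heads
    """
    num_q_heads, _, head_dim = q.shape
    group_size = num_q_heads // k.shape[0]
    # Repeat each K/V head so it serves a whole group of query heads.
    k = k.repeat_interleave(group_size, dim=0)
    v = v.repeat_interleave(group_size, dim=0)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 query heads share 2 key/value heads (group size 4).
q = torch.randn(8, 16, 64)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([8, 16, 64])
```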

Strong in complex reasoning

Phi-4-mini can generate text, translate existing documents and perform actions in external applications. According to Microsoft, it particularly excels at mathematical and programming tasks that require complex reasoning. The company concluded from a series of internal benchmark tests that Phi-4-mini performs such tasks with significantly better accuracy than several language models of similar size.

Microsoft’s second new model, Phi-4-multimodal, is an improved version of Phi-4-mini with 5.6 billion parameters. This model can process not only text but also images, audio, and video. Microsoft trained it using a new technique called Mixture of LoRAs.

Adapting an AI model to a new task usually requires adjusting the internal weights that determine how it processes data. This process can be expensive and time-consuming, so researchers often use an alternative approach called LoRA (Low-Rank Adaptation). LoRA lets a model learn a new task by adding a small set of new, task-optimized weights while leaving the original weights untouched.
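In code, the idea looks roughly like the following: the pretrained weights are frozen and only a pair of small low-rank matrices is trained on top of them. This is a minimal sketch with a generic linear layer; the class name and rank value are illustrative, not Phi-4 specifics:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: the original weights stay frozen and
    a small low-rank update (B @ A) is learned for the new task."""

    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        self.base.bias.requires_grad_(False)
        # Only these two small matrices are trained for the new task.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        # B starts at zero, so the adapter initially changes nothing.
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank correction.
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(512, 512, rank=8)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```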

Microsoft’s Mixture of LoRAs method applies the same concept to multimodal processing. To create Phi-4-multimodal, the company extended Phi-4-mini with LoRA-style components optimized for processing audio and visual data. According to Microsoft, this technique avoids some of the drawbacks of other approaches to building multimodal models.
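Microsoft does not spell out the implementation in this report, but the general idea can be sketched as a frozen, shared backbone layer with one small low-rank adapter per modality, selected according to the type of input. Everything below (class name, adapter layout, modality keys) is a hypothetical illustration rather than Microsoft's actual code:

```python
import torch
import torch.nn as nn

class MixtureOfLoRAs(nn.Module):
    """Rough sketch of the mixture-of-LoRAs idea: a frozen base layer
    shared by all modalities plus a separate low-rank adapter per
    modality, applied depending on the input type (hypothetical)."""

    def __init__(self, dim, rank=8, modalities=("text", "vision", "audio")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)  # shared language backbone stays frozen
        self.adapters = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, rank, bias=False),
                             nn.Linear(rank, dim, bias=False))
            for m in modalities
        })

    def forward(self, x, modality):
        # Route the input through the adapter matching its modality.
        return self.base(x) + self.adapters[modality](x)

layer = MixtureOfLoRAs(dim=512)
text_out = layer(torch.randn(1, 512), modality="text")
image_out = layer(torch.randn(1, 512), modality="vision")
print(text_out.shape, image_out.shape)
```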

Tested on more than six benchmarks

Microsoft tested Phi-4-multimodal’s capabilities on more than six visual data processing benchmarks. The model achieved an average score of 72, just one point lower than OpenAI’s GPT-4. Google LLC’s Gemini 2.0 Flash, an advanced large language model launched in December, scored 74.3.

Phi-4-multimodal performed even better in a series of benchmark tests using both visual and audio input. According to Microsoft, the model outperformed Gemini 2.0 Flash “by a wide margin.” Phi-4-multimodal also outperformed InternOmni, an open-source LLM specifically designed for multimodal processing that has a higher parameter count.

Microsoft will make Phi-4-multimodal and Phi-4-mini available on Hugging Face under an MIT license, which allows commercial use.