
Should AI work with images rather than text?

Once again, DeepSeek suggests AI can be done much more efficiently

Last month, the Chinese team of researchers behind DeepSeek unveiled a new model for Optical Character Recognition (OCR). The real breakthrough behind it has gone relatively unnoticed by many. DeepSeek-OCR promises much more than a simple model launch suggests, namely the possibility of significantly more efficient AI models than previously imagined.

Expert reactions to the DeepSeek-OCR AI model have been positive. It may not be state-of-the-art and is explicitly intended as a proof of concept. However, OpenAI co-founder Andrej Karpathy argues that DeepSeek-OCR may help rid the AI world of a misconception. “Perhaps (…) all inputs to LLMs should always be images.” Why? Images may be significantly more efficient for LLMs to process than text.


Compression

The modern AI advance is characterized by an obsession with compression. Any way to shrink the data footprint yields gains in time, energy, and money. At the same time, there is a buying frenzy: so-called AI factories are being built and filled with AI chips at an astronomical scale, and still not fast enough. The assumption behind both trends is that, despite all attempts to reduce data, you ultimately have to build your AI infrastructure as large and ambitious as possible.

DeepSeek-OCR suggests that one way to reduce data is being overlooked. Visual information, long a neglected part of generative AI compared to textual use cases, fits much more efficiently into the context window, or short-term memory, of an LLM. The result is that you could feed an AI model not tens of thousands of words but dozens of full pages, and that the model could then perform better. In short, pixels appear to compress information for AI better than text does.
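
To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The page size and tokens-per-word figures are illustrative assumptions; only the roughly tenfold compression ratio comes from DeepSeek’s reported results.

```python
# Back-of-the-envelope comparison of text tokens vs. vision tokens.
# WORDS_PER_PAGE and TEXT_TOKENS_PER_WORD are rough assumptions;
# only the ~10x compression ratio comes from DeepSeek's reported results.

WORDS_PER_PAGE = 500        # assumed density of a full text page
TEXT_TOKENS_PER_WORD = 1.3  # common rule of thumb for English text
COMPRESSION_RATIO = 10      # the roughly tenfold compression reported

def text_tokens(pages: int) -> int:
    """Tokens consumed when pages are fed to the LLM as raw text."""
    return round(pages * WORDS_PER_PAGE * TEXT_TOKENS_PER_WORD)

def vision_tokens(pages: int) -> int:
    """Tokens consumed when the same pages are rendered to images first."""
    return round(text_tokens(pages) / COMPRESSION_RATIO)

for pages in (1, 10, 100):
    print(f"{pages:>3} pages: {text_tokens(pages):>6} text tokens "
          f"vs. ~{vision_tokens(pages):>5} vision tokens")
```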

A relatively small visual encoder with 380 million parameters is the engine behind DeepSeek-OCR. It converts visual information into a more efficient representation. In OCR applications, this visual information usually consists of text documents. The compressed information that the encoder extracts from these documents is then fed to a decoder of only 3 billion parameters, which performs its actual calculations by activating just 570 million of them. This decoder produces DeepSeek-OCR’s response to the initial input. At a tenfold compression of the data, the model achieves an accuracy of 97 percent.
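
Schematically, the two-stage pipeline looks something like the sketch below. The class names and toy method bodies are illustrative assumptions, not DeepSeek’s actual interfaces; only the parameter counts and the roughly tenfold compression come from the published figures.

```python
# Schematic sketch of the two-stage pipeline described above.
# Names and toy bodies are illustrative; only the parameter counts
# (~380M encoder, ~3B decoder with ~570M active) come from the article.

class VisionEncoder:
    """Stands in for the ~380M-parameter visual encoder."""

    def encode(self, page_image: bytes) -> list[int]:
        # Toy stand-in: collapse every 10 bytes into one "vision token",
        # mirroring the roughly tenfold compression the model reports.
        return list(page_image[::10])

class MoEDecoder:
    """Stands in for the ~3B-parameter decoder (~570M active per token)."""

    def decode(self, vision_tokens: list[int]) -> str:
        # Toy stand-in: the real decoder reconstructs the document text.
        return f"<text reconstructed from {len(vision_tokens)} vision tokens>"

def ocr(page_image: bytes) -> str:
    tokens = VisionEncoder().encode(page_image)  # image -> few vision tokens
    return MoEDecoder().decode(tokens)           # vision tokens -> text

print(ocr(b"..." * 400))  # 1200 input bytes -> 120 "vision tokens"
```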

Another world

China’s DeepSeek already caused a stir on the stock market with DeepSeek-R1 earlier this year. The model, freely downloadable for open-source use, proved remarkably capable for its parameter count (671 billion) and was by far the strongest open-source LLM at the time. Moreover, the training process, normally a far more expensive affair, was reportedly cheap by AI standards: less than €300,000.

Although OpenAI’s models remained leading in AI benchmarks at the time, it was clear that DeepSeek was efficiently approaching that level of performance. Some controversy surrounding the creation of R1 lingered because DeepSeek may have trained the model on countless outputs from ChatGPT or the OpenAI API. You could argue that R1, in a sense, mimicked—or even compressed—the capabilities of ChatGPT.

DeepSeek-OCR reinforces DeepSeek’s emerging role in the AI world: that of compression specialist for generative AI. Other AI players reap the benefits of this specialization because the research is openly available online. Unlike that of OpenAI, Meta, Google, Anthropic, and others, DeepSeek’s research is available to everyone. Some of these parties do publish models on an open-source basis, but Google, for example, does so selectively: Gemini 2.5 Pro is proprietary, while the far less capable Gemma 3 is not.

It is unclear exactly how other AI models work under the hood. Google may owe Gemini’s gigantic context windows to a similar compression of information, but this is by no means certain, and Google is not providing the answer. What is clear is that optimizations like this compression will eventually become commonplace. The same happened with Mixture-of-Experts, the technique whereby an AI model does not fully activate when prompted, but only switches on the components it needs. It does require special training and a smart ‘router’ that determines which components of the model should be activated.
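
For readers unfamiliar with the technique, that routing idea can be sketched in a few lines of Python. The expert count, dimensions, and top-2 selection below are illustrative assumptions, not the configuration of any specific DeepSeek or Google model.

```python
# Minimal sketch of Mixture-of-Experts routing: a learned "router"
# scores the experts and only the best few run for each token.
# All sizes here are illustrative toy values.

import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_EXPERTS, TOP_K = 16, 8, 2

# Toy "experts": each is a simple linear layer.
experts = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]
router_w = rng.normal(size=(D_MODEL, N_EXPERTS))  # router's scoring weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector x through only TOP_K of N_EXPERTS experts."""
    scores = x @ router_w                 # one score per expert
    top = np.argsort(scores)[-TOP_K:]     # indices of the best-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the chosen experts
    # Only the selected experts run; the other six stay inactive.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=D_MODEL)
print(moe_forward(token).shape)  # (16,) -- output keeps the model dimension
```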

Food for thought

DeepSeek-OCR itself is not a breakthrough for AI applications. Rather, the work behind it suggests that more efficient AI workloads are possible, and hints at how to achieve them. However, the dust has yet to settle, and some questions remain unanswered. For example, it is unknown whether LLMs should now automatically convert all inputs to images. Nor do we know whether DeepSeek’s approach is already in use at Google and OpenAI, among others. And DeepSeek-OCR has not caused the same shock on the stock market as R1 did.

The findings could advance AI in two ways. First, it is conceivable that LLMs will handle information from prompts more efficiently: by converting text into images and compressing this visual information, very little accuracy is lost. Second, much more data could become manageable for an AI model. Think of large amounts of company data, style guides, or compliance requirements. This could result in more detailed and accurate output than is currently possible.