DeepSeek Rolls Out Image Recognition Mode at Scale, Entering the Era of Image-Text Interaction

DeepSeek has long held a top-tier position in text generation, coding and logical reasoning, yet visual capability remained its most obvious gap. Real-world problems do not always arrive in written form: they may come as a photograph, a chart in an academic paper, a web page screenshot, or a real-world scene that demands spatial awareness and fine-grained visual understanding. With the large-scale launch of its image recognition mode, DeepSeek has filled in the missing piece of multimodal understanding, and the platform has officially stepped into the era of image-text interaction.

The whale opens its eyes: Closed beta in late April, full rollout in May

Around April 29th, many users noticed a new Image Recognition Mode entry on DeepSeek’s website and mobile app.

Positioned as a core flagship feature, the mode sits alongside Quick Mode and Expert Mode. Users selected for the gray-release test could upload images for description, analysis and in-depth understanding. Early testers shared real examples, including food packaging and concept smartphone designs; DeepSeek accurately identified brands, ingredient information and design features, and offered practical suggestions.

That night, Chen Xiaokang, head of DeepSeek’s multimodal team, posted on social media with the caption “Now, we see you”. The attached image showed DeepSeek’s iconic whale logo taking off its eye mask and opening its eyes. The post was widely regarded as official confirmation of the new multimodal capability.

By May 9th, DeepSeek had opened the feature to most users. A dedicated image recognition entry now appears above the input box, side by side with the Quick and Expert modes. The function is still labelled “Image Understanding in Internal Beta”, but it nonetheless marks DeepSeek’s official entry into the era of multimodal image-text interaction.

Technological breakthrough: Equipping AI with “fingertips” for precise targeting

DeepSeek’s image recognition goes far beyond ordinary OCR text extraction: it offers native multimodal visual understanding that perceives both what an image contains and the scene it depicts.

Alongside the feature launch, DeepSeek published a technical report on GitHub titled Thinking with Visual Primitives. The report argues that the poor performance of existing multimodal models in complex scenarios stems not from weak perception but from imprecise grounding of references to locations in the image. Natural language is inherently ambiguous, and in complex spatial layouts a purely textual description easily leads to misunderstanding. The team offers a simple analogy: counting scattered coins is error-prone for a human who cannot point at each coin in turn.
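
To make the analogy concrete, here is a minimal illustrative sketch in Python (an assumption for illustration, not DeepSeek's implementation): if a model emits an explicit pixel coordinate for every object it references, counting reduces to deduplicating nearby points, which is far less ambiguous than a free-text description such as "several coins near the left edge".

```python
# Illustrative sketch only: counting by explicit 2D points instead of by
# free-text description. Merging points that fall very close together
# mimics "pointing at each coin once" and avoids double counting.
from math import dist

def count_objects(points, min_separation=10.0):
    """Count points, merging any pair closer than min_separation pixels."""
    kept = []
    for p in points:
        if all(dist(p, q) >= min_separation for q in kept):
            kept.append(p)
    return len(kept), kept

# Hypothetical coin centres predicted for one image (pixel coordinates).
coin_centres = [(102, 88), (104, 90), (310, 215), (480, 77), (481, 79)]
count, unique = count_objects(coin_centres)
print(count)   # 3 distinct coins
print(unique)  # [(102, 88), (310, 215), (480, 77)]
```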

DeepSeek’s solution is to embed visual grounding directly in the reasoning chain rather than attaching point and box markers as an output suffix. The model “thinks” and “points” at the same time, mapping abstract language descriptions onto concrete spatial coordinates. Much as a person resolves ambiguity with a fingertip, this virtual pointing gives the model precise spatial perception.
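
The report does not publish a trace schema, so the following is a hypothetical sketch of what an interleaved "think-and-point" reasoning trace could look like; the Step structure, field names and coordinates below are assumptions for illustration only.

```python
# Hypothetical trace format (an assumption, not DeepSeek's actual schema):
# reasoning steps interleave free text with explicit point or box
# references, so a later step can resolve "the left button" to coordinates.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Step:
    kind: str                                  # "think", "point" or "box"
    text: str                                  # natural-language content of the step
    coords: Optional[Tuple[int, ...]] = None   # (x, y) point or (x1, y1, x2, y2) box

trace = [
    Step("think", "The screenshot shows a dialog with two buttons."),
    Step("box",   "dialog region", (120, 80, 520, 360)),
    Step("point", "left button ('Cancel')", (210, 330)),
    Step("point", "right button ('Confirm')", (430, 330)),
    Step("think", "The question concerns the confirming action, i.e. the right button."),
]

for step in trace:
    print(f"{step.kind:>5}: {step.text} {step.coords or ''}")
```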

Technical data shows the framework achieves a visual compression ratio of roughly 7,056:1 at the pixel level: a 756×756 image is reduced to just 81 visual key-value entries, whereas Claude Sonnet 4.6 reportedly needs around 870 entries for the same image size. On the Pixmo-Count counting benchmark, DeepSeek scored 89.2%, outperforming Gemini-3-Flash at 88.2% and well ahead of GPT-5.4 at 76.6%. It reached 66.9% in maze navigation, about 17 percentage points higher than GPT-5.4. Efficient visual compression combined with the “thinking with visual primitives” approach lets DeepSeek deliver top-tier multimodal performance at controlled computing cost.
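
As a quick sanity check, the compression ratio quoted above follows directly from the pixel and entry counts in this section; the short calculation below only restates those figures.

```python
# Back-of-the-envelope check of the numbers quoted in the article.
pixels = 756 * 756          # 571,536 pixels in a 756x756 image
entries = 81                # visual key-value entries after compression
print(pixels / entries)     # 7056.0 pixels per entry, i.e. a ~7,056:1 ratio
print(870 / entries)        # ~10.7x fewer entries than the ~870 cited for comparison
```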

Impressive real-world tests: Perception, comprehension and logical reasoning

After the gray-release launch of the image recognition mode, users ran a wide range of creative real-world tests that demonstrated its capability in everyday scenarios.

Some users uploaded street photos taken near their workplaces; DeepSeek recognised almost every building name correctly, relying on its own world knowledge without activating web search. Others submitted maze puzzles, and DeepSeek adopted a reverse-reasoning strategy, tracing paths backward from the endpoint and verifying the route four times before giving a final answer. For the first time, the model’s previously hidden visual reasoning process was laid out step by step for users to inspect.
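
For readers curious what "reverse reasoning" over a maze amounts to, here is a small illustrative sketch (not DeepSeek's internal procedure): a breadth-first search whose frontier grows from the endpoint and which reconstructs the route once it reaches the start.

```python
# Illustrative sketch only: solve a grid maze by searching backward from
# the goal, the "reverse reasoning" strategy described above.
# 0 = open cell, 1 = wall.
from collections import deque

def solve_backward(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    prev = {goal: None}                    # search frontier grows from the goal
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        if (r, c) == start:                # reached the start: rebuild the path
            path, cell = [], (r, c)
            while cell is not None:        # follow links back toward the goal
                path.append(cell)
                cell = prev[cell]
            return path                    # ordered from start to goal
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None                            # no route exists

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(solve_backward(maze, start=(0, 0), goal=(0, 2)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
```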

The model also shows strong webpage reconstruction ability: given a webpage screenshot, it can grasp the layout structure precisely and generate a working demo that closely mirrors the original page. Many users commented that DeepSeek no longer merely “sees” images but genuinely “thinks” visually, a capability that greatly shortens idea-validation cycles for designers and product managers.
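
For developers who want to try the screenshot-to-demo workflow programmatically, here is a hedged sketch that assumes the image understanding mode is, or will be, exposed through DeepSeek's OpenAI-compatible chat endpoint using the common image_url message format; the model name and the multimodal payload shape are assumptions, not documented behaviour.

```python
# Minimal sketch under stated assumptions: send a screenshot to an
# OpenAI-compatible chat endpoint and ask for an HTML reconstruction.
# The model identifier and image payload format are assumed, not documented.
import base64
import requests

def screenshot_to_html(path: str, api_key: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "deepseek-vision",          # hypothetical model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Reconstruct this page as one self-contained HTML file."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
    resp = requests.post(
        "https://api.deepseek.com/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```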

From text reading to visual perception: Multimodality becomes the standard

The large-scale rollout of DeepSeek’s image recognition mode reflects a clear shift in China’s large model competition: industry focus is moving from pure text generation to comprehension of information across modalities. The leap from reading text to perceiving images greatly broadens the boundaries of AI applications.

Meanwhile, the enterprise intelligent-agent middle platform integrated with DeepSeek V4’s multimodal capability has been deployed across government services, finance, manufacturing and other industry scenarios. E-commerce customer-service agents can instantly analyse uploaded photos of damaged products and generate compensation plans, while healthcare platforms can process imaging reports and voice consultations together to provide auxiliary diagnostic suggestions.

DeepSeek positions multimodal capability differently from mainstream players: visual understanding is treated not merely as an input interface but as something deeply integrated with logical reasoning. This technical route lets its vision-language model understand real scenes more efficiently at lower cost, rather than blindly chasing benchmark rankings.

DeepSeek completed this key visual upgrade in under two weeks, moving from limited gray-release testing in late April to full availability on May 9th. The speed and depth of the upgrade are characteristic of the leading Chinese AI firm: while competitors pile on multimodal parameters to chase benchmark scores, DeepSeek takes an engineering-oriented approach and turns the large model into a practical pair of “eyes” for users. From text to images, and from dialogue to interactive experience, AI has entered a new dimension.

A notable preview has also appeared in the model selection bar of the DeepSeek client, where three options – Quick, Expert and Vision – are displayed side by side. The Vision entry is reserved for the upcoming full-capacity DeepSeek V4 multimodal version, whose official launch is expected to bring a further disruptive shift for developers and end users.

Published

11/05/2026