Chinese AI startup DeepSeek has launched a new multimodal AI model, which it said is capable of processing large and complex documents using significantly fewer tokens.
The Hangzhou-based company said that DeepSeek-OCR uses visual perception as a medium to compress text for large language models (LLMs) more efficiently. Both the source code and the weights of the model are publicly available via the online developer platforms Hugging Face and GitHub. In its research, DeepSeek found that using “vision encoders” to compress text for LLMs would enable them to process vast amounts of text at lower computing costs.
“Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models,” the company said in a technical paper accompanying the model’s release.
I quite like the new DeepSeek-OCR paper. It’s a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn’t matter.
The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language… https://t.co/AxRXBdoO0F
— Andrej Karpathy (@karpathy) October 20, 2025
The launch of DeepSeek-OCR reflects the company’s continued focus on improving the efficiency of LLMs while driving down the costs of building and using them. The company is said to have taken a similar approach in developing its breakthrough open-weight models V3 and R1, which made waves across the tech industry for achieving performance comparable to cutting-edge models like OpenAI’s o1 at only a fraction of the cost.
Technical specs
With DeepSeek-OCR, the company aims to address a key limitation of LLMs: handling long contexts without running into memory limits. Its core hypothesis is that processing text as images can be more computationally efficient than processing raw digital text. The new OCR model serves as a proof of concept for this idea.
The model comprises two parts: a 380 million-parameter DeepEncoder used to analyse each image and produce a compressed version of it, and a text generator with 570 million active parameters built on top of a three billion-parameter mixture-of-experts (MoE) language model.
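The two-stage design described above can be sketched as a toy pipeline: compress a page image into a small set of vision tokens, then decode text from them. This is purely illustrative; the class and method names here are assumptions for the sketch, not DeepSeek’s actual API.

```python
# Toy stand-in for the encoder-decoder split: the real DeepEncoder (~380M
# params) compresses a page image into few vision tokens, and the real decoder
# (~570M active params of a 3B MoE) generates text from them.

class ToyEncoder:
    def compress(self, page_image: str) -> list[str]:
        # Fake "compression": keep at most 100 tokens to mimic a small,
        # fixed vision-token budget per page.
        return page_image.split()[:100]

class ToyDecoder:
    def generate(self, vision_tokens: list[str]) -> str:
        # Fake "decoding": reconstruct text from the compressed tokens.
        return " ".join(vision_tokens)

def ocr_pipeline(page_image: str) -> str:
    tokens = ToyEncoder().compress(page_image)
    return ToyDecoder().generate(tokens)

print(ocr_pipeline("hello world"))  # -> hello world
```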
DeepSeek’s researchers said that they trained the OCR model on 30 million PDF pages in roughly 100 languages, including 25 million in Chinese and English, along with 10 million synthetic diagrams, five million chemical formulae, and one million geometric figures.
Performance on benchmarks
The OCR model is capable of compressing text by up to a factor of ten while retaining 97 per cent of the original information, as per the technical paper. It can be used to process a wide range of document types, including plain text, diagrams, chemical formulae, and geometric figures, while being able to preserve the original formatting, output plain text, and even provide general image descriptions. However, the number of ‘vision tokens’ required is also likely to vary based on the document size and image resolution.
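The compression claim above lends itself to a back-of-the-envelope check: at a given compression ratio, how many vision tokens stand in for a page of text tokens? The helper below is a hypothetical illustration, not part of DeepSeek’s tooling, and the 1,500-token page size is an assumed example.

```python
# Hypothetical helper: estimate the vision-token budget that replaces a page's
# raw text tokens at a stated compression ratio (e.g. the paper's ~10x).

def vision_token_budget(text_tokens: int, compression_ratio: float) -> int:
    """Vision tokens needed for `text_tokens` of text at the given ratio."""
    return max(1, round(text_tokens / compression_ratio))

# A dense page of ~1,500 text tokens at a 10x ratio:
print(vision_token_budget(1500, 10))  # -> 150
```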
In sum, DeepSeek-OCR can generate training data for LLMs and vision language models (VLMs) at a scale of more than 200,000 pages per day while running on a single Nvidia A100 GPU.
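As a quick sanity check on that reported throughput, 200,000 pages per day on one GPU works out to a little over two pages per second, as the small calculation below shows (the conversion itself is plain arithmetic, not a figure from the paper).

```python
# Convert the reported daily throughput into a per-second rate.

def pages_per_second(pages_per_day: int) -> float:
    return pages_per_day / (24 * 60 * 60)  # seconds in a day

print(round(pages_per_second(200_000), 2))  # -> 2.31
```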
The OCR model was evaluated on two benchmarks: OmniDocBench, which is used to evaluate a model’s document parsing capabilities, and the Fox benchmark, which is used to evaluate the focusing capabilities of vision language models on dense PDF documents.
“On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6,000+ tokens per page on average) while utilising fewer than 800 vision tokens,” the paper read.

