Unlimited OCR: How Baidu Solves Long Document OCR

A model that reads ten pages and stalls is not actually reading. It is guessing, page by page, with no memory of what came before.

This is the quiet failure hiding inside almost every modern OCR system. Feed it a 50-page contract and watch it choke, not because it lacks intelligence, but because its memory grows without limit until the GPU runs out of room.

Baidu's new research paper, Unlimited OCR, attacks this problem directly. The fix is not a bigger model. It is a smarter attention mechanism called Reference Sliding Window Attention, or R-SWA.

What Is OCR, and Why Does It Break at Scale?

Optical Character Recognition (OCR) converts images of text (scanned PDFs, photographed documents, handwritten notes) into machine-readable text. Modern OCR has moved past simple character matching. End-to-end OCR models now use a unified architecture: an encoder that extracts and compresses image features, paired with a decoder that generates the final text output based on those image tokens and a prompt.

The decoder is where things go wrong. As output sequences lengthen, the accumulated key-value (KV) cache drives up memory consumption and progressively slows generation, a decline in efficiency that has no human equivalent. A person copying a book by hand does not slow down on page 40 the way a transformer does.

This is the long-horizon parsing problem. No existing model can parse ten pages of a document in a single forward pass. Every system instead resorts to page-by-page processing in a loop, resetting its memory at every step. That loop is not intelligence. It is an external scheduler patching over a structural weakness.

The Human Insight Behind R-SWA

When a person transcribes a book by hand, their attention centers on three things: the source page, the few characters they just wrote, and the next character to write. They never rescan everything already copied. They forget softly, while staying anchored to two fixed points: the source material and their immediate output.

  person copying a book

R-SWA encodes this directly into the attention mechanism. For each generated token, R-SWA attends to all reference tokens, the visual tokens and the prompt, while limiting attention over its own output to the preceding 128 tokens by default. The reference stays globally visible forever. The output window slides forward, constantly forgetting what falls too far behind.
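
The visibility rule described above can be sketched as a boolean attention mask. This is a minimal illustration, not Baidu's implementation: the function name `rswa_mask` is hypothetical, and it assumes the window counts the current token plus the ones just before it, a detail the description here leaves open.

```python
def rswa_mask(num_ref, num_out, window=128):
    """Return one row per output token; True marks a position it may attend to.

    Columns are laid out as [reference tokens | output tokens].
    """
    mask = []
    for i in range(num_out):
        # Oldest output position still inside the window for token i
        # (assumption: the window includes token i itself).
        start = max(0, i - window + 1)
        reference_part = [True] * num_ref  # reference is always fully visible
        output_part = [start <= j <= i for j in range(num_out)]  # causal and windowed
        mask.append(reference_part + output_part)
    return mask


# Tiny example: 3 reference tokens, 5 output tokens, window of 2.
for row in rswa_mask(num_ref=3, num_out=5, window=2):
    print("".join("1" if allowed else "0" for allowed in row))
# 11110000
# 11111000
# 11101100
# 11100110
# 11100011
```

The left three columns never switch off, while the band of ones on the right slides forward one step per generated token.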

Architecture: How Unlimited OCR Is Built

  full framework

Unlimited OCR does not start from scratch. It takes DeepSeek OCR as its baseline and retains the DeepEncoder, replacing every attention layer in the decoder with R-SWA.

Encoder

DeepEncoder is roughly 380M parameters, built from an 80M-parameter SAM-base block using windowed attention and a 300M-parameter CLIP-large block using dense global attention, connected by a 16x compression module.

A 1024×1024 PDF page compresses down to just 256 tokens. This matters because visual tokens, once encoded, never change again during decoding. They stay static while the output grows.
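
The 256-token figure can be checked with a few lines of arithmetic. The sketch below assumes a 16×16-pixel patch size for the SAM-style front end, which is not stated in this article, and the helper name `visual_token_count` is made up for illustration. Only the 16x compression factor and the 1024×1024 input come from the text.

```python
def visual_token_count(height, width, patch_size=16, compression=16):
    """Estimate visual tokens per page after patching and compression.

    patch_size=16 is an assumption; compression=16 matches the 16x module.
    """
    patches = (height // patch_size) * (width // patch_size)
    return patches // compression


per_page = visual_token_count(1024, 1024)
print(per_page)       # 256
print(40 * per_page)  # 10240 reference tokens if 40 such pages are encoded together
```

Under these assumptions, a 40-page document contributes about ten thousand reference tokens, all of which stay fixed for the entire decode.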

Decoder

The decoder is a 3B-parameter Mixture-of-Experts model with only 500M activated parameters per token, keeping inference cheap despite the large total parameter count.

KV Cache

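This is the structural fix. The KV cache operates as a fixed-size queue of capacity m+n, where m is the number of reference tokens and n is the sliding-window width. Once the window is full, each newly generated token evicts the oldest output token, the one at position m+1 just after the reference prefix. Both compute cost and memory usage therefore stay flat throughout generation instead of growing without bound.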
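
A toy version of this queue is easy to write with `collections.deque`. The class below is an illustrative sketch only: `RSWACache` is a hypothetical name, and it stores opaque entries where a real cache would hold per-layer key and value tensors.

```python
from collections import deque


class RSWACache:
    """Toy fixed-capacity cache: m pinned reference entries plus n recent outputs."""

    def __init__(self, reference_entries, window=128):
        self.reference = list(reference_entries)  # never evicted
        self.recent = deque(maxlen=window)        # oldest entry drops out when full

    def append(self, entry):
        # With maxlen set, deque discards the leftmost item automatically.
        self.recent.append(entry)

    def visible(self):
        """Entries the next token may attend to."""
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


cache = RSWACache(["ref0", "ref1", "ref2"], window=4)
for t in range(10):
    cache.append(f"out{t}")

print(len(cache))           # 7, i.e. m + n = 3 + 4
print(cache.visible()[3:])  # ['out6', 'out7', 'out8', 'out9']
```

However many tokens are appended, the length never exceeds m + n, which is the property that keeps memory flat.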

The Math Behind the Memory Saving

R-SWA's attention window splits into two segments. The prefix segment P contains all reference tokens and stays globally visible to every later token. The decode segment Dₙ(t) is a causal sliding window of fixed width n that moves forward with generation.

  the latency chart comparing Ds-Attn vs UoW-Attn per-call duration across decode steps

For standard attention, KV cache size grows linearly: it equals the prefix length plus every token generated so far. Under R-SWA, the cache equals the prefix length plus min(n, T), where n is the window width and T is the number of tokens generated so far. It is therefore upper-bounded by a constant, never climbing past the prefix length plus the window size, no matter how long generation runs.

In plain terms: standard attention remembers everything it has ever written. R-SWA remembers the source document fully, but only the last 128 tokens of its own output. As generation length grows large relative to the window, the ratio between R-SWA's cache size and standard attention's cache size approaches zero. Memory pressure simply stops increasing.
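
Written out, with m for the number of reference tokens, n for the window width, and T for the number of tokens generated so far (the subscript names are chosen here for illustration), the two cache sizes and their ratio are:

```latex
\begin{aligned}
C_{\text{full}}(T) &= m + T, \\
C_{\text{R-SWA}}(T) &= m + \min(n, T) \;\le\; m + n, \\
\lim_{T \to \infty} \frac{C_{\text{R-SWA}}(T)}{C_{\text{full}}(T)}
  &= \lim_{T \to \infty} \frac{m + \min(n, T)}{m + T} = 0 .
\end{aligned}
```

As a rough worked example, take a single page with m = 256 visual tokens (ignoring the prompt), the default n = 128, and T = 6000 generated tokens. Standard attention would hold 256 + 6000 = 6256 entries, while R-SWA holds 256 + 128 = 384, about 6% of that.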

This shows up directly in inference speed. The standard attention kernel in DeepSeek OCR incurs growing latency with every decoding step, while Unlimited OCR's duration stays flat, a direct result of R-SWA running across every decoder layer. GPU memory follows the same pattern: linear growth for the baseline, a fixed footprint for Unlimited OCR.

The Benchmark Numbers

  OmniDocBench scores and TPS comparison

Accuracy did not get traded away for efficiency. On OmniDocBench v1.5, Unlimited OCR scores 93.23% overall, a 6.22-point improvement over the DeepSeek OCR baseline, while using the identical 3B-parameter model with 500M activated parameters.

Breaking that down: text edit distance drops by 0.035 and table structure accuracy (TEDS) improves by 5.96 percentage points compared to the baseline. On the newer v1.6 benchmark, Unlimited OCR again reaches state-of-the-art at 93.92% overall, confirming the gains hold across both benchmark versions.

Long documents are where the real test happens. At 40-plus pages processed in a single pass, edit distance stays below 0.11 with a Distinct-35 score of 97%, a sign of strong consistency even at extreme document lengths.
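
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It is not the benchmark's official scoring code. It assumes edit distance means Levenshtein distance normalized by the longer string, and that Distinct-35 is the usual distinct-n ratio (unique n-grams over total n-grams) with n = 35. Neither definition is spelled out in this article.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """0.0 means identical, 1.0 means entirely different (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return 0.0 if longest == 0 else edit_distance(pred, ref) / longest


def distinct_n(tokens, n):
    """Share of n-grams that are unique; low values suggest repetitive output."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Returning 1.0 for inputs shorter than n is an arbitrary convention.
    return len(set(ngrams)) / len(ngrams) if ngrams else 1.0


print(edit_distance("kitten", "sitting"))    # 3
print(distinct_n("a b a b a b".split(), 2))  # 0.4
```

A looping decoder repeats long n-grams, which pulls the distinct-n ratio down, so a high score at 40-plus pages suggests the sliding window is not causing the model to lose its place.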

Speed scales the same way. At 256 output tokens, both models run at nearly identical speed. But as output length grows toward 6,000 tokens, DeepSeek OCR's throughput steadily declines while Unlimited OCR's stays essentially flat, a 35% speed advantage at long output lengths.

Running Unlimited OCR Locally

Baidu released the model openly. Code and model weights are publicly available on GitHub at github.com/baidu/Unlimited-OCR.

Setup mirrors the DeepSeek OCR workflow, since Unlimited OCR shares its checkpoint lineage:

# Clone the repository
git clone https://github.com/baidu/Unlimited-OCR
cd Unlimited-OCR

# Install dependencies
pip install -r requirements.txt

# Run inference on a multi-page PDF
python infer.py \
  --model_path ./checkpoints/unlimited-ocr \
  --input_path ./sample_docs/long_document.pdf \
  --max_pages 40 \
  --window_size 128

Alternatively, load the model from Python through transformers:

from transformers import AutoModel, AutoTokenizer
import torch

model_path = "baidu/Unlimited-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda().eval()

result = model.parse_document(
    image_paths=["page1.png", "page2.png", "page3.png"],
    prompt="Transcribe this document with reading order.",
    sliding_window=128
)
print(result)

Trying It Online

A live demo is running on Hugging Face Spaces at huggingface.co/spaces/akhaliq/Unlimited-OCR. It lets you upload a multi-page document directly in the browser and watch the model parse it in a single pass, no local GPU setup required. It's the fastest way to see R-SWA's flat memory behavior in action before committing to a local install.

  Hugging Face Space demo interface

Limitations Worth Knowing

R-SWA is not magic. The model cannot achieve truly unlimited parsing under a finite context length, because it remains constrained by prefill length: the longer the document, the longer the prefill, even with high compression. Baidu's team is already working on longer context training and a prefill pool that would let the model fetch document chunks on demand, closer to how a human flips between pages.

Conclusion

Unlimited OCR does not solve long-document parsing by scaling up. It solves it by changing what the model is allowed to remember. Reference tokens stay permanently visible. Output tokens fade through a sliding window.

That single architectural change turns a model that used to slow down and run out of memory on long documents into one that holds steady at 40-plus pages in a single forward pass, with accuracy that beats its own baseline rather than trading against it.

The bigger idea here outlasts OCR. R-SWA is described as a general-purpose parsing attention mechanism, applicable beyond OCR to tasks like automatic speech recognition and translation, anywhere a model needs to track long-running output against a fixed reference.

Long-horizon parsing has been a bottleneck for VLMs since the first end-to-end OCR systems appeared. This paper is one of the clearest signs yet that the bottleneck was never the encoder. It was always how the decoder chose to forget.

FAQs

Q1. What is Unlimited OCR, and how is it different from traditional OCR models?

Unlimited OCR is Baidu's end-to-end OCR model that introduces Reference Sliding Window Attention (R-SWA). Unlike traditional OCR systems whose memory usage grows with document length, Unlimited OCR maintains a fixed memory footprint, enabling efficient parsing of long documents.

Q2. What is Reference Sliding Window Attention (R-SWA)?

R-SWA is a new attention mechanism that keeps all reference tokens (the document's visual tokens and the prompt) permanently accessible while restricting attention over previously generated text to a fixed sliding window. This caps the KV cache at a constant size and keeps inference speed flat as the output grows.

Q3. Can Unlimited OCR process very long documents efficiently?

Yes. Unlimited OCR can process documents exceeding 40 pages in a single forward pass while maintaining stable GPU memory usage, consistent inference speed, and state-of-the-art OCR accuracy.