How Deepseek Ocr fits into a Paperclip company.

Deepseek Ocr drops into any Paperclip agent that handles OCR and document-extraction work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat, with no prompt engineering or tool wiring.
Source file: SKILL.md (447 lines, Markdown)
---
name: deepseek-ocr
description: Expert skill for using DeepSeek-OCR, a vision-language model for optical character recognition with context optical compression supporting documents, PDFs, and images.
triggers:
  - use deepseek ocr
  - extract text from image with deepseek
  - ocr pdf with deepseek
  - convert document to markdown deepseek
  - deepseek ocr inference
  - run deepseek ocr on images
  - deepseek optical character recognition
  - document ocr with vllm deepseek
---

# DeepSeek-OCR

> Skill by [ara.so](https://ara.so), part of the Daily 2026 Skills collection.

DeepSeek-OCR is a vision-language model for Optical Character Recognition with "Contexts Optical Compression." It supports native and dynamic resolutions, multiple prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and can be run via vLLM (high-throughput) or HuggingFace Transformers. It processes images and PDFs, outputting structured text or markdown.

---

## Installation

### Prerequisites
- CUDA 11.8+, PyTorch 2.6.0
- Python 3.12.9 (via conda recommended)

### Setup

```bash
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu118

# Download vllm-0.8.5 whl from https://github.com/vllm-project/vllm/releases/tag/v0.8.5
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
```

### Alternative: upstream vLLM (nightly)

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

---

## Model Download

The model is available on HuggingFace: `deepseek-ai/DeepSeek-OCR`

```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")
```

---

## Inference: vLLM (Recommended for Production)

### Single Image: Streaming

```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td> for table support
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params
)

print(outputs[0].outputs[0].text)
```

### Batch Images

```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

# One request per image, all sharing the same prompt
model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")}
    }
    for p in image_paths
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)

for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)
```

### PDF Processing (via vLLM scripts)

```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py: set INPUT_PATH, OUTPUT_PATH, model path, etc.
python run_dpsk_ocr_pdf.py   # ~2500 tokens/s on A100-40G
```

### Benchmark Evaluation

```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_eval_batch.py
```

---

## Inference: HuggingFace Transformers

```python
import os
import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_name = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

# Document to markdown
res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown. ",
    image_file="document.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
print(res)
```

### Transformers Script

```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py
```

---

## Prompt Reference

| Use Case | Prompt |
|---|---|
| Document → Markdown | `<image>\n<\|grounding\|>Convert the document to markdown. ` |
| General OCR | `<image>\n<\|grounding\|>OCR this image. ` |
| Free OCR (no layout) | `<image>\nFree OCR. ` |
| Parse figure/chart | `<image>\nParse the figure. ` |
| General description | `<image>\nDescribe this image in detail. ` |
| Grounded REC | `<image>\nLocate <\|ref\|>TARGET_TEXT<\|/ref\|> in the image. ` |

```python
PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image":         "<image>\n<|grounding|>OCR this image. ",
    "free_ocr":          "<image>\nFree OCR. ",
    "parse_figure":      "<image>\nParse the figure. ",
    "describe":          "<image>\nDescribe this image in detail. ",
    "rec":               "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}
```

---

## Supported Resolutions

| Mode | Resolution | Vision Tokens |
|---|---|---|
| Tiny | 512×512 | 64 |
| Small | 640×640 | 100 |
| Base | 1024×1024 | 256 |
| Large | 1280×1280 | 400 |
| Gundam (dynamic) | n×640×640 + 1×1024×1024 | variable |

```python
# Transformers: control resolution via infer() params
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="image.jpg",
    base_size=1024,   # 512, 640, 1024, or 1280
    image_size=640,   # patch size for dynamic mode
    crop_mode=True,   # True = Gundam dynamic resolution
)
```

---

## Configuration (vLLM)

Edit `DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py`:

```python
# Key config fields (example)
MODEL_PATH = "deepseek-ai/DeepSeek-OCR"   # or local path
INPUT_PATH = "/data/input_images/"
OUTPUT_PATH = "/data/output/"
TENSOR_PARALLEL_SIZE = 1                   # GPUs for tensor parallelism
MAX_TOKENS = 8192
TEMPERATURE = 0.0
NGRAM_SIZE = 30
WINDOW_SIZE = 90
```

---

## Common Patterns

### Process a Directory of Images

```python
from pathlib import Path
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def batch_ocr(image_dir: str, output_dir: str, prompt: str = "<image>\nFree OCR."):
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )

    # Collect PNG and JPG files from the input directory
    image_files = list(Path(image_dir).glob("*.png")) + list(Path(image_dir).glob("*.jpg"))

    inputs = [
        {"prompt": prompt, "multi_modal_data": {"image": Image.open(f).convert("RGB")}}
        for f in image_files
    ]

    outputs = llm.generate(inputs, sampling_params)

    # Write one .txt file per image, named after the image stem
    for img_path, output in zip(image_files, outputs):
        out_file = Path(output_dir) / (img_path.stem + ".txt")
        out_file.write_text(output.outputs[0].text)
        print(f"Saved: {out_file}")

batch_ocr("/data/scans/", "/data/results/")
```

### Convert PDF Pages to Markdown

```python
import fitz  # PyMuPDF
from PIL import Image
from io import BytesIO
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def pdf_to_markdown(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )

    prompt = "<image>\n<|grounding|>Convert the document to markdown. "
    inputs = []
    # Render each page to a 150 dpi PNG and wrap it as a PIL image
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        img = Image.open(BytesIO(pix.tobytes("png"))).convert("RGB")
        inputs.append({"prompt": prompt, "multi_modal_data": {"image": img}})

    outputs = llm.generate(inputs, sampling_params)
    return [o.outputs[0].text for o in outputs]

pages = pdf_to_markdown("report.pdf")
full_markdown = "\n\n---\n\n".join(pages)
print(full_markdown)
```

### Grounded Text Location (REC)

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

target = "Total Amount"
prompt = f"<image>\nLocate <|ref|>{target}<|/ref|> in the image. "

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=False,
    save_results=True,
)
print(res)  # Returns bounding box / location info
```

---

## Troubleshooting

### `transformers` version conflict with vLLM
vLLM 0.8.5 requires `transformers>=4.51.1`. If you run the vLLM and Transformers code in the same environment, pip may report this requirement as an installation error; per the project docs, that error is safe to ignore.

### Flash Attention build errors
```bash
# Ensure torch is installed before flash-attn
pip install flash-attn==2.7.3 --no-build-isolation
```

### CUDA out of memory
- Use a smaller resolution: `base_size=512` or `base_size=640`
- Set `crop_mode=False` to turn off multi-crop dynamic resolution
- Reduce batch size in vLLM inputs

### Model output is garbled / repetitive
Ensure `NGramPerReqLogitsProcessor` is passed to `LLM`; this is required for proper decoding:
```python
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
llm = LLM(..., logits_processors=[NGramPerReqLogitsProcessor])
```

### Tables not rendering correctly
Add table token IDs to the whitelist:
```python
whitelist_token_ids={128821, 128822}  # <td> and </td>
```

### Multi-GPU inference
```python
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    tensor_parallel_size=4,  # number of GPUs
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)
```

---

## Key Files

```
DeepSeek-OCR-master/
├── DeepSeek-OCR-vllm/
│   ├── config.py                  # vLLM configuration
│   ├── run_dpsk_ocr_image.py      # Single image inference
│   ├── run_dpsk_ocr_pdf.py        # PDF batch inference
│   └── run_dpsk_ocr_eval_batch.py # Benchmark evaluation
└── DeepSeek-OCR-hf/
    └── run_dpsk_ocr.py            # HuggingFace Transformers inference
```
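
The Prompt Reference and Supported Resolutions tables can be folded into two small lookup helpers so that callers pick a task and a mode by name instead of retyping strings and sizes. The sketch below is illustrative only: `build_prompt`, `infer_kwargs`, and `RESOLUTION_MODES` are hypothetical names that are not part of DeepSeek-OCR. The Gundam entry mirrors the `base_size=1024, image_size=640, crop_mode=True` combination used in the examples above; for the native modes it assumes `image_size` equals `base_size` with `crop_mode=False`, which the skill file does not state explicitly.

```python
from typing import Optional

# Prompt templates copied from the Prompt Reference section.
PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image": "<image>\n<|grounding|>OCR this image. ",
    "free_ocr": "<image>\nFree OCR. ",
    "parse_figure": "<image>\nParse the figure. ",
    "describe": "<image>\nDescribe this image in detail. ",
    "rec": "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}

# Hypothetical mapping from the Supported Resolutions table to infer() kwargs.
# Assumption: native modes use image_size == base_size and crop_mode=False.
RESOLUTION_MODES = {
    "tiny": {"base_size": 512, "image_size": 512, "crop_mode": False},
    "small": {"base_size": 640, "image_size": 640, "crop_mode": False},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large": {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
}


def build_prompt(task: str, target: Optional[str] = None) -> str:
    """Return the prompt string for a task name from PROMPTS."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task {task!r}; expected one of {sorted(PROMPTS)}")
    template = PROMPTS[task]
    if "{target}" in template:
        if not target:
            raise ValueError(f"task {task!r} needs a non-empty target string")
        # str.replace avoids format() errors if the target contains braces.
        return template.replace("{target}", target)
    return template


def infer_kwargs(mode: str) -> dict:
    """Return a fresh dict of resolution kwargs for a mode name (case-insensitive)."""
    key = mode.lower()
    if key not in RESOLUTION_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(RESOLUTION_MODES)}")
    return dict(RESOLUTION_MODES[key])
```

With a Transformers model loaded as in the earlier sections, a call could then read `model.infer(tokenizer, prompt=build_prompt("document_markdown"), image_file="document.jpg", **infer_kwargs("gundam"))`. Returning a copy from `infer_kwargs` keeps callers from mutating the shared table by accident.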