Microservice Architecture for Production OCR and LLM Pipelines

ai-technology · 2026-05-20

A new paper on arXiv presents a microservice architecture for operationalizing document AI at scale, bridging the gap between academic model research and production deployment. The system encapsulates pipelines for classification, OCR, and LLM-based structured field extraction, and processes thousands of multi-page documents per hour. Key design decisions include hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous handling of IO-bound operations, and independent horizontal scaling. Batch profiling revealed two surprising findings: OCR, rather than language-model parsing, dominates end-to-end latency, and the system saturates under certain conditions. The paper also distills practical lessons for deploying document AI in production environments.
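The separation of GPU-bound inference, CPU-bound orchestration, and asynchronous IO can be illustrated with a minimal single-process sketch using Python's asyncio. It is not the paper's implementation: run_ocr_blocking and call_llm_extractor are hypothetical stand-ins (the first simulates a blocking GPU OCR call, the second a network request to an LLM extraction service), a ThreadPoolExecutor stands in for a separately deployed GPU worker, and the per-stage semaphore limits are arbitrary assumptions. The event loop only orchestrates; blocking inference is pushed off the loop, and each stage gets its own concurrency budget, loosely mirroring services that scale independently.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class DocResult:
    doc_id: str
    text: str
    fields: dict
    timings: dict = field(default_factory=dict)


def run_ocr_blocking(pages: list[str]) -> str:
    """Hypothetical stand-in for a blocking, GPU-bound OCR call."""
    time.sleep(0.01 * len(pages))  # simulate per-page inference cost
    return "\n".join(pages)


async def call_llm_extractor(text: str) -> dict:
    """Hypothetical stand-in for an IO-bound request to an LLM extraction service."""
    await asyncio.sleep(0.005)  # simulate a network round trip
    return {"n_chars": len(text)}


async def process_document(doc_id, pages, gpu_pool, ocr_slots, llm_slots):
    loop = asyncio.get_running_loop()
    timings = {}

    t0 = time.perf_counter()
    async with ocr_slots:  # cap concurrent OCR jobs (the scarce GPU stage)
        # Blocking inference runs in the pool so the event loop stays free.
        text = await loop.run_in_executor(gpu_pool, run_ocr_blocking, pages)
    timings["ocr"] = time.perf_counter() - t0  # includes time queued for a slot

    t1 = time.perf_counter()
    async with llm_slots:  # separate, larger budget for IO-bound calls
        fields = await call_llm_extractor(text)
    timings["llm"] = time.perf_counter() - t1

    return DocResult(doc_id, text, fields, timings)


async def run_batch(docs: dict[str, list[str]]) -> list[DocResult]:
    ocr_slots = asyncio.Semaphore(2)  # assumed OCR concurrency limit
    llm_slots = asyncio.Semaphore(8)  # assumed LLM request concurrency limit
    with ThreadPoolExecutor(max_workers=2) as gpu_pool:
        tasks = [
            process_document(doc_id, pages, gpu_pool, ocr_slots, llm_slots)
            for doc_id, pages in docs.items()
        ]
        return await asyncio.gather(*tasks)  # results keep input order


if __name__ == "__main__":
    docs = {f"doc-{i}": [f"page {j}" for j in range(3)] for i in range(4)}
    for r in asyncio.run(run_batch(docs)):
        print(r.doc_id, r.fields, {k: round(v, 3) for k, v in r.timings.items()})
```

Each result records per-stage wall-clock time, including time spent waiting for a stage slot, which is the kind of measurement a batch profiling run would aggregate to find the dominant stage.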

Key facts

arXiv paper 2605.18818
Microservice architecture for OCR and LLM pipelines
Processes thousands of multi-page documents per hour
Hybrid classification approach
Separates GPU-bound inference from CPU-bound orchestration
Asynchronous processing for IO-bound operations
Independent horizontal scaling strategy
OCR, not language-model parsing, dominates end-to-end latency
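Why it matters that OCR dominates latency, and why independent scaling helps, can be shown with back-of-envelope capacity arithmetic. The sketch assumes each replica of a stage serves `concurrency` documents in parallel with a mean service time of `seconds_per_doc`, so one replica sustains concurrency / seconds_per_doc documents per second; a stage saturates once the arrival rate reaches the combined capacity of its replicas. StageProfile, replicas_needed, utilization, the 0.7 utilization target, and every number in the example are hypothetical illustrations, not figures from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageProfile:
    """Hypothetical profiled characteristics of one pipeline stage."""
    name: str
    seconds_per_doc: float  # mean service time per document on one replica
    concurrency: int        # documents one replica processes in parallel


def capacity_per_replica(stage: StageProfile) -> float:
    """Documents per second that a single replica can sustain."""
    return stage.concurrency / stage.seconds_per_doc


def utilization(stage: StageProfile, replicas: int, docs_per_hour: float) -> float:
    """Offered load divided by total capacity; values >= 1.0 mean saturation."""
    arrival_rate = docs_per_hour / 3600.0
    return arrival_rate / (replicas * capacity_per_replica(stage))


def replicas_needed(stage: StageProfile, docs_per_hour: float,
                    target_utilization: float = 0.7) -> int:
    """Smallest replica count keeping utilization at or below the target."""
    arrival_rate = docs_per_hour / 3600.0
    usable = capacity_per_replica(stage) * target_utilization
    return max(1, math.ceil(arrival_rate / usable))


def plan(stages: list[StageProfile], docs_per_hour: float) -> dict[str, int]:
    """Size each stage on its own, as independent horizontal scaling allows."""
    return {s.name: replicas_needed(s, docs_per_hour) for s in stages}


if __name__ == "__main__":
    # Illustrative numbers only: OCR is given the largest per-document cost.
    stages = [
        StageProfile("classify", seconds_per_doc=0.2, concurrency=8),
        StageProfile("ocr", seconds_per_doc=12.0, concurrency=4),
        StageProfile("llm", seconds_per_doc=3.0, concurrency=16),
    ]
    print(plan(stages, docs_per_hour=3000))  # → {'classify': 1, 'ocr': 4, 'llm': 1}
    print(round(utilization(stages[1], replicas=2, docs_per_hour=3000), 2))  # → 1.25
```

With these made-up profiles, OCR needs four replicas while classification and LLM extraction each fit on one, so sizing services independently avoids over-provisioning the cheaper stages. Running OCR on only two replicas would push its utilization past 1.0, meaning the queue in front of it would grow without bound.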
