Microservice Architecture for Production OCR and LLM Pipelines
A new paper on arXiv presents a microservice architecture for operationalizing document AI at scale, bridging the gap between academic model research and production deployment. The system encapsulates pipelines for classification, OCR, and LLM-based structured field extraction, and processes thousands of multi-page documents per hour. Key design decisions include hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous handling of IO-bound operations, and independent horizontal scaling of each service. Batch profiling revealed two surprising findings: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates under certain conditions. The paper also shares practical insights for deploying document AI in production environments.
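To make the separation of concerns concrete, here is a minimal sketch of how such a pipeline could be orchestrated, assuming Python's asyncio. It is not the paper's implementation: every function name, the placeholder stage bodies, and the concurrency limits are hypothetical. The idea it illustrates is that IO-bound steps are awaited rather than blocked on, and each GPU-bound stage gets its own concurrency cap so it can be tuned independently of the orchestration code.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Result:
    doc_id: str
    doc_type: str
    text: str
    fields: dict


async def fetch_document(doc_id: str) -> bytes:
    # IO-bound placeholder: stands in for an object-storage or network read.
    await asyncio.sleep(0)
    return f"invoice {doc_id}".encode()


def classify(raw: bytes) -> str:
    # CPU-bound placeholder for the classification step (hypothetical rule).
    return "invoice" if b"invoice" in raw else "other"


async def run_ocr(raw: bytes, ocr_slots: asyncio.Semaphore) -> str:
    # GPU-bound placeholder: the semaphore caps concurrent calls to the
    # (hypothetical) OCR inference service.
    async with ocr_slots:
        await asyncio.sleep(0)
        return raw.decode()


async def extract_fields(text: str, doc_type: str,
                         llm_slots: asyncio.Semaphore) -> dict:
    # GPU-bound placeholder for LLM-based structured field extraction.
    async with llm_slots:
        await asyncio.sleep(0)
        return {"doc_type": doc_type, "n_chars": len(text)}


async def process(doc_id: str, ocr_slots: asyncio.Semaphore,
                  llm_slots: asyncio.Semaphore) -> Result:
    # Orchestration only: no model code lives here.
    raw = await fetch_document(doc_id)
    doc_type = classify(raw)
    text = await run_ocr(raw, ocr_slots)
    fields = await extract_fields(text, doc_type, llm_slots)
    return Result(doc_id, doc_type, text, fields)


async def run_batch(doc_ids, ocr_concurrency: int = 2, llm_concurrency: int = 2):
    # Separate limits per stage mirror the idea of scaling each service
    # independently; the values here are arbitrary.
    ocr_slots = asyncio.Semaphore(ocr_concurrency)
    llm_slots = asyncio.Semaphore(llm_concurrency)
    tasks = (process(d, ocr_slots, llm_slots) for d in doc_ids)
    return await asyncio.gather(*tasks)


results = asyncio.run(run_batch(["a1", "b2", "c3"]))
```

In a real deployment the OCR and extraction stages would be remote services behind a queue or HTTP endpoint rather than local coroutines; the semaphores here only stand in for whatever backpressure mechanism those services provide.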
Key facts
- arXiv paper 2605.18818
- Microservice architecture for OCR and LLM pipelines
- Processes thousands of multi-page documents per hour
- Hybrid classification approach
- Separates GPU-bound inference from CPU-bound orchestration
- Asynchronous processing for IO-bound operations
- Independent horizontal scaling strategy
- OCR, not language-model parsing, dominates end-to-end latency
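A finding such as the latency one above comes from per-stage profiling. The following is a small sketch of one way to collect per-stage timings and compute each stage's share of total time. The StageProfiler class and the stage names are hypothetical and are not taken from the paper, which may profile its batches quite differently.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        # Time the enclosed block and add it to the stage's running total.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def record(self, name: str, seconds: float) -> None:
        # Add a duration measured elsewhere (e.g. reported by a remote service).
        self.totals[name] += seconds

    def shares(self) -> dict:
        # Fraction of total recorded time spent in each stage.
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def dominant(self):
        # Name of the stage with the largest accumulated time, if any.
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)


profiler = StageProfiler()
with profiler.stage("ocr"):
    pass  # call the OCR service here
with profiler.stage("llm_parsing"):
    pass  # call the LLM extraction service here
print(profiler.shares())
```

Aggregating shares over a whole batch, rather than a single document, is what makes a claim like "one stage dominates" meaningful, since per-document timings can vary widely with page count.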
Entities
Institutions
- arXiv