LLMOps Stack for Fraud and AML Compliance
A recent study presents a specialized LLMOps stack for fraud detection and anti-money-laundering (AML) compliance. Unlike standard chat workloads, compliance prompts are prefix-heavy, schema-constrained, and evidence-rich, which demands effective prefix reuse, KV-cache management, runtime tuning, model orchestration, and output validation. The stack serves self-hosted open-weight models (Meta Llama and Alibaba Qwen) and combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter- and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional pruning. The study is available on arXiv under ID 2605.11232.
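To illustrate why prefix-heavy compliance prompts benefit from prefix reuse, here is a minimal sketch in the spirit of PagedAttention plus Automatic Prefix Caching: the KV cache is split into fixed-size blocks keyed by a hash of the full token prefix up to each block, so requests sharing a long prompt prefix (e.g. a compliance policy) reuse cached blocks instead of recomputing them. The class, block size, and stubbed KV values are illustrative assumptions, not vLLM's actual API.

```python
# Hypothetical sketch of block-level automatic prefix caching.
# Block size and data structures are illustrative, not vLLM internals.
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (real systems use e.g. 16)

class PrefixBlockCache:
    def __init__(self):
        self.blocks = {}  # prefix hash -> cached KV (stubbed as a token tuple)

    def _block_hash(self, tokens, end):
        # The hash covers the entire prefix up to `end`, so a block is
        # only reused when everything before it matches as well.
        data = ",".join(map(str, tokens[:end])).encode()
        return hashlib.sha256(data).hexdigest()

    def match_prefix(self, tokens):
        """Return the number of leading tokens whose KV blocks are cached."""
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if self._block_hash(tokens, end) in self.blocks:
                matched = end
            else:
                break
        return matched

    def insert(self, tokens):
        """Cache KV blocks for every full block of this prompt."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            h = self._block_hash(tokens, end)
            self.blocks.setdefault(h, tuple(tokens[:end]))

cache = PrefixBlockCache()
policy_prefix = list(range(12))           # shared compliance-policy prefix
cache.insert(policy_prefix + [100, 101])  # first request fills the cache
hit = cache.match_prefix(policy_prefix + [200, 201, 202, 203])
print(hit)  # 12: only the shared prefix blocks are reused
```

The second request recomputes only its case-specific suffix; with real compliance prefixes spanning thousands of tokens, this is where most of the latency savings come from.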
Key facts
- The paper focuses on LLMOps for fraud detection and AML compliance.
- Compliance prompts are prefix-heavy, schema-constrained, and evidence-rich.
- The stack uses self-hosted open-weight models: Meta Llama and Alibaba Qwen.
- Techniques include vLLM-style runtime tuning, PagedAttention, and Automatic Prefix Caching.
- Multi-adapter serving and adapter/prompt-length-aware batching are employed.
- Sleep/wake lifecycle management and speculative decoding are part of the stack.
- The paper is published on arXiv with ID 2605.11232.
- The stack is designed for structured outputs like JSON labels or risk factors.
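Since the stack targets schema-constrained outputs such as JSON risk labels and risk factors, a validation step typically sits between the model and downstream systems. The following is a minimal sketch of such a validator using only the standard library; the field names (`risk_label`, `risk_factors`) and the allowed label set are illustrative assumptions, not schema details from the paper.

```python
# Hypothetical output validator for schema-constrained compliance
# completions. Field names and allowed labels are assumptions.
import json

ALLOWED_LABELS = {"low", "medium", "high"}

def validate_risk_output(raw: str) -> dict:
    """Parse and validate a model completion; raise ValueError if invalid."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be a JSON object")
    if obj.get("risk_label") not in ALLOWED_LABELS:
        raise ValueError(f"risk_label must be one of {sorted(ALLOWED_LABELS)}")
    factors = obj.get("risk_factors")
    if not isinstance(factors, list) or not all(isinstance(f, str) for f in factors):
        raise ValueError("risk_factors must be a list of strings")
    return obj

good = validate_risk_output(
    '{"risk_label": "high", "risk_factors": ["structuring", "rapid movement"]}'
)
print(good["risk_label"])  # high
```

Rejecting malformed completions at this boundary lets the serving layer retry or escalate, rather than passing unvalidated model output into case-management systems.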
Entities
Institutions
- arXiv
- Meta
- Alibaba