BentoML-Based AI Inference System Performance Analysis
A study on arXiv (2604.20420) addresses the underexplored area of AI inference deployment by analyzing a BentoML-based serving system developed in collaboration with graphworks.ai. Using a pre-trained RoBERTa sentiment-analysis model, the authors established baseline performance under three realistic workload scenarios: traffic patterns following gamma and exponential distributions simulated steady, bursty, and high-intensity conditions. Key metrics, including latency percentiles and throughput, were collected to identify bottlenecks in the inference pipeline.
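The workload-generation idea above can be sketched in a few lines: gamma-distributed inter-arrival gaps give a steady, low-variance request stream, while exponential (memoryless) gaps produce bursty clustering at the same average rate. A minimal stdlib sketch follows; the distribution parameters are illustrative assumptions, not values from the study.

```python
import random

def interarrival_times(n, pattern, seed=0):
    """Generate n request inter-arrival gaps (seconds) for a workload pattern.

    'steady' uses a gamma distribution (low variance around the mean gap);
    'bursty' uses an exponential distribution (memoryless, clustered arrivals).
    The rates below (~11 req/s) are illustrative, not from the study.
    """
    rng = random.Random(seed)
    if pattern == "steady":
        # gamma with shape k=9, scale 0.01 -> mean gap k*scale = 0.09 s
        return [rng.gammavariate(9, 0.01) for _ in range(n)]
    if pattern == "bursty":
        # exponential with the same mean gap (0.09 s) but much higher variance
        return [rng.expovariate(1 / 0.09) for _ in range(n)]
    raise ValueError(f"unknown pattern: {pattern}")
```

Replaying such gap sequences against the service lets the same average request rate be tested under very different variance profiles, which is what separates the steady from the bursty scenario.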
Key facts
- Study investigates performance and optimization of a BentoML-based AI inference system
- Collaboration with graphworks.ai
- Uses pre-trained RoBERTa sentiment analysis model
- Three realistic workload scenarios for baseline performance
- Traffic patterns follow gamma and exponential distributions
- Simulates steady, bursty, and high-intensity workloads
- Key metrics: latency percentiles and throughput
- Identifies bottlenecks in inference pipeline
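The metrics listed above can be computed from a batch of measured request latencies. The sketch below assumes the nearest-rank percentile method, a common convention; the study does not specify its interpolation scheme, and the function name is hypothetical.

```python
def summarize(latencies_ms, duration_s):
    """Compute latency percentiles and throughput from one measurement window.

    latencies_ms: per-request latencies in milliseconds.
    duration_s:   wall-clock length of the measurement window in seconds.
    """
    s = sorted(latencies_ms)

    def pct(p):
        # nearest-rank percentile: smallest sample with at least p% of
        # observations at or below it -> rank = ceil(n * p / 100)
        rank = -(-(len(s) * p) // 100)
        return s[max(rank - 1, 0)]

    return {
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "throughput_rps": len(s) / duration_s,
    }
```

Tail percentiles (p95/p99) rather than the mean are the standard way to surface queueing bottlenecks, since bursty arrivals inflate the tail long before they move the average.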
Entities
Institutions
- arXiv
- graphworks.ai