EAGLE3 Speculative Decoding Boosts PayPal Commerce Agent
A recent study evaluates speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, which runs on a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on the earlier NEMO-4-PAYPAL work, it benchmarks EAGLE3 served through vLLM against NVIDIA NIM on the same 2xH100 hardware across 40 configurations. With gamma=3, throughput rises by 22-49% and latency falls by 18-33% at no additional hardware cost, with an acceptance rate of about 35.5%. Raising gamma to 5 shows diminishing returns, with acceptance dropping to approximately 25%. LLM-as-Judge assessments indicate that output quality is preserved, and speculative decoding on a single H100 matches or exceeds NIM on two H100s.
Key facts
- Evaluates speculative decoding with EAGLE3 for PayPal's Commerce Agent
- Model: fine-tuned llama3.1-nemotron-nano-8B-v1
- Benchmarked against NVIDIA NIM on 2xH100 hardware
- 40 configurations tested, spanning speculative token counts (gamma=3, gamma=5), concurrency levels 1-32, and sampling temperatures 0 and 0.5
- gamma=3: 22-49% throughput improvement, 18-33% latency reduction
- Acceptance rate for gamma=3: ~35.5%
- gamma=5 acceptance rate: ~25%
- Single H100 with speculative decoding matches or exceeds two H100s with NIM
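
The diminishing returns at gamma=5 can be illustrated with a little arithmetic. The sketch below is not taken from the study: it assumes that "acceptance rate" means accepted draft tokens divided by proposed draft tokens, and that every verification step also yields one token from the target model itself. The function names are hypothetical, and the derived figures are illustrative rather than reported results.

```python
def accepted_drafts_per_step(acceptance_rate: float, gamma: int) -> float:
    """Mean number of draft tokens accepted per verification step.

    Assumes acceptance_rate = accepted draft tokens / proposed draft tokens,
    with gamma draft tokens proposed at every step.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    return acceptance_rate * gamma


def tokens_per_step(acceptance_rate: float, gamma: int) -> float:
    """Mean tokens emitted per target-model forward pass.

    Assumption: the target model always contributes one token of its own
    (the correction or bonus token) on top of the accepted drafts.
    """
    return accepted_drafts_per_step(acceptance_rate, gamma) + 1.0


# Acceptance rates quoted in the key facts above.
reported = {3: 0.355, 5: 0.25}

for gamma, rate in reported.items():
    print(f"gamma={gamma}: {tokens_per_step(rate, gamma):.3f} tokens per step")
```

Under these assumptions, gamma=3 yields roughly 2.07 tokens per target pass and gamma=5 roughly 2.25. The two extra draft tokens proposed at each step add less than 0.2 tokens of output, which is consistent with the diminishing returns described above.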
Entities
Institutions
- PayPal
- NVIDIA