EAGLE3 Speculative Decoding Boosts PayPal Commerce Agent

ai-technology · 2026-04-24

A recent study evaluates speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, which runs on a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on earlier findings from NEMO-4-PAYPAL, it benchmarks EAGLE3 served through vLLM against NVIDIA NIM on the same 2xH100 hardware across 40 configurations. With gamma=3 (gamma being the number of draft tokens proposed per step), throughput rises 22-49% and latency falls 18-33% at no extra hardware cost, with an acceptance rate of about 35.5%. Raising gamma to 5 shows diminishing returns, with acceptance dropping to about 25%. LLM-as-Judge assessments validate the output quality, and speculative decoding on a single H100 matches or exceeds NIM on two H100s.

Key facts

Evaluates speculative decoding with EAGLE3 for PayPal's Commerce Agent
Model: fine-tuned llama3.1-nemotron-nano-8B-v1
Benchmarked against NVIDIA NIM on 2xH100 hardware
40 configurations tested: gamma=3, gamma=5, concurrency 1-32, temperatures 0 and 0.5
gamma=3: 22-49% throughput improvement, 18-33% latency reduction
Acceptance rate for gamma=3: ~35.5%
gamma=5 acceptance rate: ~25%
Single H100 with speculative decoding matches or exceeds two H100s with NIM
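
The acceptance figures above help explain why gamma=5 pays off less than gamma=3. The following is a back-of-envelope sketch, not the study's methodology. It assumes the reported acceptance rate means accepted draft tokens divided by proposed draft tokens, and that each verification pass also emits one token from the target model. The function name `expected_tokens_per_step` is hypothetical.

```python
def expected_tokens_per_step(gamma: int, acceptance_rate: float) -> float:
    """Average tokens emitted per target-model verification pass.

    Assumption (not from the study): acceptance_rate is
    accepted draft tokens / proposed draft tokens, and every
    verification pass adds one token from the target model itself.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    accepted_drafts = gamma * acceptance_rate
    return accepted_drafts + 1.0


def rejected_drafts_per_step(gamma: int, acceptance_rate: float) -> float:
    """Average draft tokens proposed but discarded per verification pass."""
    return gamma * (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Acceptance rates as reported in the summary above.
    for gamma, rate in [(3, 0.355), (5, 0.25)]:
        emitted = expected_tokens_per_step(gamma, rate)
        wasted = rejected_drafts_per_step(gamma, rate)
        print(f"gamma={gamma}: {emitted:.3f} tokens per pass, "
              f"{wasted:.3f} draft tokens discarded")
```

Under these assumptions, gamma=3 yields about 2.07 tokens per verification pass and gamma=5 about 2.25, while the number of discarded draft tokens nearly doubles (about 1.94 versus 3.75). The extra drafting work buys little additional output, which is consistent with the diminishing returns reported for gamma=5.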
