RoundPipe: Near-Zero-Bubble Pipeline for LLM Training on Consumer GPUs
RoundPipe is a new pipeline schedule for fine-tuning Large Language Models (LLMs) on consumer-grade GPUs. Conventional pipeline parallelism binds each model stage to a fixed GPU, so uneven stages leave some devices idle and cap throughput. RoundPipe removes this weight-binding constraint by treating GPUs as stateless execution workers and dispatching computation stages to them in round-robin order, driving pipeline bubbles to near zero. The schedule also integrates with CPU offloading to work within the limited memory and slow PCIe interconnects typical of servers built from multiple consumer GPUs.
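The round-robin dispatch idea can be sketched in a few lines. The Python below is a minimal, hypothetical illustration rather than code from the paper: forward tasks for each microbatch advance through stages in a diagonal wavefront, and each ready task is handed to the next GPU in round-robin order, which is possible once no stage's weights are pinned to a device. All names and the wavefront ordering are assumptions for illustration.

```python
# Hypothetical sketch of RoundPipe-style round-robin dispatch (not the
# authors' code). Stage weights are not bound to a GPU; any worker can
# run any stage once the weights are streamed to it.

def roundpipe_schedule(num_stages: int, num_microbatches: int, num_gpus: int):
    """Assign each (microbatch, stage) forward task to a GPU round-robin.

    Returns a list of (gpu, microbatch, stage) tuples in dispatch order.
    Tasks are emitted in a diagonal wavefront, so each microbatch still
    passes through stages in pipeline order.
    """
    order = []
    step = 0
    for tick in range(num_microbatches + num_stages - 1):
        for m in range(num_microbatches):
            s = tick - m
            if 0 <= s < num_stages:
                gpu = step % num_gpus  # stateless worker: next GPU in line
                order.append((gpu, m, s))
                step += 1
    return order

if __name__ == "__main__":
    for gpu, m, s in roundpipe_schedule(num_stages=4, num_microbatches=4, num_gpus=2):
        print(f"GPU {gpu} runs stage {s} of microbatch {m}")
```

Because every task goes to whichever GPU is next in line, no device sits idle waiting for "its" stage to come around, which is the intuition behind the near-zero bubble claim.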
Key facts
- RoundPipe is a novel pipeline schedule for LLM training on consumer GPUs.
- It breaks the weight binding constraint by treating GPUs as stateless workers.
- Computation stages are dispatched in a round-robin manner.
- Achieves a near-zero-bubble pipeline schedule.
- Addresses the weight binding issue where uneven stages limit throughput.
- Integrates with CPU offloading to reduce communication overhead (see the sketch after this list).
- Designed for consumer-grade GPU servers with limited memory and slow PCIe.
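To make the offloading integration concrete, here is a minimal, hypothetical PyTorch sketch (assumed, not from the paper) that overlaps host-to-device weight transfers with stage computation using a side CUDA stream, the standard way to hide PCIe latency. The function names (`prefetch`, `run_stage`, `linear_stage`) and the toy one-matrix stages are illustrative assumptions.

```python
# Hypothetical sketch of the CPU-offloading side (not the paper's code):
# while the current stage computes, the next stage's weights are copied
# CPU -> GPU on a separate CUDA stream so the PCIe transfer overlaps
# compute. Assumes PyTorch and a CUDA-capable machine.

import torch

copy_stream = torch.cuda.Stream()

def prefetch(stage_weights_cpu):
    # Launch async host-to-device copies on the side stream.
    # Weights must live in pinned CPU memory for the copy to overlap.
    with torch.cuda.stream(copy_stream):
        return [w.to("cuda", non_blocking=True) for w in stage_weights_cpu]

def run_stage(stage_fn, weights_gpu, activations):
    # Make the default compute stream wait for the copies issued so far
    # before touching the freshly transferred weights.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return stage_fn(weights_gpu, activations)

if __name__ == "__main__":
    num_stages = 4
    # Toy "stages": one pinned weight matrix each.
    stages_cpu = [[torch.randn(1024, 1024).pin_memory()] for _ in range(num_stages)]
    acts = torch.randn(8, 1024, device="cuda")

    def linear_stage(weights, x):
        return x @ weights[0]

    next_w = prefetch(stages_cpu[0])
    for k in range(num_stages):
        cur_w = next_w
        # Issue stage k's compute (asynchronous on the default stream)...
        acts = run_stage(linear_stage, cur_w, acts)
        # ...then start streaming stage k+1's weights; the copy on the
        # side stream overlaps with the matmul still running above.
        if k + 1 < num_stages:
            next_w = prefetch(stages_cpu[k + 1])
    print(acts.shape)
```

The ordering matters: the compute kernel is enqueued before the next prefetch, so the copy stream only carries work for the *next* stage and the wait in `run_stage` never stalls on transfers the current stage does not need.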
- Published on arXiv with ID 2604.27085.