ARTFEED — Contemporary Art Intelligence

Unsloth and NVIDIA Collaborate to Accelerate LLM Training by ~25%

ai-technology · 2026-05-07

Unsloth has partnered with NVIDIA to introduce optimizations that make large language model (LLM) training approximately 25% faster with no loss in accuracy. These improvements build on Unsloth's existing 2-5x speedup and are enabled automatically on RTX laptops, data center GPUs, and DGX Spark machines once Unsloth is updated. The collaboration targets three areas: packed-sequence caching, double-buffered activation checkpointing, and efficient Mixture-of-Experts (MoE) routing. Packed-sequence caching reduces overhead by reusing sequence-packing metadata across transformer layers instead of reconstructing it for each layer. Double-buffered activation checkpointing overlaps data transfers with computation, so copy latency is hidden behind ongoing work. The MoE routing optimizations group token-to-expert assignments to reduce the number of dynamic indexing operations. Benchmarks on NVIDIA B200 Blackwell GPUs show consistent speedups on larger dense models with minimal additional VRAM usage, and final training losses remain effectively unchanged, confirming that the optimizations preserve model quality.
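
To make the packed-sequence idea concrete, here is a minimal PyTorch sketch of how variable-length attention metadata (cumulative sequence offsets) might be built once per packed batch and then reused by every transformer layer instead of being rebuilt layer by layer. The names build_packed_metadata and PackedMetadataCache are illustrative assumptions, not Unsloth's actual API.

    import torch
    import torch.nn.functional as F

    def build_packed_metadata(seq_lens: torch.Tensor) -> dict:
        # Cumulative sequence offsets and the max length are the metadata that
        # variable-length attention kernels need for a packed batch.
        cu_seqlens = F.pad(torch.cumsum(seq_lens, dim=0), (1, 0)).to(torch.int32)
        return {"cu_seqlens": cu_seqlens, "max_seqlen": int(seq_lens.max())}

    class PackedMetadataCache:
        # Build the metadata once per packing pattern and hand the same tensors
        # to every layer, rather than reconstructing them in each forward call.
        def __init__(self):
            self._cache = {}

        def get(self, seq_lens: torch.Tensor) -> dict:
            key = tuple(seq_lens.tolist())
            if key not in self._cache:
                self._cache[key] = build_packed_metadata(seq_lens)
            return self._cache[key]

Under this scheme a 32-layer model would construct the packing metadata once per step instead of 32 times, which is where the saved overhead comes from.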

Key facts

  • Unsloth and NVIDIA collaborated to make LLM training ~25% faster.
  • The optimizations introduce no loss in accuracy.
  • Improvements are auto-enabled on RTX laptops, data center GPUs, and DGX Spark machines.
  • Packed-sequence caching reuses metadata across layers instead of rebuilding it.
  • Double-buffered activation checkpointing overlaps copy and compute (see the first sketch after this list).
  • MoE routing groups token assignments to reduce dynamic queries (see the second sketch after this list).
  • Benchmarked on NVIDIA B200 Blackwell GPUs.
  • Final losses were effectively unchanged.
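
A minimal illustration of the double-buffering idea, assuming a PyTorch setup: while the current layer computes on the default CUDA stream, the previous layer's checkpointed activation is copied to a pinned host buffer on a side stream, hiding the transfer behind compute. The class DoubleBufferedOffloader and its interface are hypothetical, not Unsloth's actual implementation.

    import torch

    class DoubleBufferedOffloader:
        # Two pinned host buffers alternate: one receives the copy in flight
        # while the other is free to accept the next activation.
        def __init__(self, shape, dtype=torch.float16):
            self.copy_stream = torch.cuda.Stream()
            self.buffers = [torch.empty(shape, dtype=dtype, pin_memory=True)
                            for _ in range(2)]
            self.slot = 0

        def offload(self, activation: torch.Tensor) -> torch.Tensor:
            host = self.buffers[self.slot]
            self.slot ^= 1
            # Wait for the kernel that produced the activation, then copy
            # asynchronously so compute on the default stream keeps running.
            self.copy_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(self.copy_stream):
                host.copy_(activation, non_blocking=True)
            return host  # synchronize copy_stream before reading this buffer

For the MoE routing point, a hedged sketch of the general grouping trick: sorting tokens by their assigned expert once turns many scattered, dynamically indexed lookups into a few contiguous slices. The function group_tokens_by_expert is an illustrative name, not part of Unsloth or NVIDIA code.

    import torch

    def group_tokens_by_expert(tokens: torch.Tensor,
                               expert_ids: torch.Tensor,
                               num_experts: int):
        # One sort by expert id replaces repeated dynamic index queries.
        order = torch.argsort(expert_ids)
        sorted_tokens = tokens[order]
        # Per-expert counts give contiguous slice boundaries into sorted_tokens.
        counts = torch.bincount(expert_ids, minlength=num_experts)
        offsets = torch.cumsum(counts, dim=0)
        return sorted_tokens, order, offsets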

Entities

Institutions

  • Unsloth
  • NVIDIA

Sources