Adaptive Tensor Parallelism Accelerates Long-Tail Generation in RLHF

ai-technology · 2026-05-26

Researchers have introduced PAT, an adaptive tensor parallelism (TP) method that changes the TP configuration at runtime during the generation stage of synchronous RLHF training. It targets response-length skew: as shorter responses finish, the effective batch size shrinks and the GPUs are left underutilized while the longest responses keep decoding. PAT uses predictor-guided online reconfiguration, choosing the reconfiguration point and the target TP configuration from offline profiling and triggering a switch only when the predicted latency benefit outweighs the reconfiguration cost.
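To make the skew concrete, here is a minimal sketch of how the number of sequences still decoding falls over time in a batch with one long-tail response. The response lengths are invented for illustration and do not come from the paper; `active_sequences` is a hypothetical helper, not part of PAT.

```python
def active_sequences(response_lengths, step):
    """Count sequences that still have tokens to generate at a 0-indexed decode step."""
    return sum(1 for n in response_lengths if n > step)


# Hypothetical batch: seven short responses and one long-tail response.
lengths = [120, 150, 180, 200, 220, 260, 300, 4000]

for step in (0, 200, 300, 3999):
    # The effective batch size is the number of sequences still decoding.
    print(f"step {step}: {active_sequences(lengths, step)} active")
```

In this toy batch only one sequence is active after step 300, yet decoding runs for thousands more steps on the same GPUs. That long tail is the regime in which a TP configuration chosen for the full batch may no longer be a good fit.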

Key facts

RLHF is a key post-training paradigm for improving model quality.
The synchronous three-stage RLHF pipeline is bottlenecked by the generation stage.
Response-length skew causes the effective batch size to shrink during decoding.
Mainstream frameworks use a static tensor parallelism (TP) configuration.
PAT is an adaptive TP method that dynamically reconfigures TP during generation.
PAT introduces a predictor-guided online reconfiguration method.
The reconfiguration point and the target TP configuration are chosen based on offline profiling.
Reconfiguration is triggered only when the predicted latency benefit outweighs the reconfiguration cost.
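
The trigger rule in the last fact can be sketched as a simple cost-benefit check. This is an illustrative sketch under stated assumptions, not PAT's actual algorithm: the `Profile` record, the `choose_reconfiguration` function, the single fixed switch cost, and every number below are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical offline-profiling entry for one TP degree."""
    tp_degree: int
    step_latency_ms: float  # assumed per-step decode latency at the current batch size


def choose_reconfiguration(
    current: Profile,
    candidates: Iterable[Profile],
    predicted_remaining_steps: int,
    switch_cost_ms: float,
) -> Optional[Profile]:
    """Return the candidate with the largest positive net saving, or None to stay put."""
    best, best_gain = None, 0.0
    for cand in candidates:
        if cand.tp_degree == current.tp_degree:
            continue
        # Predicted benefit: per-step saving times the predicted remaining decode steps.
        saving_ms = (current.step_latency_ms - cand.step_latency_ms) * predicted_remaining_steps
        gain_ms = saving_ms - switch_cost_ms
        if gain_ms > best_gain:
            best, best_gain = cand, gain_ms
    return best


# Made-up profile values, for illustration only.
current = Profile(tp_degree=2, step_latency_ms=30.0)
candidates = [Profile(4, 22.0), Profile(8, 20.0)]

long_tail = choose_reconfiguration(current, candidates, predicted_remaining_steps=1000, switch_cost_ms=5000.0)
short_tail = choose_reconfiguration(current, candidates, predicted_remaining_steps=100, switch_cost_ms=5000.0)
print(long_tail, short_tail)
```

With these invented numbers, a predicted tail of 1000 steps justifies switching, while a tail of 100 steps does not recover the switch cost, so the function returns `None`. A real system would presumably also account for a switch cost that differs per target configuration and for error in the length predictor.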

Sources

arXiv cs.AI — 2026-05-26