BOOST: Bilevel Optimization for Multi-Turn LLM Fine-Tuning
The BOOST framework (Bilevel Optimization of Synthetic Trajectories) addresses the problem of fine-tuning large language models (LLMs) for multi-turn conversations. LLMs perform well in single-turn settings but struggle with extended multi-turn exchanges. Offline reinforcement learning (RL) offers a scalable approach, but its results depend heavily on the quality of the multi-turn trajectory data. Synthetic trajectories generated by LLMs or simulators are often added to the training set; because their quality is uneven, treating all trajectories uniformly can hurt performance. BOOST uses a bilevel optimization strategy: the inner level trains the LLM on reweighted data, while the outer level trains a lightweight reweighting head on held-out real validation tasks. This assigns continuous trajectory-level weights without an external judge. The approach is grounded in a PAC-Bayesian bound that exposes a three-way trade-off: synthetic data increases diversity but can introduce task-shift, and concentrating weight on the best trajectories reduces diversity. The framework is described in a paper available on arXiv (2605.24743).
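The bilevel structure described above can be written in a generic form. The notation here is an assumption made for illustration and is not taken from the paper: \theta denotes the LLM parameters, \phi the parameters of the reweighting head, \tau_i the i-th training trajectory, w_\phi(\tau_i) its continuous weight, \ell a per-trajectory training loss, and \mathcal{L}_{\mathrm{val}} the loss on held-out real validation tasks.

```latex
% Outer level: fit the reweighting head on held-out real validation tasks.
\min_{\phi} \; \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(\phi)\bigr)
% Inner level: train the LLM on the reweighted trajectories.
\quad \text{s.t.} \quad
\theta^{*}(\phi) \in \arg\min_{\theta} \sum_{i=1}^{N} w_{\phi}(\tau_i)\, \ell(\theta; \tau_i)
```

In this reading, the outer level never touches the training trajectories directly; it influences the LLM only through the weights that shape the inner objective.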
Key facts
- BOOST is a bilevel optimization framework for multi-turn LLM fine-tuning.
- It addresses the challenge of heterogeneous quality in synthetic trajectory data.
- The inner level trains the LLM on reweighted data.
- The outer level trains a lightweight reweighting head on held-out real validation tasks.
- Continuous trajectory-level weights are assigned without an external judge.
- A PAC-Bayesian bound reveals a three-way trade-off among diversity, task-shift, and weight concentration.
- The paper is available on arXiv with ID 2605.24743.
- The method targets offline reinforcement learning for LLMs.
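The inner/outer split listed above can be illustrated with a toy sketch. This is not the paper's implementation: the "model" is a single scalar, the inner problem is a weighted least-squares fit with a closed-form solution, and the "reweighting head" is a one-parameter linear scorer followed by a softmax. The names softmax, inner_solve, val_loss, train_head, and all data values are hypothetical. How BOOST actually computes the outer-level update for an LLM is not covered here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def inner_solve(weights, targets):
    """Inner level (toy): minimize sum_i w_i * (theta - y_i)^2.
    With weights summing to 1, the minimizer is the weighted mean."""
    return sum(w * y for w, y in zip(weights, targets))

def val_loss(theta, val_targets):
    """Outer objective (toy): mean squared error on held-out real targets."""
    return sum((theta - v) ** 2 for v in val_targets) / len(val_targets)

def train_head(features, targets, val_targets, steps=500, lr=1.0):
    """Outer level (toy): gradient descent on the head parameter `a`,
    differentiating the validation loss through the inner solution."""
    a = 0.0  # a = 0 gives uniform weights
    for _ in range(steps):
        weights = softmax([a * f for f in features])
        theta = inner_solve(weights, targets)
        # d(val_loss)/d(theta)
        dloss_dtheta = sum(2 * (theta - v) for v in val_targets) / len(val_targets)
        # d(theta)/d(a): the softmax derivative gives w_j * (y_j - theta) per score
        dtheta_da = sum(w * (y - theta) * f
                        for w, y, f in zip(weights, targets, features))
        a -= lr * dloss_dtheta * dtheta_da
    return a

# Hypothetical data: one feature and one target per synthetic trajectory.
# The last two trajectories are deliberately off-task.
features = [1.0, 0.9, 0.1, 0.0]
targets = [1.0, 1.1, 3.0, -2.0]
val_targets = [1.0, 1.05]  # held-out "real" validation targets

uniform = softmax([0.0] * len(features))
loss_uniform = val_loss(inner_solve(uniform, targets), val_targets)

learned_a = train_head(features, targets, val_targets)
weights = softmax([learned_a * f for f in features])
loss_reweighted = val_loss(inner_solve(weights, targets), val_targets)

print("weights:", [round(w, 3) for w in weights])
print("validation loss, uniform vs reweighted:", loss_uniform, loss_reweighted)
```

In this toy setting the head learns to shift weight toward the trajectories whose targets agree with the validation data, and the validation loss falls below the uniform-weight baseline. The softmax keeps the weights continuous, so off-task trajectories are down-weighted rather than discarded outright.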
Entities
Institutions
- arXiv