SeDT: Improving LLM Multi-Turn Reliability via Reinforcement Learning Conditioning
A recent study reports that large language models (LLMs) can lose up to 39% of their performance when a task is disclosed gradually over multiple turns, a phenomenon referred to as 'Lost in Conversation.' The drop is attributed mainly to reliability rather than capability: best-case aptitude decreases by only 16%, while unreliability rises by over 112%. The researchers argue that the underlying issue is structural. A flat conversation history gives every previous turn equal weight, which makes it harder for the model to separate essential constraints from trivial dialogue. To remedy this, they introduce SeDT (Sentence-transformer Decision-Transformer), a training-free method that borrows return-to-go conditioning from offline reinforcement learning. SeDT annotates each segment (shard) of the conversation with a cumulative relevance score built from three elements: a sentence transformer for semantic relevance, a decision transformer for sequential choices, and a return-to-go mechanism that emphasizes valuable turns. Because it operates only at inference time, the approach can be applied to any existing LLM without further training. The paper is available on arXiv (ID 2605.26788).
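The return-to-go idea described above can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the bag-of-words cosine similarity stands in for a real sentence-transformer embedding, and the function names (`returns_to_go`, `annotate_shards`) and the annotation format are hypothetical. It assumes each shard's per-turn relevance acts as a reward and that its return-to-go is the sum of rewards from that shard to the end of the conversation, as in offline RL.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for a sentence embedding (assumption, not a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def returns_to_go(rewards: list[float]) -> list[float]:
    """Return-to-go at step t is the sum of rewards from t to the last step."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def annotate_shards(shards: list[str], task: str) -> list[str]:
    """Prefix each shard with its relevance and cumulative score (hypothetical format)."""
    task_vec = bag_of_words(task)
    rewards = [cosine(bag_of_words(s), task_vec) for s in shards]
    rtg = returns_to_go(rewards)
    return [
        f"[rtg={g:.2f} rel={r:.2f}] {s}" for s, r, g in zip(shards, rewards, rtg)
    ]


if __name__ == "__main__":
    task = "write a python function that sorts a list of integers"
    shards = [
        "I need a python function that sorts a list",
        "thanks, that looks nice so far",
        "the list contains integers and must be sorted in descending order",
    ]
    for line in annotate_shards(shards, task):
        print(line)
```

The annotated lines would then replace the flat history in the prompt, so the model sees an explicit signal for which turns carry constraints. How SeDT actually scores and formats shards is described in the paper, not here.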
Key facts
- LLMs lose up to 39% performance in multi-turn tasks.
- Best-case aptitude drops only 16%.
- Unreliability more than doubles (+112%).
- Root cause is flat conversation history with equal turn weighting.
- SeDT uses return-to-go conditioning from offline reinforcement learning.
- SeDT is training-free and inference-time only.
- Method annotates conversation shards with cumulative relevance scores.
- Paper available on arXiv: 2605.26788.
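The distinction between aptitude and unreliability in the facts above can be illustrated with a short sketch. It assumes a percentile-based reading that this summary does not spell out: aptitude as the 90th-percentile score over repeated runs, and unreliability as the gap between the 90th and 10th percentiles. The score lists are made-up illustrative values, not results from the paper.

```python
import statistics


def aptitude_and_unreliability(scores: list[float]) -> tuple[float, float]:
    """Assumed definitions: aptitude = P90, unreliability = P90 - P10."""
    deciles = statistics.quantiles(scores, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    return p90, p90 - p10


# Hypothetical scores from ten repeated runs of the same task.
single_turn = [88, 90, 91, 92, 89, 90, 93, 87, 90, 91]
multi_turn = [30, 85, 60, 78, 45, 82, 20, 75, 55, 80]

apt_single, unrel_single = aptitude_and_unreliability(single_turn)
apt_multi, unrel_multi = aptitude_and_unreliability(multi_turn)

if __name__ == "__main__":
    print(f"single-turn: aptitude={apt_single:.1f}, unreliability={unrel_single:.1f}")
    print(f"multi-turn:  aptitude={apt_multi:.1f}, unreliability={unrel_multi:.1f}")
```

Under these assumed definitions, a model can keep most of its best-case score in the multi-turn setting while the spread between good and bad runs grows sharply, which is the pattern the key facts describe.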
Entities
Institutions
- arXiv