RLearner-LLM: Hybrid-DPO Enhances Logical Grounding in LLMs
RLearner-LLM introduces Hybrid Direct Preference Optimization (Hybrid-DPO) to address the logical alignment failures of large language models (LLMs). The approach fuses a natural language inference (NLI) signal from DeBERTa-v3 with a verifier-LLM score, eliminating the need for human annotation. Standard DPO exhibits a verbosity bias, rewarding fluency over logical correctness: SFT models produce fluent text yet reach NLI entailment scores of only 0.05-0.22. Hybrid-DPO mitigates this 'alignment tax', delivering up to a sixfold NLI improvement across five academic domains (including Biology, Medicine, and Law) and three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it). NLI gains appeared in 11 of 15 domain-architecture cells, with consistent improvements in answer coverage, and were especially pronounced on Gemma 4 E4B-it (4.5B effective parameters).
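Below is a minimal sketch of how such a hybrid preference signal could be computed. The DeBERTa-v3 checkpoint name, the fusion weight `alpha`, the `verifier_score` stub, and the pair-construction heuristic are all illustrative assumptions; the summary specifies only that an NLI signal and a verifier-LLM score are fused.

```python
# Sketch of a hybrid preference signal: NLI entailment fused with a verifier score.
# Checkpoint, alpha, and the verifier stub are assumptions, not the paper's config.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_CKPT = "cross-encoder/nli-deberta-v3-base"  # assumed DeBERTa-v3 NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(NLI_CKPT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CKPT).eval()

def nli_entailment(premise: str, hypothesis: str) -> float:
    """P(entailment) that `hypothesis` follows from `premise`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    # Look up the entailment index from the config instead of hard-coding it.
    ent_idx = next(i for i, lbl in nli_model.config.id2label.items()
                   if lbl.lower() == "entailment")
    return probs[ent_idx].item()

def verifier_score(question: str, answer: str) -> float:
    """Placeholder for the verifier-LLM score in [0, 1]; would call a judge LLM."""
    raise NotImplementedError

def hybrid_reward(context: str, question: str, answer: str,
                  alpha: float = 0.5) -> float:
    """Fuse logical grounding (NLI) with the verifier score; alpha is assumed."""
    return (alpha * nli_entailment(context, answer)
            + (1 - alpha) * verifier_score(question, answer))

def preference_pair(context: str, question: str, candidates: list[str],
                    alpha: float = 0.5) -> tuple[str, str]:
    """Rank sampled candidates by hybrid reward; highest becomes the 'chosen'
    response and lowest the 'rejected' one for DPO training (an assumed heuristic)."""
    ranked = sorted(candidates, key=lambda a: hybrid_reward(context, question, a, alpha))
    return ranked[-1], ranked[0]
```

Because the signal is fully automatic, preference pairs can be generated at scale over sampled model outputs, which is what removes the human-annotation requirement.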
Key facts
- RLearner-LLM uses Hybrid-DPO to balance logical grounding and fluency.
- Hybrid-DPO fuses DeBERTa-v3 NLI signal with a verifier LLM score.
- Standard DPO has a verbosity bias favoring fluency over logical correctness (see the objective sketch after this list).
- SFT models achieve NLI entailment of only 0.05-0.22.
- Evaluated on five academic domains, including Biology, Medicine, and Law.
- Base architectures: LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it.
- Up to 6x NLI improvement over SFT.
- NLI gains in 11 of 15 cells.
- Gemma 4 E4B-it has 4.5B effective parameters.
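For context, the objective that standard DPO optimizes (Rafailov et al., 2023) can be sketched as below. Whether Hybrid-DPO alters this loss or only the construction of the preference pairs is not stated in the summary, so the loss shown is the standard one, under that assumption.

```python
# Standard DPO loss on per-sequence log-probabilities; beta is the usual
# KL-strength hyperparameter. Under the assumption above, Hybrid-DPO would
# differ only in how chosen/rejected pairs are selected (via the hybrid reward).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: log-ratio of policy to frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```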
Entities
Models
- DeBERTa
- LLaMA
- Qwen
- Gemma
Platforms
- arXiv