RLearner-LLM: Hybrid-DPO Enhances Logical Grounding in LLMs
RLearner-LLM introduces Hybrid Direct Preference Optimization (Hybrid-DPO) to address the logical alignment failures of large language models (LLMs). The approach fuses a natural language inference (NLI) signal from DeBERTa-v3 with a verifier-LLM score, eliminating the need for human annotation. Standard DPO exhibits a verbosity bias, rewarding fluency over logical correctness: SFT models produce fluent text yet reach NLI entailment scores of only 0.05-0.22. Hybrid-DPO mitigates this 'alignment tax', delivering up to a sixfold NLI improvement across five academic domains (including Biology, Medicine, and Law) and three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it). NLI gains appeared in 11 of 15 domain-architecture cells, with consistent improvements in answer coverage, and were especially pronounced on Gemma 4 E4B-it (4.5B effective parameters).
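Below is a minimal sketch of how such a hybrid preference signal could be computed. The DeBERTa-v3 checkpoint name, the fusion weight `alpha`, the `verifier_score` stub, and the pair-construction heuristic are all illustrative assumptions; the summary specifies only that an NLI signal and a verifier-LLM score are fused.

```python
# Sketch of a hybrid preference signal: NLI entailment fused with a verifier score.
# Checkpoint, alpha, and the verifier stub are assumptions, not the paper's config.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_CKPT = "cross-encoder/nli-deberta-v3-base"  # assumed DeBERTa-v3 NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(NLI_CKPT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CKPT).eval()

def nli_entailment(premise: str, hypothesis: str) -> float:
    """P(entailment) that `hypothesis` follows from `premise`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    # Look up the entailment index from the config instead of hard-coding it.
    ent_idx = next(i for i, lbl in nli_model.config.id2label.items()
                   if lbl.lower() == "entailment")
    return probs[ent_idx].item()

def verifier_score(question: str, answer: str) -> float:
    """Placeholder for the verifier-LLM score in [0, 1]; would call a judge LLM."""
    raise NotImplementedError

def hybrid_reward(context: str, question: str, answer: str,
                  alpha: float = 0.5) -> float:
    """Fuse logical grounding (NLI) with the verifier score; alpha is assumed."""
    return (alpha * nli_entailment(context, answer)
            + (1 - alpha) * verifier_score(question, answer))

def preference_pair(context: str, question: str, candidates: list[str],
                    alpha: float = 0.5) -> tuple[str, str]:
    """Rank sampled candidates by hybrid reward; highest becomes the 'chosen'
    response and lowest the 'rejected' one for DPO training (an assumed heuristic)."""
    ranked = sorted(candidates, key=lambda a: hybrid_reward(context, question, a, alpha))
    return ranked[-1], ranked[0]
```

Because the signal is fully automatic, preference pairs can be generated at scale over sampled model outputs, which is what removes the human-annotation requirement.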
Key facts
- RLearner-LLM uses Hybrid-DPO to balance logical grounding and fluency.
- Hybrid-DPO fuses DeBERTa-v3 NLI signal with a verifier LLM score.
- Standard DPO has a verbosity bias favoring fluency over logical correctness (see the objective sketch after this list).
- SFT models achieve NLI entailment of only 0.05-0.22.
- Evaluated on five academic domains, including Biology, Medicine, and Law.
- Base architectures: LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it.
- Up to 6x NLI improvement over SFT.
- NLI gains in 11 of 15 cells.
- Gemma 4 E4B-it has 4.5B effective parameters.
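For context, the objective that standard DPO optimizes (Rafailov et al., 2023) can be sketched as below. Whether Hybrid-DPO alters this loss or only the construction of the preference pairs is not stated in the summary, so the loss shown is the standard one, under that assumption.

```python
# Standard DPO loss on per-sequence log-probabilities; beta is the usual
# KL-strength hyperparameter. Under the assumption above, Hybrid-DPO would
# differ only in how chosen/rejected pairs are selected (via the hybrid reward).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: log-ratio of policy to frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```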
Entities
Models
- DeBERTa
- LLaMA
- Qwen
- Gemma
Platforms
- arXiv