ARTFEED — Contemporary Art Intelligence

RLearner-LLM: Hybrid-DPO Enhances Logical Grounding in LLMs

ai-technology · 2026-05-07

The introduction of RLearner-LLM, built on Hybrid Direct Preference Optimization (Hybrid-DPO), tackles the logical-alignment issues of large language models (LLMs). The approach fuses a natural language inference (NLI) signal from DeBERTa-v3 with a verifier-LLM score, eliminating the need for human annotation. Standard DPO exhibits a verbosity bias, rewarding fluency over logical correctness: SFT models produce fluent text yet reach NLI entailment scores of only 0.05-0.22. Hybrid-DPO mitigates this 'alignment tax', achieving up to a sixfold NLI improvement across five academic domains (including Biology, Medicine, and Law) and three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it). NLI gains were recorded in 11 of 15 domain-architecture cells, with consistent improvements in answer coverage, and the gains on Gemma 4 E4B-it (4.5B effective parameters) were especially pronounced. How the two signals might be fused is sketched below.
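
The fusion step is easiest to see in code. The sketch below is a minimal reconstruction under stated assumptions, not the paper's implementation: the NLI checkpoint, the blending weight alpha, and the helper names are all hypothetical, and the verifier-LLM score is taken as a given number in [0, 1].

    # Minimal sketch of Hybrid-DPO preference labeling (assumptions, not the
    # paper's code): fuse a DeBERTa-v3 NLI entailment probability with a
    # verifier-LLM score, then rank sampled candidates into a DPO pair.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Any DeBERTa-v3 NLI checkpoint works here; this public one is an
    # assumption, not the checkpoint named by the paper.
    NLI_MODEL = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
    tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
    nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

    def entailment_prob(premise: str, hypothesis: str) -> float:
        """P(entailment) of an answer given a reference, via DeBERTa-v3 NLI."""
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = nli_model(**inputs).logits
        # For this checkpoint the label order is entailment/neutral/contradiction.
        return torch.softmax(logits, dim=-1)[0, 0].item()

    def hybrid_score(reference: str, answer: str, verifier_score: float,
                     alpha: float = 0.5) -> float:
        """Blend the NLI signal with a verifier-LLM score in [0, 1].
        alpha is a hypothetical weight; the summary reports no value."""
        return alpha * entailment_prob(reference, answer) + (1.0 - alpha) * verifier_score

    def make_preference_pair(prompt, reference, candidates, verifier_scores):
        """Best- and worst-scoring candidates become the (chosen, rejected) pair."""
        ranked = sorted(
            zip(candidates, verifier_scores),
            key=lambda cv: hybrid_score(reference, cv[0], cv[1]),
            reverse=True,
        )
        return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}

Because both judges are automatic, the pipeline produces preference pairs with no human annotation, which is the property the summary highlights.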

Key facts

  • RLearner-LLM uses Hybrid-DPO to balance logical grounding and fluency.
  • Hybrid-DPO fuses DeBERTa-v3 NLI signal with a verifier LLM score.
  • Standard DPO has a verbosity bias favoring fluency over logical correctness (see the loss sketch after this list).
  • SFT models achieve NLI entailment of only 0.05-0.22.
  • Evaluated on five academic domains, including Biology, Medicine, and Law.
  • Base architectures: LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it.
  • Up to 6x NLI improvement over SFT.
  • NLI gains in 11 of 15 cells.
  • Gemma 4 E4B-it has 4.5B effective parameters.
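
For context, the objective those preference pairs feed into is the standard DPO loss; under the summary's framing, Hybrid-DPO changes how pairs are labeled rather than the loss itself (an assumption). A textbook sketch, with beta as a free hyperparameter:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Standard DPO: push the policy's chosen-vs-rejected log-prob margin
        above the reference model's margin. beta=0.1 is a common default,
        not a value reported for RLearner-LLM."""
        margin = (policy_chosen_logps - ref_chosen_logps) \
               - (policy_rejected_logps - ref_rejected_logps)
        return -F.logsigmoid(beta * margin).mean()

Because the loss only compares whole responses, whichever signal decides the chosen/rejected labels decides what the model optimizes for; labeling by fluency alone is the source of the verbosity bias that Hybrid-DPO's logic-aware labels counteract.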

Entities

Models & platforms

  • arXiv
  • DeBERTa
  • LLaMA
  • Qwen
  • Gemma
