Aligning Language Models with Online Natural Language Feedback
Researchers have developed techniques for aligning language models in fuzzy domains where human experts can provide high-quality supervision, but only for a small number of outputs, via online natural language feedback. The approach iteratively optimizes the model against a proxy reward signal, halts before over-optimization sets in, gathers fresh expert supervision, and refines the proxy reward. Proxy reward models are built from language models via in-context learning and fine-tuning. The methods were evaluated on the creative writing abilities of Qwen3-8B and the alignment research abilities of Haiku 4.5.
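The outer loop can be pictured as follows. This is a minimal executable sketch under toy assumptions: the policy, proxy reward, expert critic, and the plateau-based early-stopping check are all hypothetical stand-ins, not the paper's implementation.

```python
# Toy sketch of the loop: optimize against a proxy reward, halt before
# over-optimization, gather expert feedback, refine the proxy, repeat.
import random

random.seed(0)


def policy_generate(temperature: float) -> str:
    """Hypothetical policy: emits a draft whose length drifts with training."""
    return "draft " * random.randint(1, int(10 * temperature) + 1)


def proxy_reward(output: str, critiques: list[str]) -> float:
    """Toy proxy reward: favors length, minus a penalty per past critique."""
    return len(output.split()) - 2.0 * len(critiques)


def expert_critique(output: str) -> str:
    """Stand-in for online natural language feedback from a human expert."""
    return f"Too long ({len(output.split())} words); tighten the prose."


def align(num_rounds: int = 3, steps_per_round: int = 20) -> None:
    critiques: list[str] = []   # accumulated expert feedback
    temperature = 1.0           # crude knob standing in for policy parameters
    for rnd in range(num_rounds):
        best, stale = float("-inf"), 0
        for _ in range(steps_per_round):
            out = policy_generate(temperature)
            r = proxy_reward(out, critiques)
            if r > best:
                best, stale = r, 0
                temperature *= 1.05   # "update" toward higher proxy reward
            else:
                stale += 1
            if stale >= 5:            # plateau check: halt before over-optimization
                break
        # Collect fresh expert supervision on a current output, then fold it
        # back into the proxy reward for the next round.
        critiques.append(expert_critique(policy_generate(temperature)))
        print(f"round {rnd}: proxy reward {best:.1f}, critiques {len(critiques)}")


if __name__ == "__main__":
    align()
```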
Key facts
- The paper is arXiv:2605.04356v1.
- Methods align language models in fuzzy domains with online natural language feedback.
- Training involves iterative optimization against proxy reward signals.
- Proxy reward models use in-context learning and fine-tuning (the in-context variant is sketched after this list).
- Tests conducted on Qwen3-8B for creative writing.
- Tests conducted on Haiku 4.5 for alignment research.
- Human experts provide high-quality supervision for a small number of outputs.
- The approach halts optimization before over-optimization occurs.
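For the in-context-learning variant of the proxy reward model noted above, one plausible shape is an LLM judge whose prompt carries the accumulated expert feedback. The `complete` callable, prompt format, and score parsing below are assumptions for illustration, not the paper's interface.

```python
# Hypothetical in-context-learning proxy reward: pack past expert critiques
# into a judge prompt and parse a numeric score. No weight updates needed.
from typing import Callable


def build_judge_prompt(output: str, feedback_examples: list[tuple[str, str]]) -> str:
    """Assemble a scoring prompt that carries expert feedback in-context."""
    lines = ["Score the final OUTPUT from 1-10 given the expert feedback history.\n"]
    for past_output, critique in feedback_examples:
        lines.append(f"OUTPUT: {past_output}\nEXPERT FEEDBACK: {critique}\n")
    lines.append(f"OUTPUT: {output}\nSCORE (1-10):")
    return "\n".join(lines)


def icl_proxy_reward(
    output: str,
    feedback_examples: list[tuple[str, str]],
    complete: Callable[[str], str],
) -> float:
    """Score an output with an LLM judge; `complete` is an assumed generic
    text-completion interface, not any specific API."""
    reply = complete(build_judge_prompt(output, feedback_examples)).strip()
    try:
        return float(reply.split()[0])  # naive parse; real use needs more care
    except (ValueError, IndexError):
        return 5.0                      # neutral fallback on unparseable replies


# Usage with a dummy completer standing in for a real judge model:
if __name__ == "__main__":
    dummy = lambda prompt: "7"
    history = [("The moon sang.", "Evocative, but develop the image further.")]
    print(icl_proxy_reward("The moon sang a silver hymn.", history, dummy))
```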
Entities
Institutions
- arXiv