Aligning Language Models with Online Natural Language Feedback
Researchers have developed techniques for aligning language models in fuzzy domains where human experts can provide high-quality supervision, but only for a small number of outputs, via online natural language feedback. The approach iteratively optimizes the model against a proxy reward signal, halts before over-optimization sets in, gathers fresh expert supervision, and refines the proxy reward. Proxy reward models are built from language models via in-context learning and fine-tuning. The methods were evaluated on the creative writing abilities of Qwen3-8B and the alignment research abilities of Haiku 4.5.
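The outer loop can be pictured as follows. This is a minimal executable sketch under toy assumptions: the policy, proxy reward, expert critic, and the plateau-based early-stopping check are all hypothetical stand-ins, not the paper's implementation.

```python
# Toy sketch of the loop: optimize against a proxy reward, halt before
# over-optimization, gather expert feedback, refine the proxy, repeat.
import random

random.seed(0)


def policy_generate(temperature: float) -> str:
    """Hypothetical policy: emits a draft whose length drifts with training."""
    return "draft " * random.randint(1, int(10 * temperature) + 1)


def proxy_reward(output: str, critiques: list[str]) -> float:
    """Toy proxy reward: favors length, minus a penalty per past critique."""
    return len(output.split()) - 2.0 * len(critiques)


def expert_critique(output: str) -> str:
    """Stand-in for online natural language feedback from a human expert."""
    return f"Too long ({len(output.split())} words); tighten the prose."


def align(num_rounds: int = 3, steps_per_round: int = 20) -> None:
    critiques: list[str] = []   # accumulated expert feedback
    temperature = 1.0           # crude knob standing in for policy parameters
    for rnd in range(num_rounds):
        best, stale = float("-inf"), 0
        for _ in range(steps_per_round):
            out = policy_generate(temperature)
            r = proxy_reward(out, critiques)
            if r > best:
                best, stale = r, 0
                temperature *= 1.05   # "update" toward higher proxy reward
            else:
                stale += 1
            if stale >= 5:            # plateau check: halt before over-optimization
                break
        # Collect fresh expert supervision on a current output, then fold it
        # back into the proxy reward for the next round.
        critiques.append(expert_critique(policy_generate(temperature)))
        print(f"round {rnd}: proxy reward {best:.1f}, critiques {len(critiques)}")


if __name__ == "__main__":
    align()
```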
Key facts
- The paper is arXiv:2605.04356v1.
- Methods align language models in fuzzy domains with online natural language feedback.
- Training involves iterative optimization against proxy reward signals.
- Proxy reward models use in-context learning and fine-tuning (the in-context variant is sketched after this list).
- Tests conducted on Qwen3-8B for creative writing.
- Tests conducted on Haiku 4.5 for alignment research.
- Human experts provide high-quality supervision for a small number of outputs.
- The approach halts optimization before over-optimization occurs.
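For the in-context-learning variant of the proxy reward model noted above, one plausible shape is an LLM judge whose prompt carries the accumulated expert feedback. The `complete` callable, prompt format, and score parsing below are assumptions for illustration, not the paper's interface.

```python
# Hypothetical in-context-learning proxy reward: pack past expert critiques
# into a judge prompt and parse a numeric score. No weight updates needed.
from typing import Callable


def build_judge_prompt(output: str, feedback_examples: list[tuple[str, str]]) -> str:
    """Assemble a scoring prompt that carries expert feedback in-context."""
    lines = ["Score the final OUTPUT from 1-10 given the expert feedback history.\n"]
    for past_output, critique in feedback_examples:
        lines.append(f"OUTPUT: {past_output}\nEXPERT FEEDBACK: {critique}\n")
    lines.append(f"OUTPUT: {output}\nSCORE (1-10):")
    return "\n".join(lines)


def icl_proxy_reward(
    output: str,
    feedback_examples: list[tuple[str, str]],
    complete: Callable[[str], str],
) -> float:
    """Score an output with an LLM judge; `complete` is an assumed generic
    text-completion interface, not any specific API."""
    reply = complete(build_judge_prompt(output, feedback_examples)).strip()
    try:
        return float(reply.split()[0])  # naive parse; real use needs more care
    except (ValueError, IndexError):
        return 5.0                      # neutral fallback on unparseable replies


# Usage with a dummy completer standing in for a real judge model:
if __name__ == "__main__":
    dummy = lambda prompt: "7"
    history = [("The moon sang.", "Evocative, but develop the image further.")]
    print(icl_proxy_reward("The moon sang a silver hymn.", history, dummy))
```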
Entities
Institutions
- arXiv