ARTFEED — Contemporary Art Intelligence

MIPO: Mutual Information Preference Optimization enhances LLM personalization without extra data

ai-technology · 2026-05-07

Researchers propose Mutual Information Preference Optimization (MIPO), a self-improvement framework for large language models (LLMs) that requires no additional human-labeled data and no external verifiers. MIPO constructs preference pairs by generating a positive response conditioned on the correct prompt and a negative response conditioned on a random, unrelated prompt. Training on these pairs with Direct Preference Optimization (DPO) maximizes the pointwise conditional mutual information between prompts and model responses under the base LLM. This addresses a limitation of existing post-training methods, which rely on expensive human oversight or on data the model has already exploited. The method is detailed in arXiv preprint 2603.19294v2.

Key facts

  • MIPO stands for Mutual Information Preference Optimization.
  • It constructs preference pairs using correct and random prompts.
  • The positive response is conditioned on the correct prompt.
  • The negative response is conditioned on a random, unrelated prompt.
  • DPO is used to learn from the paired data.
  • The method maximizes pointwise conditional mutual information.
  • No additional human-labeled data or external verifiers are needed.
  • The preprint is available on arXiv with ID 2603.19294v2.
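The pairing scheme described above can be sketched in a few lines. The function names, the pairing helper, and the scalar per-pair DPO loss below are illustrative simplifications, not the paper's implementation: a real setup would compute token-level log-probabilities under the policy and a frozen reference model rather than take them as scalars.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for one preference pair (scalar sketch).

    logp_*     : policy log-probability of the preferred / rejected response
    ref_logp_* : the same quantities under the frozen reference (base) model
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log sigmoid(margin): small when the preferred response is favored
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def build_mipo_pair(prompts, responses, i, j):
    """MIPO-style pair construction (hypothetical helper).

    The positive pairs prompt i with the response generated from prompt i;
    the negative pairs prompt i with a response generated from an
    unrelated prompt j, so no human labels or verifiers are needed.
    """
    assert i != j, "the negative must come from a different, random prompt"
    positive = (prompts[i], responses[i])  # response matches its prompt
    negative = (prompts[i], responses[j])  # response from an unrelated prompt
    return positive, negative
```

Intuitively, a response generated from a random, unrelated prompt behaves like a draw from the marginal response distribution, so preferring the matched pair over the mismatched one pushes the policy toward responses with high pointwise conditional mutual information with their prompts, consistent with the objective the article summarizes.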

Entities

Institutions

  • arXiv

Sources