Macro: Preference Alignment Framework Improves Multilingual Counterfactual Generation
Macro is a preference alignment framework that uses Direct Preference Optimization (DPO) to improve how large language models generate multilingual counterfactual explanations. Self-generated counterfactual explanations (SCEs) are minimally altered inputs that flip an LLM's prediction, offering insight into its otherwise opaque behavior. Producing SCEs in languages other than English has been difficult because two goals pull against each other: validity (the edited input actually flips the prediction) and minimality (the edit stays small). Macro addresses this with a composite scoring function that quantifies the trade-off and is used to construct preference pairs for DPO training. Experiments on four LLMs across seven diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality.
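To make the preference-pair construction concrete, here is a minimal Python sketch. The summary does not give the actual form of Macro's composite scoring function, so everything below is an assumption for illustration: the weighted form (a label-flip reward minus a normalized edit-distance penalty), the `weight` and `margin` parameters, the whitespace tokenization, and the function names `token_edit_distance`, `composite_score`, and `build_preference_pairs` are all hypothetical.

```python
from itertools import combinations


def token_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def composite_score(original, candidate, flipped, weight=0.5):
    """Hypothetical score: reward a prediction flip, penalize edit size.

    Assumption: whitespace tokenization, which is a poor fit for languages
    written without spaces; a real system would use a proper tokenizer.
    """
    orig_toks, cand_toks = original.split(), candidate.split()
    distance = token_edit_distance(orig_toks, cand_toks)
    normalized = distance / max(len(orig_toks), len(cand_toks), 1)
    validity = 1.0 if flipped else 0.0
    return validity - weight * normalized


def build_preference_pairs(original, candidates, margin=0.1):
    """Turn scored candidates into (chosen, rejected) records.

    candidates: list of (text, flipped) tuples, where `flipped` says whether
    the model's prediction changed on the candidate text.
    Pairs whose scores differ by less than `margin` are skipped.
    """
    scored = [(composite_score(original, text, flipped), text)
              for text, flipped in candidates]
    pairs = []
    for (s1, t1), (s2, t2) in combinations(scored, 2):
        if abs(s1 - s2) < margin:
            continue
        chosen, rejected = (t1, t2) if s1 > s2 else (t2, t1)
        pairs.append({"prompt": original, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    candidates = [
        ("the movie was terrible", True),   # valid and minimal
        ("a dull film overall", True),      # valid but heavily rewritten
        ("the movie was good", False),      # minimal but does not flip
    ]
    for pair in build_preference_pairs("the movie was great", candidates):
        print(pair["chosen"], ">", pair["rejected"])
```

Under these assumptions, a candidate that flips the prediction with a one-word edit outranks both a valid but heavily rewritten candidate and a small edit that fails to flip, which is the kind of ordering a preference pair is meant to encode.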
Key facts
- Macro is a preference alignment framework for multilingual SCE generation.
- It applies Direct Preference Optimization (DPO).
- SCEs are minimally modified inputs that flip LLM predictions.
- It targets the trade-off between validity (the prediction flips) and minimality (the edit stays small).
- A composite scoring function constructs preference pairs.
- Experiments involved four LLMs and seven languages.
- Validity improved by 12.55% on average over the chain-of-thought baseline.
- Minimality was not degraded.
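For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It illustrates the general DPO objective, not Macro's training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative choices, and the summary does not state Macro's hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the chosen response more than the reference does:
    # the loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy assigns relatively more probability to the chosen counterfactual than to the rejected one, measured against the reference model.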