PERSA: RLHF Aligns LLM Feedback with Professor Style
Researchers have developed PERSA, a pipeline that tunes transformer-based large language models to give programming feedback in a particular professor's grading style. The method combines supervised fine-tuning on professor demonstrations, reward modeling learned from pairwise preferences, and Proximal Policy Optimization (PPO). Fine-tuning is parameter-efficient and targets the style-carrying components: only the top transformer blocks and their feed-forward projections are updated. This lets the model match an instructor's tone while preserving diagnostic correctness. Further details can be found in arXiv:2605.01123.
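The reward-modeling stage, trained from pairwise preferences, is typically a Bradley–Terry objective: the professor-preferred feedback should score higher than the rejected alternative. A minimal sketch of that loss (the function name and scores here are illustrative assumptions, not from the paper):

```python
import math

def pairwise_reward_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair: -log sigmoid(r_w - r_l).
    Driving this loss down pushes the reward model to score the
    professor-preferred feedback above the rejected one."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores: preferred feedback 1.2, rejected feedback -0.3
loss = pairwise_reward_loss(1.2, -0.3)   # ≈ 0.2014
```

The trained reward model then supplies the scalar reward that PPO maximizes in the final stage.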
Key facts
- PERSA uses RLHF to align LLM feedback with professor style.
- Pipeline includes supervised fine-tuning, reward modeling, and PPO.
- Only top transformer blocks and feed-forward projections are updated.
- Parameter-efficient fine-tuning is employed.
- Aims to maintain diagnostic correctness while matching tone.
- Published on arXiv with ID 2605.01123.
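The parameter-efficient scheme in the facts above can be sketched as a name-based filter that freezes everything except the feed-forward projections in the top transformer blocks. The parameter-naming layout (`blocks.<i>.ffn.*`) and the block counts are assumptions for illustration, not the paper's actual configuration:

```python
def is_trainable(param_name: str, num_blocks: int = 12, top_k: int = 2) -> bool:
    """Return True only for feed-forward projections in the top-k blocks;
    all other parameters stay frozen during fine-tuning."""
    parts = param_name.split(".")
    if parts[0] != "blocks":
        return False  # embeddings, heads, etc. are frozen
    block_idx = int(parts[1])
    in_top_blocks = block_idx >= num_blocks - top_k
    is_ffn_projection = parts[2] == "ffn"
    return in_top_blocks and is_ffn_projection

params = [
    "blocks.0.attn.q_proj",   # frozen: not a top block
    "blocks.10.ffn.w_in",     # trainable
    "blocks.11.ffn.w_out",    # trainable
    "blocks.11.attn.q_proj",  # frozen: attention, not feed-forward
]
trainable = [p for p in params if is_trainable(p)]
# → ['blocks.10.ffn.w_in', 'blocks.11.ffn.w_out']
```

In a real training loop the same predicate would set `requires_grad` per parameter, so the optimizer only touches the style-carrying weights.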