Hindsight Preference Optimization Improves LLM Financial Advisory
Researchers propose Hindsight Preference Optimization (HPO), a method that uses observed market outcomes to generate preference pairs for training language models on financial time-series advisory. HPO bridges reinforcement learning and preference alignment: because advisory quality depends on outcomes that are unknown at prediction time, an LLM judge ranks candidate advisories in hindsight, capturing quality dimensions beyond scalar metrics, and those rankings become training signals without human annotation. When HPO was applied to Vision-Language-Model-based S&P 500 equity advisories, a 4B model outperformed its 235B teacher in both accuracy and advisory quality.
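As an illustration of the core idea, the sketch below builds preference pairs from a realized outcome. All names here (`Advisory`, `score_with_hindsight`, `build_preference_pairs`) and the toy directional scoring rule are assumptions for exposition, not the paper's implementation, which uses an LLM judge rather than a scalar score:

```python
# Minimal sketch: turn hindsight (the realized outcome) into DPO-style
# preference pairs. The scoring rule is a toy stand-in for the paper's
# LLM judge.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Advisory:
    text: str                # model-generated advisory
    predicted_return: float  # directional call implied by the advisory

def score_with_hindsight(adv: Advisory, realized_return: float) -> float:
    """Toy hindsight score: advisories whose implied call is closer
    to the realized outcome score higher."""
    return -abs(adv.predicted_return - realized_return)

def build_preference_pairs(candidates, realized_return):
    """Convert hindsight scores into (chosen, rejected) text pairs.
    Each pair where one candidate outscores another becomes a DPO
    training example, with no human annotation required."""
    scored = [(score_with_hindsight(c, realized_return), c) for c in candidates]
    pairs = []
    for (s_a, a), (s_b, b) in combinations(scored, 2):
        if s_a > s_b:
            pairs.append((a.text, b.text))  # (chosen, rejected)
        elif s_b > s_a:
            pairs.append((b.text, a.text))
    return pairs

if __name__ == "__main__":
    candidates = [
        Advisory("Expect a modest rally; overweight.", 0.02),
        Advisory("Expect a sharp drawdown; underweight.", -0.05),
    ]
    # Realized return is observed only after the prediction window closes.
    print(build_preference_pairs(candidates, realized_return=0.018))
```

In the paper's setting, the scalar comparison above is replaced by an LLM judge that sees the realized outcome and ranks full advisories on quality dimensions a single metric cannot capture.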
Key facts
- Hindsight Preference Optimization (HPO) proposed for financial time series advisory
- HPO uses observed outcomes to generate preference pairs for DPO without human annotation (standard DPO objective shown after this list)
- Applied to Vision-Language-Model-based S&P 500 equity advisories
- 4B model outperformed 235B teacher in accuracy and advisory quality
- Bridges reinforcement learning and preference alignment
- Addresses challenge of outcome-dependent advisory quality
- Uses hindsight information unavailable at prediction time
- Published on arXiv
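For reference, the hindsight-ranked pairs can be plugged into the standard DPO objective (Rafailov et al., 2023), which raises the likelihood of the judge-preferred advisory $y_w$ over the rejected one $y_l$ relative to a frozen reference policy; the symbols below ($\beta$, $\pi_{\mathrm{ref}}$) are the usual DPO quantities, not values reported in this paper:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$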