Hindsight Preference Optimization Improves LLM Financial Advisory
Researchers propose Hindsight Preference Optimization (HPO), a method that uses observed market outcomes to generate preference pairs for training language models on financial time-series advisory. HPO bridges reinforcement learning and preference alignment: because advisory quality depends on outcomes that are unknown at prediction time, an LLM judge ranks candidate advisories in hindsight, capturing quality dimensions beyond scalar metrics, and those rankings become training signals without human annotation. When HPO was applied to Vision-Language-Model-based S&P 500 equity advisories, a 4B model outperformed its 235B teacher in both accuracy and advisory quality.
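As an illustration of the core idea, the sketch below builds preference pairs from a realized outcome. All names here (`Advisory`, `score_with_hindsight`, `build_preference_pairs`) and the toy directional scoring rule are assumptions for exposition, not the paper's implementation, which uses an LLM judge rather than a scalar score:

```python
# Minimal sketch: turn hindsight (the realized outcome) into DPO-style
# preference pairs. The scoring rule is a toy stand-in for the paper's
# LLM judge.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Advisory:
    text: str                # model-generated advisory
    predicted_return: float  # directional call implied by the advisory

def score_with_hindsight(adv: Advisory, realized_return: float) -> float:
    """Toy hindsight score: advisories whose implied call is closer
    to the realized outcome score higher."""
    return -abs(adv.predicted_return - realized_return)

def build_preference_pairs(candidates, realized_return):
    """Convert hindsight scores into (chosen, rejected) text pairs.
    Each pair where one candidate outscores another becomes a DPO
    training example, with no human annotation required."""
    scored = [(score_with_hindsight(c, realized_return), c) for c in candidates]
    pairs = []
    for (s_a, a), (s_b, b) in combinations(scored, 2):
        if s_a > s_b:
            pairs.append((a.text, b.text))  # (chosen, rejected)
        elif s_b > s_a:
            pairs.append((b.text, a.text))
    return pairs

if __name__ == "__main__":
    candidates = [
        Advisory("Expect a modest rally; overweight.", 0.02),
        Advisory("Expect a sharp drawdown; underweight.", -0.05),
    ]
    # Realized return is observed only after the prediction window closes.
    print(build_preference_pairs(candidates, realized_return=0.018))
```

In the paper's setting, the scalar comparison above is replaced by an LLM judge that sees the realized outcome and ranks full advisories on quality dimensions a single metric cannot capture.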
Key facts
- Hindsight Preference Optimization (HPO) proposed for financial time series advisory
- HPO uses observed outcomes to generate preference pairs for DPO without human annotation (standard DPO objective shown after this list)
- Applied to Vision-Language-Model-based S&P 500 equity advisories
- 4B model outperformed 235B teacher in accuracy and advisory quality
- Bridges reinforcement learning and preference alignment
- Addresses challenge of outcome-dependent advisory quality
- Uses hindsight information unavailable at prediction time
- Published on arXiv
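For reference, the hindsight-ranked pairs can be plugged into the standard DPO objective (Rafailov et al., 2023), which raises the likelihood of the judge-preferred advisory $y_w$ over the rejected one $y_l$ relative to a frozen reference policy; the symbols below ($\beta$, $\pi_{\mathrm{ref}}$) are the usual DPO quantities, not values reported in this paper:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$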