Predicting LLM and Symbolic Program Performance with Few Examples
A recent paper on arXiv (2605.21515) presents a technique for predicting how well a program will perform from only a few in-domain examples. It covers both symbolic programs (such as Python code) and prompt programs executed by an LLM. The authors model each execution as a Bernoulli random variable, a coin flip whose success probability is the program's unknown performance. The prediction combines the observed outcomes with a prior distribution over performance. Using empirical performance priors gathered from a varied dataset, they find that symbolic programs tend to be "all or nothing", whereas prompt programs have a more diffuse prior that includes many nearly-correct programs. This difference explains why a few passing tests can certify a symbolic program but not a prompt program. The study addresses the unreliability of LLM prompting, where a prompt may pass some tests yet fail in deployment.
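The coin-flip model described above can be illustrated with a small Bayesian update. The following is a minimal sketch, not the paper's implementation: the function name `posterior_mean` and both priors are hypothetical, and the prior values are made up for illustration rather than taken from the paper's empirical priors. It assumes a discrete prior over the success probability and computes the expected performance after observing some passing and failing examples.

```python
def posterior_mean(prior, successes, failures):
    """Expected success probability after observing Bernoulli outcomes.

    prior: dict mapping a candidate success probability p to its prior weight.
    (Hypothetical helper for illustration; not from the paper.)
    """
    # Weight each candidate p by the Bernoulli likelihood of the observations.
    weights = {
        p: w * (p ** successes) * ((1 - p) ** failures)
        for p, w in prior.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observations are impossible under this prior")
    return sum(p * w for p, w in weights.items()) / total


# Illustrative "all or nothing" prior: a program is either always right
# or always wrong (assumed values, not the paper's measured prior).
ALL_OR_NOTHING = {0.0: 0.5, 1.0: 0.5}

# Illustrative diffuse prior: every success rate from 0.0 to 1.0 in steps
# of 0.1 is equally likely, so nearly-correct programs are common.
DIFFUSE = {k / 10: 1 / 11 for k in range(11)}

if __name__ == "__main__":
    for name, prior in [("all-or-nothing", ALL_OR_NOTHING), ("diffuse", DIFFUSE)]:
        print(name, posterior_mean(prior, successes=3, failures=0))
```

Under these assumed priors, three passing tests push the all-or-nothing estimate to 1.0, while the diffuse estimate only reaches roughly 0.84, because programs that succeed most of the time also explain three passes well. This mirrors the summary's point that a few passing tests say much more about a symbolic program than about a prompt program.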
Key facts
- arXiv paper 2605.21515
- Predicts performance of symbolic and prompt programs
- Uses coin-flip model (Bernoulli random variable)
- Predicted performance depends on observed outcomes and a prior
- Symbolic programs: all-or-nothing performance
- Prompt programs: diffuse prior with many nearly-correct programs
- Few passing tests certify symbolic but not prompt programs
- LLM prompts can pass tests yet fail in deployment
Entities
Institutions
- arXiv