Predicting LLM and Symbolic Program Performance with Few Examples

ai-technology · 2026-05-23

A recent arXiv paper (2605.21515) presents a method for predicting the performance of both symbolic programs (such as Python code) and prompt-based LLM programs from only a few in-domain examples. The authors model each execution as a coin flip: a Bernoulli random variable whose success probability is the program's unknown performance. A prediction then combines the observed outcomes with a prior distribution over performance. Estimating empirical performance priors from a diverse dataset, they find that symbolic programs follow an "all or nothing" pattern, whereas prompt programs have a more diffuse prior that includes many nearly-correct programs. This difference explains why a few passing tests can certify a symbolic program but not a prompt program. The study addresses the unreliability of LLM prompting, where a prompt may pass some tests yet fail in real-world use.

Key facts

arXiv paper 2605.21515
Predicts performance of symbolic and prompt programs
Uses coin-flip model (Bernoulli random variable)
Performance depends on observed outcomes and prior
Symbolic programs: all-or-nothing performance
Prompt programs: diffuse prior with many nearly-correct programs
Few passing tests certify symbolic but not prompt programs
LLM prompts may pass some tests yet fail in deployment
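
The coin-flip model above can be illustrated with a small sketch. It is not the paper's implementation: the two priors below are made-up discrete distributions chosen only to mimic an "all or nothing" shape and a diffuse shape, and the function names `posterior` and `predicted_accuracy` are hypothetical. The sketch applies Bayes' rule to a Bernoulli likelihood and reports the expected success probability after a few passing tests.

```python
def posterior(prior, passes, fails):
    """Update a discrete prior over success probability p with Bernoulli outcomes.

    prior: dict mapping a candidate success probability p to its prior weight.
    """
    # Likelihood of the observed outcomes for each p is p^passes * (1-p)^fails.
    unnorm = {p: w * p**passes * (1 - p)**fails for p, w in prior.items()}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("observations are impossible under this prior")
    return {p: u / total for p, u in unnorm.items()}


def predicted_accuracy(prior, passes, fails):
    """Posterior mean of p: the predicted chance that the next run succeeds."""
    post = posterior(prior, passes, fails)
    return sum(p * w for p, w in post.items())


# Illustrative priors only (assumptions, not values from the paper).
# Symbolic programs: most mass on "always fails" or "always works".
symbolic_prior = {0.0: 0.45, 0.5: 0.10, 1.0: 0.45}
# Prompt programs: mass spread over many partially and nearly correct programs.
prompt_prior = {0.5: 0.2, 0.7: 0.2, 0.8: 0.2, 0.9: 0.2, 1.0: 0.2}

# Both programs pass the same three tests with no failures.
symbolic_estimate = predicted_accuracy(symbolic_prior, passes=3, fails=0)
prompt_estimate = predicted_accuracy(prompt_prior, passes=3, fails=0)

print(f"symbolic: {symbolic_estimate:.3f}")
print(f"prompt:   {prompt_estimate:.3f}")
```

Under these assumed priors, three passing tests rule out almost every imperfect symbolic program, while many nearly-correct prompt programs remain plausible, so the prompt estimate stays noticeably lower.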
