ARTFEED — Contemporary Art Intelligence

Predicting LLM and Symbolic Program Performance with Few Examples

ai-technology · 2026-05-23

A recent arXiv preprint (2605.21515) presents a technique for forecasting how well a program will perform, whether it is a symbolic program (such as Python code) or a prompt executed by an LLM, using only a handful of in-domain examples. The authors model each execution as a Bernoulli random variable whose success probability is the program's unknown performance. Predictions combine the observed outcomes with a prior distribution over performance. After estimating empirical performance priors from a varied dataset, they find that symbolic programs tend to follow an "all or nothing" pattern, whereas prompt programs have a more diffuse prior with many nearly-correct programs. This difference explains why a few passing tests can certify a symbolic program but not a prompt-based one. The study addresses the unreliability of LLM prompting, where a prompt may pass a handful of tests yet fail in deployment.
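The coin-flip reasoning can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `posterior_mean` and both priors below are hypothetical, with numbers made up only to mimic an "all or nothing" shape and a more diffuse shape. Each prior is a small discrete distribution over the success probability p, and the update is the standard Bayesian one for Bernoulli observations.

```python
def posterior_mean(prior, successes, failures):
    """Expected success probability after observing test outcomes.

    prior: dict mapping a candidate success probability p to its prior weight.
    Each candidate is reweighted by the Bernoulli likelihood
    p**successes * (1 - p)**failures, then the weighted mean of p is returned.
    """
    weights = {
        p: w * (p ** successes) * ((1 - p) ** failures)
        for p, w in prior.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observations are impossible under this prior")
    return sum(p * w for p, w in weights.items()) / total


# Hypothetical priors, invented for illustration (not taken from the paper).
# "All or nothing": most programs either always fail or always succeed.
symbolic_like_prior = {0.0: 0.45, 0.5: 0.10, 1.0: 0.45}
# Diffuse: many programs are nearly, but not fully, correct.
prompt_like_prior = {0.0: 0.10, 0.5: 0.20, 0.8: 0.30, 0.9: 0.30, 1.0: 0.10}

# Same evidence for both kinds of program: three passing tests, no failures.
symbolic_estimate = posterior_mean(symbolic_like_prior, successes=3, failures=0)
prompt_estimate = posterior_mean(prompt_like_prior, successes=3, failures=0)

print(f"symbolic-like prior: {symbolic_estimate:.3f}")
print(f"prompt-like prior:   {prompt_estimate:.3f}")
```

Under these assumed priors, the same three passing tests push the symbolic-like estimate close to 1, while the prompt-like estimate stays noticeably lower because nearly-correct programs also pass three tests fairly often.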

Key facts

  • arXiv paper 2605.21515
  • Predicts performance of symbolic and prompt programs
  • Uses coin-flip model (Bernoulli random variable)
  • Performance depends on observed outcomes and prior
  • Symbolic programs: all-or-nothing performance
  • Prompt programs: diffuse prior with many nearly-correct programs
  • Few passing tests certify symbolic but not prompt programs
  • LLM prompting is unreliable in deployment

Entities

Institutions

  • arXiv
