GLIDE Library Unifies Prediction-Powered Inference for Reliable GenAI Evaluation
GLIDE, a newly released open-source Python library, brings together prediction-powered inference (PPI) techniques for evaluating generative AI and agentic systems. PPI combines expensive human annotations with cheap but biased LLM-as-judge scores to produce debiased estimates with reliable confidence intervals. The library includes several estimators (PPI++, Stratified PPI, Predict-Then-Debias, and Active Statistical Inference) and several samplers (uniform, stratified, active, and cost-optimal), all exposed through a scipy-style API specialized for mean estimation. GLIDE also ships a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study that reports significant annotation savings at equivalent precision. The library is available on GitHub.
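To make the debiasing idea concrete, here is a minimal sketch of the classic PPI mean estimator, written with the Python standard library only. It is not GLIDE's API: the function name `ppi_mean` and its parameters are hypothetical, chosen for illustration. The estimate is the average judge score on the unannotated items plus a "rectifier", the average gap between human labels and judge scores on the annotated items; the interval uses a normal approximation.

```python
from statistics import NormalDist, fmean, variance


def ppi_mean(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Classic PPI estimate of a mean (illustrative sketch, not GLIDE's API).

    y_labeled   : human annotations on the small labeled set
    f_labeled   : judge scores on the same labeled items
    f_unlabeled : judge scores on the large unlabeled set
    """
    if len(y_labeled) != len(f_labeled):
        raise ValueError("y_labeled and f_labeled must have the same length")
    n, big_n = len(y_labeled), len(f_unlabeled)

    # Rectifier: how far the judge is from the human label on each labeled item.
    rectifier = [y - f for y, f in zip(y_labeled, f_labeled)]

    # Judge mean on unlabeled data, corrected by the average judge error.
    estimate = fmean(f_unlabeled) + fmean(rectifier)

    # The two terms come from independent samples, so their variances add.
    std_err = (variance(f_unlabeled) / big_n + variance(rectifier) / n) ** 0.5

    # Two-sided normal-approximation confidence interval.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    est, (lo, hi) = ppi_mean(
        y_labeled=[1, 0, 1, 1],
        f_labeled=[0.8, 0.2, 0.6, 0.9],
        f_unlabeled=[0.5, 0.7, 0.6, 0.4, 0.8],
    )
    print(f"estimate={est:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

If the judge is accurate, the rectifier has low variance and the interval is driven mostly by the large unlabeled set, which is where the annotation savings come from.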
Key facts
- GLIDE is an open-source Python library for prediction-powered inference.
- It unifies PPI estimators: PPI++, Stratified PPI, Predict-Then-Debias, Active Statistical Inference.
- It includes samplers: uniform, stratified, active, cost-optimal.
- API is scipy-style and specialized for mean estimation.
- Comes with a reproducible Monte Carlo validation suite.
- Includes an empirically grounded decision tree for method selection.
- Agentic evaluation case study shows annotation savings at equivalent precision.
- GLIDE is available on GitHub.
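As an illustration of one estimator named in the list above, the following sketch shows the power-tuning idea behind PPI++ for a mean: the judge scores are weighted by a coefficient lambda chosen to minimize the estimator's variance, so an uninformative judge is down-weighted toward the plain human-label average. This is an assumption-laden sketch rather than GLIDE's implementation; `ppi_pp_mean` and `_sample_cov` are hypothetical names, and pooling the judge-score variance and clipping lambda to [0, 1] are choices made for this example.

```python
from statistics import NormalDist, fmean, variance


def _sample_cov(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def ppi_pp_mean(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Power-tuned PPI estimate of a mean (illustrative sketch, not GLIDE's API)."""
    n, big_n = len(y_labeled), len(f_unlabeled)

    # Variance of judge scores, pooled over labeled and unlabeled items
    # (an assumption of this sketch).
    var_f = variance(list(f_labeled) + list(f_unlabeled))

    # Variance-minimizing weight: Cov(Y, f) / ((1 + n/N) * Var(f)).
    lam = 0.0
    if var_f > 0:
        lam = _sample_cov(y_labeled, f_labeled) / ((1 + n / big_n) * var_f)
    lam = min(max(lam, 0.0), 1.0)  # clipping is a choice made in this sketch

    # lam = 0 gives the human-only mean; lam = 1 gives classic PPI.
    estimate = fmean(y_labeled) + lam * (fmean(f_unlabeled) - fmean(f_labeled))

    # Variance of the labeled term plus variance of the unlabeled term.
    resid = [y - lam * f for y, f in zip(y_labeled, f_labeled)]
    std_err = (variance(resid) / n + lam**2 * variance(f_unlabeled) / big_n) ** 0.5

    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err), lam
```

Setting lambda to 1 recovers the classic PPI estimator, and setting it to 0 recovers the average of the human labels alone.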
Entities
Institutions
- arXiv