Reusable Pipeline for Evaluating AI Meeting Summaries
A reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries, has been developed. The pipeline runs in five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone scorers, it treats both the ground truth and the evaluator outputs as typed, persisted artifacts, which supports downstream analysis and statistical evaluation. The system was benchmarked on 114 meetings drawn from city_council, private_data, and whitehouse_press_briefings, yielding 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. gpt-4.1-mini achieved the highest mean accuracy (0.583), while gpt-5.1 led in completeness (0.886) and coverage (0.942). The artifact package has been released publicly.
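The paragraph above describes an architecture rather than an API. The sketch below is a minimal illustration of how such a pipeline could type and persist its artifacts; all names here (SourceDoc, Reference, Candidate, Score, persist, the JSON layout) are hypothetical assumptions, not the paper's actual interfaces.

```python
# Minimal sketch of stages 1-4 as typed, persisted artifacts.
# Hypothetical names and schemas; the paper's actual ones may differ.
from dataclasses import dataclass, asdict
from pathlib import Path
import json

@dataclass
class SourceDoc:                 # stage 1: source intake
    meeting_id: str
    transcript: str

@dataclass
class Reference:                 # stage 2: structured reference construction
    meeting_id: str
    key_points: list[str]        # structured ground truth derived from the source

@dataclass
class Candidate:                 # stage 3: candidate generation
    meeting_id: str
    model: str
    summary: str

@dataclass
class Score:                     # stage 4: structured scoring (one judge run)
    meeting_id: str
    model: str
    accuracy: float
    completeness: float
    coverage: float

def persist(artifact, out_dir: Path) -> Path:
    """Persist a stage's output as a typed JSON artifact so later stages
    and statistical analysis can reload it instead of recomputing it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Illustrative naming only; a real pipeline would add a unique run id
    # to avoid collisions across models and repeated judge runs.
    path = out_dir / f"{artifact.meeting_id}_{type(artifact).__name__}.json"
    path.write_text(json.dumps(asdict(artifact), indent=2))
    return path
```

Stage 5 (reporting) then aggregates the persisted Score records; a sketch of that step follows the key facts below.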
Key facts
- Reusable evaluation pipeline for generative AI applications
- Instantiated for AI meeting summaries
- Five stages: source intake, structured reference construction, candidate generation, structured scoring, reporting
- Treats ground truth and evaluator outputs as typed, persisted artifacts
- Benchmarked on 114 meetings from city_council, private_data, whitehouse_press_briefings
- 340 meeting-model pairs and 680 judge runs (two judge runs per pair; see the reporting sketch after this list)
- Models: gpt-4.1-mini, gpt-5-mini, gpt-5.1
- gpt-4.1-mini achieved highest mean accuracy (0.583)
- gpt-5.1 led in completeness (0.886) and coverage (0.942)
- Public artifact package released
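Since 680 judge runs over 340 meeting-model pairs works out to two runs per pair, the reported figures are per-model means over those runs. Below is a minimal sketch of the reporting stage, assuming judge outputs were persisted as the hypothetical Score JSON files from the earlier sketch:

```python
# Hypothetical stage-5 reporting over persisted judge runs.
import json
import statistics
from collections import defaultdict
from pathlib import Path

def report(scores_dir: Path) -> dict[str, dict[str, float]]:
    """Load persisted judge runs and compute per-model means for each
    metric. Assumes one JSON file per judge run, with the model,
    accuracy, completeness, and coverage fields of the Score sketch."""
    runs = [json.loads(p.read_text()) for p in scores_dir.glob("*_Score.json")]
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run)
    metrics = ("accuracy", "completeness", "coverage")
    return {
        model: {m: statistics.mean(r[m] for r in rs) for m in metrics}
        for model, rs in by_model.items()
    }
```

On the paper's artifacts, a call like report(Path("artifacts/scores")) would yield per-model means of the kind reported above, e.g. completeness 0.886 and coverage 0.942 for gpt-5.1 (the paths and filenames here are assumptions).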