Reusable Pipeline for Evaluating AI Meeting Summaries
A reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries, has been developed. The pipeline runs in five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone scorers, it treats both the ground truth and the evaluator outputs as typed, persisted artifacts, which supports downstream analysis and statistical evaluation. The system was benchmarked on 114 meetings drawn from city_council, private_data, and whitehouse_press_briefings, yielding 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. gpt-4.1-mini achieved the highest mean accuracy (0.583), while gpt-5.1 led in completeness (0.886) and coverage (0.942). The artifact package has been released publicly.
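The paragraph above describes an architecture rather than an API. The sketch below is a minimal illustration of how such a pipeline could type and persist its artifacts; all names here (SourceDoc, Reference, Candidate, Score, persist, the JSON layout) are hypothetical assumptions, not the paper's actual interfaces.

```python
# Minimal sketch of stages 1-4 as typed, persisted artifacts.
# Hypothetical names and schemas; the paper's actual ones may differ.
from dataclasses import dataclass, asdict
from pathlib import Path
import json

@dataclass
class SourceDoc:                 # stage 1: source intake
    meeting_id: str
    transcript: str

@dataclass
class Reference:                 # stage 2: structured reference construction
    meeting_id: str
    key_points: list[str]        # structured ground truth derived from the source

@dataclass
class Candidate:                 # stage 3: candidate generation
    meeting_id: str
    model: str
    summary: str

@dataclass
class Score:                     # stage 4: structured scoring (one judge run)
    meeting_id: str
    model: str
    accuracy: float
    completeness: float
    coverage: float

def persist(artifact, out_dir: Path) -> Path:
    """Persist a stage's output as a typed JSON artifact so later stages
    and statistical analysis can reload it instead of recomputing it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Illustrative naming only; a real pipeline would add a unique run id
    # to avoid collisions across models and repeated judge runs.
    path = out_dir / f"{artifact.meeting_id}_{type(artifact).__name__}.json"
    path.write_text(json.dumps(asdict(artifact), indent=2))
    return path
```

Stage 5 (reporting) then aggregates the persisted Score records; a sketch of that step follows the key facts below.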
Key facts
- Reusable evaluation pipeline for generative AI applications
- Instantiated for AI meeting summaries
- Five stages: source intake, structured reference construction, candidate generation, structured scoring, reporting
- Treats ground truth and evaluator outputs as typed, persisted artifacts
- Benchmarked on 114 meetings from city_council, private_data, whitehouse_press_briefings
- 340 meeting-model pairs and 680 judge runs (two judge runs per pair; see the reporting sketch after this list)
- Models: gpt-4.1-mini, gpt-5-mini, gpt-5.1
- gpt-4.1-mini achieved highest mean accuracy (0.583)
- gpt-5.1 led in completeness (0.886) and coverage (0.942)
- Public artifact package released
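Since 680 judge runs over 340 meeting-model pairs works out to two runs per pair, the reported figures are per-model means over those runs. Below is a minimal sketch of the reporting stage, assuming judge outputs were persisted as the hypothetical Score JSON files from the earlier sketch:

```python
# Hypothetical stage-5 reporting over persisted judge runs.
import json
import statistics
from collections import defaultdict
from pathlib import Path

def report(scores_dir: Path) -> dict[str, dict[str, float]]:
    """Load persisted judge runs and compute per-model means for each
    metric. Assumes one JSON file per judge run, with the model,
    accuracy, completeness, and coverage fields of the Score sketch."""
    runs = [json.loads(p.read_text()) for p in scores_dir.glob("*_Score.json")]
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run)
    metrics = ("accuracy", "completeness", "coverage")
    return {
        model: {m: statistics.mean(r[m] for r in rs) for m in metrics}
        for model, rs in by_model.items()
    }
```

On the paper's artifacts, a call like report(Path("artifacts/scores")) would yield per-model means of the kind reported above, e.g. completeness 0.886 and coverage 0.942 for gpt-5.1 (the paths and filenames here are assumptions).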