LLM Ensemble Wins SemEval-2026 Multi-Turn Generation Task
A heterogeneous ensemble of seven large language models (LLMs) won Task B of SemEval-2026 Task 8: MTRAGEval, achieving a conditioned harmonic mean of 0.7827 and ranking 1st out of 26 teams. The system, developed by RaguTeam, uses a GPT-4o-mini judge to select the best candidate response per instance, drawing candidates from two prompting variants applied across model families of diverse architectures and scales. Ablation studies confirmed that ensemble diversity is essential: the full ensemble consistently outperformed every single model, including the strongest individual baseline (gpt-oss-120b, 0.6390). The team also introduced Meno-Lite-0.1, a 7B domain-adapted model offering a strong cost-performance trade-off. Their analysis of MTRAGEval highlighted annotation limitations and directions for improvement. The code is publicly available on GitHub.
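The selection mechanism described above can be sketched as follows. This is a minimal illustration, not RaguTeam's actual code: the model names, the candidate-generation stub, and the scoring heuristic standing in for the GPT-4o-mini judge call are all hypothetical.

```python
# Hypothetical sketch: per-instance best-candidate selection over a
# heterogeneous ensemble (seven models x two prompting variants).

def generate_candidates(instance, models, prompt_variants):
    """Collect one candidate response per (model, variant) pair.
    Stubbed here; the real system would query each LLM."""
    return [
        {"model": m, "variant": v, "text": f"{m}/{v}: answer to {instance}"}
        for m in models
        for v in prompt_variants
    ]

def judge_score(instance, candidate):
    """Stand-in for the judge. In the real system this would be an
    LLM (GPT-4o-mini) call returning a quality judgment; here it is
    a toy placeholder heuristic."""
    return len(candidate["text"])

def select_best(instance, models, prompt_variants):
    """Pick the single highest-scoring candidate for this instance."""
    candidates = generate_candidates(instance, models, prompt_variants)
    return max(candidates, key=lambda c: judge_score(instance, c))

models = [f"model_{i}" for i in range(7)]   # seven heterogeneous LLMs
variants = ["variant_a", "variant_b"]       # two prompting variants
best = select_best("example question", models, variants)
```

Note the design choice implied by the paper: selection happens per instance, so different models can "win" on different inputs, which is why diversity across families and scales pays off.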
Key facts
- RaguTeam won Task B of SemEval-2026 Task 8: MTRAGEval.
- The system is a heterogeneous ensemble of seven LLMs with two prompting variants.
- A GPT-4o-mini judge selects the best candidate per instance.
- Achieved conditioned harmonic mean of 0.7827.
- Ranked 1st out of 26 teams.
- Strongest baseline was gpt-oss-120b with 0.6390.
- Introduced Meno-Lite-0.1, a 7B domain-adapted model.
- Code is publicly available on GitHub.
Entities
Teams
- RaguTeam
Benchmarks and tasks
- SemEval
- MTRAGEval
Models
- GPT-4o-mini
- Meno-Lite-0.1
Platforms
- GitHub