GAIA-v2-LILT: Multilingual Agent Benchmark Beyond Translation

ai-technology · 2026-04-30

Researchers have developed a refined workflow for adapting English agent benchmarks to other languages, addressing query-answer discrepancies and cultural relevance. They present GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. The workflow combines explicit functional alignment, cultural alignment, and difficulty calibration, verified through automated checks and human review. In their experiments, it raises agent success rates by up to 32.7% over minimally translated versions and brings the closest audited setting to within 3.1% of English performance, although substantial gaps remain in many cases. The study shows that minimal machine translation with limited post-editing can undermine the validity of benchmarks for agentic tasks.

Key facts

GAIA-v2-LILT is a multilingual extension of GAIA covering five non-English languages.
The workflow includes functional alignment, cultural alignment, and difficulty calibration.
The workflow improves agent success rates by up to 32.7% over minimally translated versions.
The closest audited setting is within 3.1% of English performance.
Substantial gaps remain in many cases.
Minimal machine translation (MT) with limited post-editing can break benchmark validity.
Automated checks and human review are used.
The study is published on arXiv (2604.24929).
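
To make the two headline comparisons concrete, the sketch below computes a success rate per benchmark variant, the gain of an audited variant over a minimally translated one, and the remaining gap to English. It is a minimal illustration only: the function names, the outcome lists, and every number in it are hypothetical and not taken from the study, and it reports differences in percentage points, which is an assumption about how the 32.7% and 3.1% figures are measured.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the share of solved tasks as a percentage."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def point_gain(audited: Sequence[bool], minimal: Sequence[bool]) -> float:
    """Gain of the audited variant over the minimal one, in percentage points."""
    return success_rate(audited) - success_rate(minimal)


def gap_to_english(english: Sequence[bool], localized: Sequence[bool]) -> float:
    """Remaining gap between English and a localized variant, in percentage points."""
    return success_rate(english) - success_rate(localized)


# Hypothetical per-task outcomes for one agent (True = task solved).
# These values are invented for illustration and are not results from the study.
english = [True, True, True, False, True, False, True, True, False, True]
minimal = [True, False, False, False, True, False, False, True, False, False]
audited = [True, True, False, False, True, False, True, True, False, True]

print(f"gain over minimal translation: {point_gain(audited, minimal):.1f} points")
print(f"gap to English: {gap_to_english(english, audited):.1f} points")
```

Reporting both quantities side by side matters because a large gain over a weak baseline does not by itself mean the localized benchmark has reached parity with English.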
