LLM Benchmarking Framework for Automated Math Competency Assessment
A study proposes a Human-in-the-Loop benchmarking framework for evaluating heterogeneous LLMs on automated competency-based assessment in secondary-level mathematics, using Nepal's Grade 10 Optional Mathematics curriculum. The multi-provider ensemble pairs the open-weight models Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) with the proprietary models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro). Ground truth was established by two senior mathematics faculty members with high inter-rater reliability (kappa_w = 0.8652). The framework addresses the challenge of manually mapping qualitative competencies in Competency-Based Education.
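The reported reliability figure (kappa_w = 0.8652) is a weighted Cohen's kappa between the two faculty raters. A minimal sketch of the quadratic-weighted variant, assuming ordinal rubric levels (the study's exact weighting scheme is not stated here):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, k):
    """Weighted Cohen's kappa with quadratic penalties.

    a, b: integer ratings in {0..k-1} from two raters; k: number of levels.
    """
    a, b = np.asarray(a), np.asarray(b)
    # Observed agreement matrix O[i, j]: count of items rated i by rater 1
    # and j by rater 2.
    O = np.zeros((k, k))
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix under rater independence, scaled to the same total.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: penalty grows with rating distance.
    idx = np.arange(k)
    W = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1.0, chance-level agreement 0.0, and systematic disagreement goes negative; a value near 0.87 indicates strong agreement on an ordinal rubric.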
Key facts
- Human-in-the-Loop benchmarking framework for LLMs in automated competency assessment
- Uses Grade 10 Optional Mathematics curriculum in Nepal
- Multi-dimensional rubric covering four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation
- Ensemble includes Eagle (Llama 3.1-8B), Orion (Llama 3.3-70B), Nova (Gemini 2.5 Flash), Lyra (Gemini 3 Pro)
- Ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652)
- Published on arXiv (2604.26607)
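The rubric-based comparison against faculty ground truth can be sketched as follows; all model names come from the study, but the scoring scale, scores, and the gap metric below are invented for illustration:

```python
# Hypothetical sketch: compare each ensemble model's per-competency rubric
# scores against the faculty-defined ground truth.
COMPETENCIES = ["Comprehension", "Knowledge", "Operational Fluency",
                "Behavior and Correlation"]

def score_gap(model_scores, ground_truth):
    """Mean absolute gap between a model's rubric scores and ground truth."""
    return sum(abs(model_scores[c] - ground_truth[c])
               for c in COMPETENCIES) / len(COMPETENCIES)

# Illustrative (invented) scores on an assumed 0-4 rubric scale.
ground_truth = {"Comprehension": 3, "Knowledge": 4,
                "Operational Fluency": 2, "Behavior and Correlation": 3}
ensemble = {
    "Eagle": {"Comprehension": 2, "Knowledge": 4,
              "Operational Fluency": 2, "Behavior and Correlation": 2},
    "Lyra":  {"Comprehension": 3, "Knowledge": 4,
              "Operational Fluency": 3, "Behavior and Correlation": 3},
}
# Rank models by closeness to the human ground truth (smaller gap is better).
ranked = sorted(ensemble, key=lambda m: score_gap(ensemble[m], ground_truth))
```

In the study itself the human raters stay in the loop, so a gap metric like this would flag which model outputs need faculty review rather than replace it.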
Entities
Institutions
- arXiv
Locations
- Nepal