LLM Benchmarking Framework for Automated Math Competency Assessment
A study proposes a Human-in-the-Loop benchmarking framework for evaluating heterogeneous LLMs on automated competency-based assessment in secondary-level mathematics, using Nepal's Grade 10 Optional Mathematics curriculum. The multi-provider ensemble pairs the open-weight models Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) with the proprietary models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro). Ground truth was established by two senior mathematics faculty members with high inter-rater reliability (kappa_w = 0.8652). The framework addresses the challenge of manually mapping qualitative competencies in Competency-Based Education.
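The reported reliability figure (kappa_w = 0.8652) is a weighted Cohen's kappa between the two faculty raters. A minimal sketch of the quadratic-weighted variant, assuming ordinal rubric levels (the study's exact weighting scheme is not stated here):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, k):
    """Weighted Cohen's kappa with quadratic penalties.

    a, b: integer ratings in {0..k-1} from two raters; k: number of levels.
    """
    a, b = np.asarray(a), np.asarray(b)
    # Observed agreement matrix O[i, j]: count of items rated i by rater 1
    # and j by rater 2.
    O = np.zeros((k, k))
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix under rater independence, scaled to the same total.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: penalty grows with rating distance.
    idx = np.arange(k)
    W = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1.0, chance-level agreement 0.0, and systematic disagreement goes negative; a value near 0.87 indicates strong agreement on an ordinal rubric.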
Key facts
- Human-in-the-Loop benchmarking framework for LLMs in automated competency assessment
- Uses Grade 10 Optional Mathematics curriculum in Nepal
- Multi-dimensional rubric covering four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation
- Ensemble includes Eagle (Llama 3.1-8B), Orion (Llama 3.3-70B), Nova (Gemini 2.5 Flash), Lyra (Gemini 3 Pro)
- Ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652)
- Published on arXiv (2604.26607)
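The rubric-based comparison against faculty ground truth can be sketched as follows; all model names come from the study, but the scoring scale, scores, and the gap metric below are invented for illustration:

```python
# Hypothetical sketch: compare each ensemble model's per-competency rubric
# scores against the faculty-defined ground truth.
COMPETENCIES = ["Comprehension", "Knowledge", "Operational Fluency",
                "Behavior and Correlation"]

def score_gap(model_scores, ground_truth):
    """Mean absolute gap between a model's rubric scores and ground truth."""
    return sum(abs(model_scores[c] - ground_truth[c])
               for c in COMPETENCIES) / len(COMPETENCIES)

# Illustrative (invented) scores on an assumed 0-4 rubric scale.
ground_truth = {"Comprehension": 3, "Knowledge": 4,
                "Operational Fluency": 2, "Behavior and Correlation": 3}
ensemble = {
    "Eagle": {"Comprehension": 2, "Knowledge": 4,
              "Operational Fluency": 2, "Behavior and Correlation": 2},
    "Lyra":  {"Comprehension": 3, "Knowledge": 4,
              "Operational Fluency": 3, "Behavior and Correlation": 3},
}
# Rank models by closeness to the human ground truth (smaller gap is better).
ranked = sorted(ensemble, key=lambda m: score_gap(ensemble[m], ground_truth))
```

In the study itself the human raters stay in the loop, so a gap metric like this would flag which model outputs need faculty review rather than replace it.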
Entities
Institutions
- arXiv
Locations
- Nepal