ARTFEED — Contemporary Art Intelligence

LLM Benchmarking Framework for Automated Math Competency Assessment

other · 2026-05-01

A study proposes a Human-in-the-Loop benchmarking framework for evaluating heterogeneous LLMs on automated competency-based assessment in secondary-level mathematics, using Nepal's Grade 10 curriculum as the testbed. The multi-provider ensemble comprises the open-weight models Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) and the proprietary models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro). Ground truth was established by two senior mathematics faculty members, with high inter-rater reliability (weighted kappa, kappa_w = 0.8652). The framework targets the labor-intensive manual process of qualitative competency mapping in Competency-Based Education.

Key facts

  • Human-in-the-Loop benchmarking framework for LLMs in automated competency assessment
  • Uses Grade 10 Optional Mathematics curriculum in Nepal
  • Multi-dimensional rubric spanning four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation
  • Ensemble includes Eagle (Llama 3.1-8B), Orion (Llama 3.3-70B), Nova (Gemini 2.5 Flash), Lyra (Gemini 3 Pro)
  • Ground truth defined by two senior mathematics faculty members (weighted kappa, kappa_w = 0.8652)
  • Published on arXiv (arXiv:2604.26607)
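
The inter-rater reliability figure above (kappa_w = 0.8652) refers to a weighted Cohen's kappa between the two faculty raters. As an illustration only, the sketch below computes a quadratic-weighted kappa in pure Python, assuming ordinal rubric levels; the study's exact weighting scheme and rubric scale are not specified here, so the category labels and sample ratings are hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Quadratic-weighted Cohen's kappa for two raters over ordinal categories."""
    n = len(rater_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Observed agreement matrix: counts of (rater_a, rater_b) label pairs.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[idx[a]][idx[b]] += 1

    # Marginal totals for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Quadratic disagreement weights: 0 on the diagonal, growing with distance.
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]

    # Weighted observed vs. chance-expected disagreement.
    num = sum(w[i][j] * observed[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * row[i] * col[j] / n for i in range(k) for j in range(k))
    return 1.0 - num / den


# Hypothetical rubric levels 0-3; perfect agreement yields kappa = 1.0.
levels = [0, 1, 2, 3]
print(quadratic_weighted_kappa([0, 1, 2, 3, 1], [0, 1, 2, 3, 1], levels))
```

Quadratic weighting penalizes large ordinal disagreements (e.g., scoring 0 vs. 3) far more than adjacent ones, which is the usual choice for rubric-based grading.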

Entities

Institutions

  • arXiv

Locations

  • Nepal

Sources