LLM Self-Referential Validation: Generative-Evaluative Agreement Criterion
Generative-Evaluative Agreement (GEA) is a new validity criterion for LLM-enabled adaptive assessments. GEA measures whether an LLM's scoring function recovers the skill levels that its generative function was instructed to produce. It addresses the self-referential validation loop that arises when the same LLM generates items, simulates responses, and scores them. In the first direct measurement on a two-stage adaptive assessment, the scoring function recovered roughly half of the intended variance (r = 0.698, so r² ≈ 0.49) and showed a systematic positive bias. GEA was strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and overestimation of low-skill responses inflated scores near the routing threshold. The work proposes granular, skill-decomposed rubrics as the principal mechanism for strengthening GEA, alongside additional mitigation strategies.
Key facts
- Generative-Evaluative Agreement (GEA) is a new validity criterion for LLM-enabled adaptive assessments.
- GEA measures whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce.
- The validation loop is self-referential when the same LLM generates items, simulates responses, and scores them.
- The first direct measurement of GEA, on a two-stage adaptive assessment, recovered roughly half the intended variance (r = 0.698, so r² ≈ 0.49).
- The scores showed a systematic positive bias: the scoring function tended to rate responses above the intended skill level.
- GEA was strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills.
- Low-skill overestimation inflated scores near the routing threshold.
- Granular, skill-decomposed rubrics are proposed as the principal mechanism for strengthening GEA.
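The quantities in the facts above (a correlation r, the share of intended variance recovered, and a positive bias) can be illustrated with a minimal Python sketch. It assumes GEA is summarized as the Pearson correlation between the intended skill levels and the scores the LLM assigns, with r² as the variance recovered and the mean signed difference as the bias. The function names `pearson_r` and `gea_summary` and the toy data are hypothetical and are not taken from the source.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def gea_summary(intended, scored):
    """Summarize agreement between intended skill levels and assigned scores.

    Assumption: agreement is reported as Pearson r, r squared is read as the
    share of intended variance recovered, and bias is the mean of
    (scored - intended). Positive bias means the scorer overestimates.
    """
    r = pearson_r(intended, scored)
    return {
        "r": r,
        "variance_recovered": r ** 2,
        "mean_bias": mean(s - i for s, i in zip(scored, intended)),
    }


# Hypothetical toy data: the two lowest intended levels are scored too high,
# mimicking low-skill overestimation.
intended = [1, 2, 3, 4, 5]
scored = [2.5, 2.5, 3.5, 4.0, 5.0]
summary = gea_summary(intended, scored)

# The reported r = 0.698 corresponds to about 49% of variance recovered.
reported_variance_recovered = 0.698 ** 2
```

Running `gea_summary` separately on each skill's subset of items would give the per-skill comparison described above (syntactically verifiable versus design-level skills), again under the assumption that per-skill GEA is a per-skill correlation.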
Entities
—