ARTFEED — Contemporary Art Intelligence

RIFT Taxonomy Introduces Framework for Diagnosing Rubric Failures in LLM Evaluation

ai-technology · 2026-04-22

A new framework, RIFT (RubrIc Failure mode Taxonomy), offers a systematic way to diagnose failures in rubric-based evaluation of large language models. RIFT organizes eight distinct failure types under three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. It addresses a notable gap in LLM evaluation methodology: prior approaches offered little means of diagnosing rubric problems beyond aggregated scoring signals. The taxonomy was developed using grounded theory, through iterative annotation of rubrics drawn from five diverse data sources spanning domains such as instruction following, code generation, creative writing, and deep research. Further details are available in the arXiv preprint 2604.01375v2.
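The category/mode structure described above can be sketched as a minimal data model. Note that the specific failure-mode names and the `annotate` helper below are hypothetical illustrations for this sketch; only the three high-level categories come from the taxonomy itself.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """RIFT's three high-level categories (per the preprint)."""
    RELIABILITY = "Reliability Failure"
    CONTENT_VALIDITY = "Content Validity Failure"
    CONSEQUENTIAL_VALIDITY = "Consequential Validity Failure"


@dataclass(frozen=True)
class FailureMode:
    name: str            # illustrative name, not one of the paper's eight modes
    category: Category
    description: str


def annotate(criterion: str, mode: FailureMode) -> dict:
    """Hypothetical annotation pass: tag a rubric criterion with a failure mode."""
    return {
        "criterion": criterion,
        "failure_mode": mode.name,
        "category": mode.category.value,
    }


# Hypothetical example mode under the Reliability category.
ambiguous = FailureMode(
    name="ambiguous-wording",
    category=Category.RELIABILITY,
    description="Criterion wording admits conflicting grader readings.",
)

record = annotate("Response is well-organized.", ambiguous)
print(record["category"])  # Reliability Failure
```

Keeping each annotated criterion tied to both a specific mode and its parent category is what lets failures be diagnosed individually rather than inferred from aggregated scores.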

Key facts

  • RIFT taxonomy categorizes eight rubric failure modes
  • Failure modes organized into three high-level categories: Reliability, Content Validity, Consequential Validity
  • Developed using grounded theory through iterative annotation
  • Based on rubrics from five diverse data sources
  • Addresses gap in diagnosing rubric failures from aggregated signals
  • Applies to LLM benchmarks and training pipelines for open-ended tasks
  • Covers domains: instruction following, code generation, creative writing, deep research
  • arXiv preprint identifier: 2604.01375v2

Entities

Institutions

  • arXiv

Sources