LLM Intent Fidelity Evaluation Framework Reveals Structural-Content Split

ai-technology · 2026-05-16

A new evaluation framework for large language models (LLMs) distinguishes between reproducing the structural form of a response and preserving the specific intent behind the prompt. The study analyzed 2,880 outputs across three languages, three task domains, and six LLMs, using structured prompt ablation to measure structural recovery and intent fidelity along several semantic dimensions. It finds a consistent split between structure and intent: among Chinese outputs with a perfect holistic alignment score (GA=5), 25.7% still showed intent deficits, and among English outputs with GA=5 the share was 58.6%. Human evaluation confirmed that these split-zone outputs reflect genuine quality problems and that the dimensional fidelity scores agree with human judgments.

Key facts

Proposes dimension-level intent fidelity evaluation framework
Applied structured prompt ablation across 2,880 outputs
Covered three languages, three task domains, six LLMs
Measures structural recovery and intent fidelity separately
25.7% of Chinese outputs with GA=5 had intent deficits
58.6% of English outputs with GA=5 had intent deficits
Human evaluation confirmed split-zone outputs are genuine deficits
Dimensional fidelity scores align with human judgments
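
The split-zone percentages above can be read as a conditional rate: among outputs that receive GA=5, the fraction that still show a deficit on at least one intent dimension. The Python sketch below illustrates that calculation only; it is not the study's code. The record layout, the dimension names, the 1-5 scale for dimension scores, and the deficit threshold are all assumptions made for illustration, and the sample records are toy values, not data from the study.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Output:
    language: str               # e.g. "en" or "zh"
    ga: int                     # holistic alignment score; 5 is perfect
    dim_scores: Dict[str, int]  # hypothetical per-dimension fidelity scores


def has_intent_deficit(output: Output, threshold: int = 5) -> bool:
    """Assumed rule: a deficit is any dimension scoring below the threshold."""
    return any(score < threshold for score in output.dim_scores.values())


def split_zone_rate(outputs: List[Output], language: str,
                    top_ga: int = 5) -> Optional[float]:
    """Share of top-GA outputs in one language that still show a deficit."""
    top = [o for o in outputs if o.language == language and o.ga == top_ga]
    if not top:
        return None  # no top-GA outputs, so the rate is undefined
    return sum(has_intent_deficit(o) for o in top) / len(top)


# Toy records for illustration only; dimension names are hypothetical.
sample = [
    Output("en", 5, {"goal": 5, "constraints": 5}),
    Output("en", 5, {"goal": 5, "constraints": 3}),
    Output("en", 4, {"goal": 2, "constraints": 2}),
    Output("zh", 5, {"goal": 5, "constraints": 5}),
]

en_rate = split_zone_rate(sample, "en")
zh_rate = split_zone_rate(sample, "zh")
```

Conditioning on GA=5 is what keeps the figure meaningful: an output with a low holistic score is already flagged, so only top-scored outputs can conceal an intent deficit.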

Entities

—

Sources

arXiv cs.AI — 2026-05-16