Token-Level Data Valuation Framework for LLMs
A novel framework for valuing data in Large Language Models (LLMs) shifts from traditional static row-count accounting to a utility-driven pricing model. The methodology is structured in three tiers: token-level information density metrics built on Shannon entropy and Data Quality Scores; measurement of empirical training gain via influence functions, proxy-model approaches, and Data Shapley values; and cryptographic verification via hash-based commitments, Merkle trees, and a tamper-evident training ledger. Experiments in instruction following, mathematical reasoning, and code summarization show that the proxy-based empirical gain closely tracks realized utility, achieving near-perfect ranking consistency.
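The first tier, token-level information density, can be illustrated with a minimal sketch. The entropy computation below is the standard Shannon formula over a sequence's empirical token distribution; the `density_score` helper and its `max_entropy_bits` normalization constant are hypothetical illustrations, not the paper's actual Data Quality Score.

```python
import math
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits per token) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def density_score(tokens: list[str], max_entropy_bits: float = 16.0) -> float:
    """Hypothetical density score: entropy clipped and normalized to [0, 1]."""
    return min(token_entropy(tokens) / max_entropy_bits, 1.0)
```

A degenerate sequence of one repeated token scores 0 bits, while a sequence with all-distinct tokens attains the maximum `log2(n)` bits, so higher scores flag more informative text under this simple proxy.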
Key facts
- Traditional data valuation methods based on 'row-count × quality coefficient' fail for LLMs.
- The framework uses token-level information density metrics with Shannon entropy and Data Quality Scores.
- Empirical training gain is measured via influence functions, proxy model strategies, and Data Shapley values.
- Cryptographic verifiability uses hash-based commitments, Merkle trees, and a tamper-evident training ledger.
- Experimental validation covers three domains: instruction following, mathematical reasoning, and code summarization.
- Proxy-based empirical gain achieves near-perfect ranking alignment with realized utility.
- The paper is published on arXiv with ID 2604.22893.
- The framework transitions from static accounting to utility-based pricing.