Token-Level Data Valuation Framework for LLMs
A novel framework for valuing data in Large Language Models (LLMs) shifts from traditional static row-count accounting to a utility-driven pricing model. The methodology is structured in three tiers: token-level information density metrics built on Shannon entropy and Data Quality Scores; measurement of empirical training gain via influence functions, proxy-model approaches, and Data Shapley values; and cryptographic verification via hash-based commitments, Merkle trees, and a tamper-evident training ledger. Experiments in instruction following, mathematical reasoning, and code summarization show that the proxy-based empirical gain closely tracks realized utility, achieving near-perfect ranking consistency.
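The first tier, token-level information density, can be illustrated with a minimal sketch. The entropy computation below is the standard Shannon formula over a sequence's empirical token distribution; the `density_score` helper and its `max_entropy_bits` normalization constant are hypothetical illustrations, not the paper's actual Data Quality Score.

```python
import math
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits per token) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def density_score(tokens: list[str], max_entropy_bits: float = 16.0) -> float:
    """Hypothetical density score: entropy clipped and normalized to [0, 1]."""
    return min(token_entropy(tokens) / max_entropy_bits, 1.0)
```

A degenerate sequence of one repeated token scores 0 bits, while a sequence with all-distinct tokens attains the maximum `log2(n)` bits, so higher scores flag more informative text under this simple proxy.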
Key facts
- Traditional data valuation methods based on 'row-count × quality coefficient' fail for LLMs.
- The framework uses token-level information density metrics with Shannon entropy and Data Quality Scores.
- Empirical training gain is measured via influence functions, proxy model strategies, and Data Shapley values.
- Cryptographic verifiability uses hash-based commitments, Merkle trees, and a tamper-evident training ledger.
- Experimental validation covers three domains: instruction following, mathematical reasoning, and code summarization.
- Proxy-based empirical gain achieves near-perfect ranking alignment with realized utility.
- The paper is published on arXiv with ID 2604.22893.
- The framework transitions from static accounting to utility-based pricing.