Data Selection Reformulated as Sequential Decision-Making

other · 2026-06-01

A new theoretical framework casts data selection as a sequential decision-making problem in which the optimal selection sequence is obtained by dynamic programming. Data values are interpreted as encodings of this optimal sequence, which unifies existing methods such as Data Shapley as myopic linear approximations. The analysis characterizes how selection optimality degrades with utility curvature in the submodular setting, which explains the limitations of current approximations. To bridge theory and practice, the authors introduce an efficient bipartite graph-based surrogate that preserves submodular structure, enabling scalable greedy selection with provable guarantees. The approach is validated experimentally on classical tasks.
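
For orientation, the objects named in the summary can be written out in standard notation. This is a sketch built from textbook definitions, not the paper's own formulation: the symbols U (utility), D (dataset of size n), k (selection budget), V_t (value function), and the curvature κ are all assumptions introduced here, and the paper's exact statements may differ.

```latex
% Assumed setup: utility U : 2^D \to \mathbb{R}, dataset D with |D| = n, budget k.

% Sequential selection as a finite-horizon dynamic program (Bellman recursion):
V_k(S) = U(S), \qquad
V_t(S) = \max_{i \in D \setminus S} V_{t+1}\bigl(S \cup \{i\}\bigr), \quad t = k-1, \dots, 0 .

% Data Shapley value of point i (standard definition):
\phi_i = \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}}
         \binom{n-1}{|S|}^{-1} \bigl[ U(S \cup \{i\}) - U(S) \bigr] .

% Total curvature of a monotone submodular U (standard definition):
\kappa = 1 - \min_{i \in D} \frac{U(D) - U(D \setminus \{i\})}{U(\{i\})} .

% Classical curvature-dependent guarantee for greedy selection of k points,
% with S^\ast the optimal size-k set:
U(S_{\mathrm{greedy}}) \ge \frac{1}{\kappa}\bigl(1 - e^{-\kappa}\bigr)\, U(S^\ast) .
```

In this classical bound the factor tends to 1 as κ approaches 0 (a nearly additive utility) and falls to 1 - 1/e at κ = 1, which matches the summary's point that selection optimality degrades as utility curvature grows.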

Key facts

Data selection is reformulated as a sequential decision-making problem
Optimal selection sequence arises from dynamic programming
Data values are encodings of the optimal sequence
Data Shapley is reinterpreted as a myopic linear approximation
Selection optimality degrades with utility curvature under submodularity
A bipartite graph-based surrogate enables scalable greedy selection
The surrogate preserves submodular structure with provable guarantees
Experiments are conducted on classical tasks
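
To make the greedy-selection idea in the facts above concrete, here is a minimal sketch that uses weighted coverage on a bipartite graph as the surrogate utility. This is an assumption for illustration, not the paper's construction: the summary does not specify the surrogate, and the names `coverage_utility`, `greedy_select`, `edges`, and `weights` are hypothetical. Weighted coverage is a standard monotone submodular function, so plain greedy selection carries the classical (1 - 1/e) approximation guarantee.

```python
from typing import Dict, List, Set


def coverage_utility(selected: List[str],
                     edges: Dict[str, Set[int]],
                     weights: Dict[int, float]) -> float:
    """Total weight of right-side nodes adjacent to at least one selected point."""
    covered: Set[int] = set()
    for item in selected:
        covered |= edges[item]
    return sum(weights[t] for t in covered)


def greedy_select(edges: Dict[str, Set[int]],
                  weights: Dict[int, float],
                  budget: int) -> List[str]:
    """Repeatedly pick the point with the largest marginal coverage gain."""
    selected: List[str] = []
    covered: Set[int] = set()
    remaining = set(edges)
    for _ in range(min(budget, len(edges))):
        best_item, best_gain = None, 0.0
        # Sorted iteration makes tie-breaking deterministic.
        for item in sorted(remaining):
            gain = sum(weights[t] for t in edges[item] - covered)
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:
            break  # no remaining point adds positive utility
        selected.append(best_item)
        covered |= edges[best_item]
        remaining.remove(best_item)
    return selected


# Hypothetical toy graph: candidate points on the left, weighted targets on the right.
edges = {"a": {1, 2}, "b": {2, 3}, "c": {3}, "d": {4}}
weights = {1: 1.0, 2: 2.0, 3: 2.0, 4: 0.5}

chosen = greedy_select(edges, weights, budget=2)
print(chosen, coverage_utility(chosen, edges, weights))
```

On this toy graph the first pick is "b" (gain 4.0) and the second is "a" (marginal gain 1.0, since target 2 is already covered), which shows the diminishing returns that submodularity describes.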

Entities

—

Sources

arXiv cs.AI — 2026-06-01