OPT-BENCH: Quality-Aware RLVR Framework for NP-Hard Optimization in LLMs
A recent arXiv preprint introduces OPT-BENCH, presented as the first comprehensive framework for training and evaluating Large Language Models (LLMs) on NP-hard optimization problems using quality-aware Reinforcement Learning with Verifiable Rewards (RLVR). The framework addresses a gap in existing benchmarks, which measure only correctness and ignore optimality, i.e., how close a solution comes to the best achievable under the problem's constraints. OPT-BENCH consists of three components: a scalable training setup with instance generators, quality verifiers, and optimal baselines across ten tasks; a 1,000-instance benchmark that scores feasibility via Success Rate and solution quality via Quality Ratio; and quality-aware rewards that allow continued improvement beyond binary correctness. Training used Qwen2.5-7B-Instruct-1M with 15,000 examples. The paper is available on arXiv under ID 2605.08905.
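The core idea of a quality-aware reward, as opposed to a binary pass/fail signal, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names, the zero reward for infeasible solutions, and the exact ratio shaping are assumptions.

```python
def quality_aware_reward(solution_value, optimal_value, feasible, minimize=True):
    """Sketch of a quality-aware reward: infeasible solutions get 0, and
    feasible solutions are scored by their ratio to an optimal baseline,
    so a better feasible solution always earns a strictly higher reward."""
    if not feasible:
        return 0.0
    if minimize:
        # For minimization problems, optimal/found lies in (0, 1]; 1.0 = optimal.
        return optimal_value / solution_value
    # For maximization problems, found/optimal lies in [0, 1]; 1.0 = optimal.
    return solution_value / optimal_value


def binary_reward(solution_value, optimal_value, feasible):
    """Binary-correctness baseline for contrast: only exact optimality scores,
    so near-optimal solutions receive no learning signal at all."""
    return 1.0 if feasible and solution_value == optimal_value else 0.0
```

Under the quality-aware scheme, a feasible tour of cost 120 against an optimum of 100 earns a reward of about 0.83 rather than 0, giving the policy a gradient toward better solutions even before it reaches optimality.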
Key facts
- OPT-BENCH is the first framework for training and evaluating LLMs on NP-hard optimization problems with quality-aware RLVR.
- Existing benchmarks evaluate only correctness, not optimality.
- The framework includes instance generators, quality verifiers, and optimal baselines across 10 tasks.
- The benchmark comprises 1,000 instances.
- Success Rate measures feasibility; Quality Ratio measures quality.
- Quality-aware rewards enable continuous improvement beyond binary correctness.
- Training used Qwen2.5-7B-Instruct-1M with 15,000 examples.
- The paper is published on arXiv with ID 2605.08905.
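The two benchmark metrics above could be aggregated as in the sketch below. The per-instance record fields (`feasible`, `value`, `optimal`) and the convention of counting infeasible instances as a quality ratio of 0 are illustrative assumptions, not details from the paper.

```python
def success_rate(results):
    """Success Rate: fraction of instances with a feasible
    (constraint-satisfying) solution."""
    return sum(1 for r in results if r["feasible"]) / len(results)


def mean_quality_ratio(results, minimize=True):
    """Quality Ratio: average solution quality relative to the optimal
    baseline; infeasible instances contribute 0 in this sketch."""
    ratios = []
    for r in results:
        if not r["feasible"]:
            ratios.append(0.0)
        elif minimize:
            ratios.append(r["optimal"] / r["value"])
        else:
            ratios.append(r["value"] / r["optimal"])
    return sum(ratios) / len(ratios)
```

Keeping the two metrics separate matters: a model can reach a high Success Rate with mediocre solutions, and the Quality Ratio exposes that gap.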
Entities
Institutions
- arXiv