ARTFEED — Contemporary Art Intelligence

GR-Ben: New Benchmark for Evaluating Process Reward Models Beyond Math

other · 2026-05-06

Researchers introduced GR-Ben, a process-level benchmark for evaluating process reward models (PRMs) across two primary reasoning domains—science and logic—and nine subdomains. Existing benchmarks focus mainly on mathematical reasoning and therefore fail to assess how well PRMs detect errors in more diverse scenarios. GR-Ben tests 22 models, including both PRMs and large language models (LLMs), and reveals that error-detection ability declines in non-mathematical domains. The benchmark addresses the need for comprehensive evaluation, since LLMs often produce flawed intermediate reasoning steps. The paper is available on arXiv.

Key facts

  • GR-Ben is a process-level benchmark for PRMs.
  • It covers two primary reasoning domains: science and logic.
  • It includes nine subdomains.
  • 22 models were tested, including PRMs and LLMs.
  • Existing benchmarks focus primarily on mathematical reasoning.
  • Error-detection ability declines in non-mathematical domains.
  • The benchmark is motivated by real-world needs: detecting flawed intermediate reasoning steps outside mathematics.
  • The paper is published on arXiv.
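To make the process-level evaluation concrete, here is a minimal sketch of the kind of task such a benchmark poses: given per-step scores from a reward model, locate the first erroneous reasoning step and compare it against an annotated label. Note that this is an illustrative assumption, not the paper's actual protocol; the function names, the threshold, and the accuracy metric are all hypothetical.

```python
# Hypothetical sketch of process-level error detection, the kind of task a
# benchmark like GR-Ben poses to PRMs. The scoring scheme, threshold, and
# metric below are illustrative assumptions, not the paper's actual setup.

def first_error_step(step_scores, threshold=0.5):
    """Return the index of the first step whose reward score falls below
    the threshold, or None if every step passes."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

def error_location_accuracy(predictions, gold_labels):
    """Fraction of reasoning chains where the predicted first-error step
    matches the annotated one (None means 'no error')."""
    hits = sum(p == g for p, g in zip(predictions, gold_labels))
    return hits / len(gold_labels)

# Toy example: per-step scores from a stand-in reward model.
chains = [
    [0.9, 0.8, 0.3, 0.7],  # first error at step index 2
    [0.95, 0.9, 0.85],     # no error in this chain
]
gold = [2, None]
preds = [first_error_step(scores) for scores in chains]
print(error_location_accuracy(preds, gold))  # -> 1.0 on this toy data
```

The reported finding, that error detection degrades outside mathematics, would show up in a setup like this as lower location accuracy on science and logic chains than on math chains.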

Entities

Institutions

  • arXiv

Sources