TERMS-Bench: A New Benchmark for LLM Negotiation Agents
Researchers have developed TERMS-Bench (Testbed for Economic Reasoning in Multi-turn Strategy), a Bayesian-game framework for evaluating large language model (LLM) negotiation agents that moves beyond coarse metrics such as deal rate. Negotiation is an essential economic process characterized by multi-turn interaction, hidden preferences, strategic communication, and binding constraints, yet it is hard to evaluate: unlike mathematics or coding, it has no inherent verifier. Existing assessments typically rely on LLM-vs.-LLM interaction or aggregate outcomes, which can mask specific failures. TERMS-Bench addresses this by making the environment itself the verifier, fully specifying the counterpart's latent type, policy, and payoff structure. The framework is instantiated in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator, enabling precise diagnosis of where a negotiation agent fails. The paper is available on arXiv under identifier 2605.13909.
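The environment-as-verifier idea can be illustrated with a minimal sketch. The class below is illustrative only (the names, numbers, and buyer policy are assumptions, not taken from the paper): the buyer's reservation price plays the role of the hidden latent type, a scripted counter-offer rule plays the role of the simulator policy, and because the evaluator sees both, it can score the surplus split rather than just whether a deal occurred.

```python
import random


class PriceNegotiationEnv:
    """Illustrative bilateral price negotiation environment.

    The buyer's reservation price (its latent 'type') is hidden from the
    negotiating agent but fully visible to the evaluator, so the
    environment itself can verify outcomes. All values here are
    hypothetical, chosen only to make the sketch runnable.
    """

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Hidden type: the buyer's maximum willingness to pay,
        # drawn from a prior known to the evaluator.
        self._buyer_reservation = rng.uniform(60.0, 100.0)
        # The seller agent's own cost is public to the agent.
        self.seller_cost = 50.0

    def buyer_policy(self, offer):
        """Scripted counterpart (simulator policy): accept any offer at or
        below the reservation price, otherwise counter at 90% of the offer."""
        if offer <= self._buyer_reservation:
            return ("accept", offer)
        return ("counter", 0.9 * offer)

    def evaluate(self, deal_price):
        """Evaluator view: because the hidden type is known here, we can
        score each side's surplus, not merely whether a deal happened."""
        if deal_price is None:
            return {"deal": False, "seller_surplus": 0.0, "buyer_surplus": 0.0}
        return {
            "deal": True,
            "seller_surplus": deal_price - self.seller_cost,
            "buyer_surplus": self._buyer_reservation - deal_price,
        }


env = PriceNegotiationEnv(seed=1)
action, value = env.buyer_policy(120.0)  # too high: the buyer counters
result = env.evaluate(deal_price=62.0)
```

In this toy setup the agent only ever observes the buyer's accept/counter messages, while `evaluate` uses the hidden reservation price to attribute surplus to each side, which is the sense in which the environment, not another LLM, acts as the verifier.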
Key facts
- TERMS-Bench is a Bayesian-game framework for evaluating LLM negotiation agents.
- It goes beyond deal rate to diagnose specific failures.
- Negotiation involves multi-turn interaction, hidden preferences, strategic communication, and binding constraints.
- Existing evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes.
- TERMS-Bench makes the environment the verifier by specifying the counterpart's latent type, policy, and payoff structure.
- It is instantiated in bilateral price negotiation.
- The counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator.
- The paper is available on arXiv with ID 2605.13909.
Entities
Institutions
- arXiv