FinChain: Verifiable Chain-of-Thought Benchmark for Financial Reasoning
Researchers have introduced FinChain, the first benchmark designed for verifiable Chain-of-Thought (CoT) evaluation in finance. It covers 58 topics across 12 financial domains and pairs parameterized symbolic templates with executable Python code, which allows scalable, contamination-free data generation. The accompanying CHAINEVAL metric assesses both the accuracy of the final answer and the consistency of the reasoning at each step. An evaluation of 26 leading LLMs indicates that even the most advanced models show significant shortcomings in multi-step symbolic reasoning.
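To make the idea of a parameterized symbolic template concrete, here is a minimal sketch in Python. It is not taken from FinChain: the compound-interest scenario, the `Instance` class, and the `compound_interest_template` function are hypothetical names chosen for illustration. The sketch shows how a seed can select fresh parameter values while the same executable code produces the question, every intermediate step, and the final answer, so each instance can be checked by machine.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One generated problem with its machine-checkable solution trace."""
    question: str
    steps: list[tuple[str, float]]  # (description, intermediate value)
    answer: float


def compound_interest_template(seed: int) -> Instance:
    """Hypothetical template: sample parameters, then compute every step."""
    rng = random.Random(seed)  # seeded, so each instance is reproducible
    principal = rng.randrange(1_000, 10_001, 500)
    rate_pct = rng.randrange(2, 9)
    years = rng.randrange(2, 6)

    # The same code that generates the problem also computes the gold steps.
    growth = 1 + rate_pct / 100
    factor = growth ** years
    final_value = round(principal * factor, 2)
    interest = round(final_value - principal, 2)

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. How much interest is earned after {years} years?"
    )
    steps = [
        ("annual growth factor", growth),
        ("growth factor over the full term", factor),
        ("final account value", final_value),
        ("interest earned", interest),
    ]
    return Instance(question, steps, interest)


if __name__ == "__main__":
    inst = compound_interest_template(seed=0)
    print(inst.question)
    for description, value in inst.steps:
        print(f"  {description}: {value}")
```

Because the parameter values are sampled rather than fixed, new instances can be produced on demand, which is the property that makes this style of generation resistant to training-data contamination.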
Key facts
- FinChain is the first benchmark for verifiable Chain-of-Thought evaluation in finance.
- It covers 58 topics across 12 financial domains.
- It uses parameterized symbolic templates with executable Python code.
- This design enables fully machine-verifiable reasoning and contamination-free data generation.
- CHAINEVAL is a dynamic alignment measure that evaluates both the final answer and the step-level reasoning.
- 26 leading LLMs were evaluated.
- Frontier LLMs show clear limitations in multi-step symbolic reasoning.
- Existing datasets like FinQA and ConvFinQA neglect intermediate reasoning steps.
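The difference between final-answer scoring and step-level scoring can be illustrated with a small Python sketch. This is not the CHAINEVAL definition, which this summary does not specify; it is a simplified stand-in that greedily matches numeric intermediate values in order. The function name `score_chain`, the tolerance, and the sample values are all assumptions made for illustration.

```python
import math


def score_chain(gold_steps, pred_steps, gold_answer, pred_answer, rel_tol=1e-2):
    """Simplified illustration (not CHAINEVAL): in-order greedy step matching.

    gold_steps / pred_steps are lists of numeric intermediate values.
    """
    matched = 0
    cursor = 0  # predicted steps must match gold steps in order
    for gold_value in gold_steps:
        for i in range(cursor, len(pred_steps)):
            if math.isclose(pred_steps[i], gold_value, rel_tol=rel_tol):
                matched += 1
                cursor = i + 1
                break
    step_recall = matched / len(gold_steps) if gold_steps else 1.0
    answer_correct = math.isclose(pred_answer, gold_answer, rel_tol=rel_tol)
    return {"step_recall": step_recall, "answer_correct": answer_correct}


if __name__ == "__main__":
    # Hypothetical gold trace: $5000 at 5% compounded yearly for 3 years.
    gold = [1.05, 1.157625, 5788.13, 788.13]
    # A prediction that states only the final number, with no reasoning.
    print(score_chain(gold, [788.13], 788.13, 788.13))
    # A prediction that compounds for 2 years instead of 3.
    print(score_chain(gold, [1.05, 1.1025, 5512.5, 512.5], 788.13, 512.5))
```

In the first call the final answer is correct even though three of the four gold steps are missing, which is the kind of gap that answer-only scoring cannot reveal.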