New Benchmark Tests LLM Reasoning for Financial Time-Series Analysis
A new evaluation framework and benchmark assess the reasoning abilities of Large Language Models (LLMs) on complex financial tasks. The work addresses a persistent gap in quantitative finance: standard benchmarks rarely isolate an agent's core ability to parse a query and orchestrate the right computations. Using the Time Series Augmented Generation (TSAG) framework, the authors ran a large-scale empirical study in which an LLM agent delegates quantitative tasks to reliable external tools. The benchmark comprises 100 financial questions and is used to compare several state-of-the-art agents, including GPT-4o, Llama 3, and Qwen2, on metrics covering tool selection accuracy, faithfulness, and hallucination. The findings show that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination. The paper is available on arXiv under identifier 2604.19633v1.
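The summary does not include the paper's code, but the delegation pattern it describes (an LLM routing quantitative work to deterministic tools rather than computing in free-form text) can be illustrated with a minimal sketch. The tool names (moving_average, annualized_volatility) and the keyword-based router standing in for the LLM's selection step are hypothetical, not taken from the paper.

```python
import statistics

# Hypothetical external tools: deterministic numeric routines the agent
# delegates to instead of doing arithmetic itself in generated text.
def moving_average(prices, window):
    """Simple moving average over the trailing `window` observations."""
    return sum(prices[-window:]) / window

def annualized_volatility(prices, periods_per_year=252):
    """Annualized volatility of simple returns."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return statistics.stdev(returns) * periods_per_year ** 0.5

TOOLS = {
    "moving average": lambda p: moving_average(p, window=5),
    "volatility": annualized_volatility,
}

def route_query(query, prices):
    """Stand-in for the LLM's tool-selection step: pick a registered tool
    by keyword and return (tool_name, result). In a TSAG-style agent the
    LLM itself would choose the tool and its arguments."""
    for keyword, tool in TOOLS.items():
        if keyword in query.lower():
            return keyword, tool(prices)
    raise ValueError(f"no registered tool matches query: {query!r}")

if __name__ == "__main__":
    prices = [101.2, 102.5, 101.8, 103.1, 104.0, 103.6, 105.2]
    print(route_query("What is the 5-day moving average?", prices))
    print(route_query("Estimate the annualized volatility.", prices))
```

Because each tool is a plain function with a fixed contract, its output can be checked directly against the agent's final answer, which is what makes the faithfulness and hallucination metrics below measurable at all.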
Key facts
- A new evaluation methodology and benchmark measure LLM reasoning for financial time-series analysis
- Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations
- The Time Series Augmented Generation (TSAG) framework was used in a large-scale empirical study
- The benchmark consists of 100 financial questions
- Multiple SOTA agents were compared, including GPT-4o, Llama 3, and Qwen2
- Metrics assess tool selection accuracy, faithfulness, and hallucination (see the scoring sketch after this list)
- Capable agents can achieve near-perfect tool-use accuracy with minimal hallucination
- The paper was posted to arXiv under identifier 2604.19633v1
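The summary names the three metrics but not their formal definitions, so the sketch below uses plausible readings, not the paper's official formulas: tool selection accuracy as the fraction of questions where the agent picked the expected tool, faithfulness as the fraction of answers whose reported number matches the tool's output within a tolerance, and hallucination rate as its complement. The record schema is likewise an assumption.

```python
import math

def score_run(records, rel_tol=1e-6):
    """Score a benchmark run. Each record is a dict with:
      expected_tool / chosen_tool : the correct and the selected tool
      tool_output / reported_value: the tool's result and the number the
                                    agent stated in its final answer
    These definitions are illustrative readings of the summary's metrics,
    not the paper's official ones."""
    n = len(records)
    correct_tool = sum(r["chosen_tool"] == r["expected_tool"] for r in records)
    faithful = sum(
        math.isclose(r["reported_value"], r["tool_output"], rel_tol=rel_tol)
        for r in records
    )
    return {
        "tool_selection_accuracy": correct_tool / n,
        "faithfulness": faithful / n,           # answer matches tool output
        "hallucination_rate": 1 - faithful / n, # answer contradicts the tools
    }

if __name__ == "__main__":
    run = [
        {"expected_tool": "volatility", "chosen_tool": "volatility",
         "tool_output": 0.182, "reported_value": 0.182},
        {"expected_tool": "moving average", "chosen_tool": "moving average",
         "tool_output": 103.9, "reported_value": 103.9},
        {"expected_tool": "volatility", "chosen_tool": "moving average",
         "tool_output": 103.9, "reported_value": 0.25},
    ]
    print(score_run(run))
```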
Entities
Institutions
- arXiv