LinAlg-Bench Reveals LLM Math Failure Threshold at 4x4 Matrices
LinAlg-Bench is a diagnostic benchmark that tests 10 frontier large language models on structured linear algebra tasks over 3x3, 4x4, and 5x5 matrices. It spans 9 task types and 660 SymPy-certified problems, for a total of 6,600 evaluated model outputs (10 models × 660 problems). Beyond binary accuracy, the benchmark runs a three-stage automated forensic pipeline that classifies 1,156 failures into ten primary error tags with fine-grained subtypes. The key finding is a sharp behavioral threshold at the 4x4 scale: below it, failures are mostly execution errors such as sign tracking failures, arithmetic drift, and parity mistakes; above it, models tend to abandon computation and often fabricate answers through tool roleplay and constraint-consistent confabulation.
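To make "binary accuracy against a certified answer" concrete, here is a minimal sketch of exact grading for a determinant task. It is not the benchmark's code: the benchmark certifies its problems with SymPy, while this sketch swaps in the standard library's Fraction type for exact arithmetic. The names det, grade, and problem, and the example matrix, are hypothetical.

```python
from fractions import Fraction

def det(m):
    """Exact determinant via Gaussian elimination over rationals."""
    a = [[Fraction(x) for x in row] for row in m]
    n = len(a)
    sign = 1
    result = Fraction(1)
    for col in range(n):
        # Find a nonzero pivot at or below the diagonal.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign  # each row swap flips the determinant's sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign * result

def grade(matrix, model_answer):
    """Binary accuracy: True only if the answer equals the exact value."""
    return Fraction(model_answer) == det(matrix)

# Hypothetical 3x3 problem, not taken from the benchmark.
problem = [[2, -1, 0],
           [1, 3, 4],
           [0, 5, -2]]

print(det(problem))          # exact ground truth
print(grade(problem, -54))   # a correct answer
print(grade(problem, -50))   # an answer that is off by a sign slip
```

Exact rational arithmetic matters here because a floating-point comparison could mark a correct answer wrong, or hide a small arithmetic drift.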
Key facts
- LinAlg-Bench evaluates 10 frontier large language models.
- Benchmark covers 3x3, 4x4, and 5x5 matrices.
- Includes 9 task types and 660 SymPy-certified problems.
- Total of 6,600 model outputs evaluated.
- Three-stage automated forensic pipeline classifies failures.
- 1,156 failures classified into ten primary error tags.
- Sharp behavioral threshold at 4x4 scale identified.
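- Below the 4x4 threshold: execution errors such as sign tracking failures, arithmetic drift, and parity mistakes.
- Above the 4x4 threshold: computation abandoned, with answers fabricated via tool roleplay and constraint-consistent confabulation.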
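The sign tracking failures named above can be pictured with a cofactor expansion, where each term along the first row carries an alternating sign. The sketch below is only an illustration, assuming such an error amounts to dropping that sign. The benchmark's own tag definitions are not reproduced here, and the function names and the matrix are hypothetical.

```python
def minor(m, i, j):
    """Submatrix of m with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det_cofactor(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        sign = (-1) ** j  # alternating sign: +, -, +, ...
        total += sign * entry * det_cofactor(minor(m, 0, j))
    return total

def det_with_sign_slip(m):
    """Same top-level expansion, but the alternating sign is forgotten."""
    if len(m) == 1:
        return m[0][0]
    # The minors are still computed correctly; only the outer sign is lost.
    return sum(entry * det_cofactor(minor(m, 0, j))
               for j, entry in enumerate(m[0]))

# Hypothetical 3x3 matrix, not taken from the benchmark.
m = [[2, -1, 0],
     [1, 3, 4],
     [0, 5, -2]]

print(det_cofactor(m))        # -54
print(det_with_sign_slip(m))  # -50
```

The two results differ by a single mis-signed term, so every intermediate product is right and only the bookkeeping fails. That is the sense in which such a failure counts as an execution error rather than an abandoned computation.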
Entities
Institutions
- arXiv
- SymPy