ARTFEED — Contemporary Art Intelligence

New Benchmark Evaluates Math Formula Extraction from PDFs

other · 2026-05-07

A new benchmarking framework targets the extraction of mathematical formulas from PDFs, a capability that matters for training large language models and for building scientific knowledge bases. Existing benchmarks either exclude formulas or lack evaluation metrics that account for their meaning. The framework uses synthetically generated PDFs with precise LaTeX ground truth, enabling controlled experiments over layout and content. For evaluation, an LLM judges the semantic equivalence of parsed formulas, so mathematically identical expressions written in different notation are scored as matches rather than penalized for surface differences. A human study with 250 formula pairs and 750 ratings from 30 evaluators validated the approach: the LLM judge reached a Pearson correlation of r=0.78 with human judgment, versus r=0.34 for character-level matching (CDM) and near-zero for plain text similarity. The pipeline uses a two-stage matching process that combines LLM-based extraction with semantic evaluation.
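
To make the judging step concrete, here is a minimal Python sketch of an LLM-as-a-judge scorer for formula equivalence. The prompt wording, the 1-to-5 rating scale, and the ask_llm callable are assumptions for illustration, not the paper's actual interface.

    # Sketch: score one parsed formula against its LaTeX ground truth.
    from typing import Callable

    # Hypothetical prompt; the paper's actual prompt and scale may differ.
    JUDGE_PROMPT = (
        "You are a mathematics expert. Judge whether the two LaTeX formulas\n"
        "below are semantically equivalent, ignoring purely notational\n"
        "differences. Reply with one integer from 1 (unrelated) to 5\n"
        "(mathematically identical).\n"
        "Ground truth: {gt}\n"
        "Parsed output: {pred}\n"
    )

    def judge_equivalence(gt: str, pred: str,
                          ask_llm: Callable[[str], str]) -> int:
        """Return a 1-5 equivalence rating from an LLM judge."""
        reply = ask_llm(JUDGE_PROMPT.format(gt=gt, pred=pred))
        digits = [c for c in reply if c.isdigit()]
        if not digits:
            raise ValueError(f"judge gave no numeric rating: {reply!r}")
        return max(1, min(5, int(digits[0])))  # clamp to the rating scale

    # Call shape only: a stub judge that always answers "5".
    print(judge_equivalence(r"\frac{1}{2} x^2", r"x^2 / 2", lambda prompt: "5"))

Because the judge returns a bounded integer score, its outputs can be compared directly against human ratings on the same scale, which is exactly what the study's correlation analysis does.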

Key facts

  • The paper (arXiv:2512.09874v2) benchmarks document parsers for mathematical formula extraction from PDFs.
  • The framework uses synthetically generated PDFs with LaTeX ground truth.
  • Evaluation uses LLM-as-a-judge for semantic equivalence of formulas.
  • Human study: 250 formula pairs, 750 ratings from 30 evaluators.
  • Pearson correlation with human judgment: r=0.78 (see the correlation sketch after this list).
  • Character-level matching (CDM) reached r=0.34; plain text similarity was near r=0.
  • Two-stage matching pipeline combines LLM-based extraction with semantic evaluation.
  • Existing benchmarks exclude formulas or lack semantically aware metrics.
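
The meta-evaluation behind the correlation figures above can be reproduced mechanically: correlate each automatic metric's scores against the human ratings. The sketch below does this with SciPy; all score values are invented for illustration and do not come from the paper.

    from scipy.stats import pearsonr

    # Made-up scores for six formula pairs; the paper's study used 250
    # pairs with 750 human ratings from 30 evaluators.
    human      = [5, 4, 1, 3, 5, 2]   # e.g. mean human rating per pair
    llm_judge  = [5, 5, 1, 3, 4, 2]   # LLM-as-a-judge scores
    char_match = [4, 1, 2, 3, 1, 4]   # character-level similarity, rescaled to 1-5

    for name, scores in [("LLM judge", llm_judge), ("char match", char_match)]:
        r, p = pearsonr(human, scores)
        print(f"{name}: r = {r:.2f} (p = {p:.3f})")

A higher r means the metric ranks parser outputs more like a human would; the paper's headline result is that the LLM judge (r=0.78) tracks human judgment far better than character-level or text-similarity baselines.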
