New Benchmark Evaluates LLMs in Petroleum Engineering

ai-technology · 2026-05-28

PetroBench is a new benchmark for assessing large language models (LLMs) in petroleum engineering. It was built through a three-stage process of data preprocessing, quality filtering, and validation across multiple models, with expert review to confirm domain relevance. It contains 1,200 questions spanning production, reservoir, and drilling engineering, in formats such as multiple-choice, true/false, term definition, and short answer. Eight mainstream LLMs, including Gemini-3-Pro and Kimi-K2.5, were evaluated in a standardized API setting. The models performed better on subjective questions than on objective ones, which points to weaknesses in identifying factual knowledge. The highest accuracies recorded were 65.3% on multiple-choice and 74.3% on true/false questions.
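The per-format accuracy figures above are simple ratios of correct answers to questions asked. As a rough illustration only, the sketch below shows how such scores could be tallied for objective formats. The record fields (`format`, `predicted`, `answer`), the function name, and the sample data are all hypothetical; this is not PetroBench's actual scoring code or data, and subjective formats such as short answers would need a different grading method.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return {format: fraction correct} for objective question formats."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        fmt = rec["format"]
        totals[fmt] += 1
        # Normalize case and whitespace before comparing to the answer key.
        if rec["predicted"].strip().lower() == rec["answer"].strip().lower():
            correct[fmt] += 1
    return {fmt: correct[fmt] / totals[fmt] for fmt in totals}


# Made-up records for illustration only; not PetroBench data.
sample = [
    {"format": "multiple_choice", "predicted": "B", "answer": "B"},
    {"format": "multiple_choice", "predicted": "C", "answer": "A"},
    {"format": "true_false", "predicted": "True", "answer": "true"},
    {"format": "true_false", "predicted": "false", "answer": "false"},
]

print(accuracy_by_format(sample))
```

Exact-match comparison after normalization is assumed here because multiple-choice and true/false answers have a single correct label; the article does not say how the benchmark graded responses.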

Key facts

PetroBench is a benchmark for LLMs in petroleum engineering.
The benchmark includes 1,200 questions across multiple formats.
It covers production, reservoir, and drilling engineering.
Eight mainstream LLMs were evaluated.
Highest multiple-choice accuracy was 65.3%.
Highest true/false accuracy was 74.3%.
Models performed better on subjective than objective questions.
Gemini-3-Pro and Kimi-K2.5 were among the models tested.

Sources

arXiv cs.AI — 2026-05-28