ARTFEED — Contemporary Art Intelligence

LLMs Struggle with Exact Computation: PoT Achieves Perfect Accuracy

ai-technology · 2026-05-07

A recent study posted on arXiv examines prompting techniques for precise, deterministic computation in Large Language Models (LLMs). The authors evaluate Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC) on tasks such as binary counting, longest substring detection, and arithmetic evaluation, using a purpose-built synthetic dataset with varied natural language instructions for controlled comparison. The results show that conventional prompting yields only moderate accuracy on sequence-based tasks: CoT offers modest gains, while Least-to-Most suffers from error accumulation across its decomposition steps. PoT, by contrast, attains perfect accuracy by generating executable code, so the final answer is computed by an interpreter rather than predicted token by token. The study highlights the limitations of current LLMs for exact computation and the promise of execution-based techniques.
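The PoT idea can be sketched minimally: the model emits a small program, and the host executes it to obtain an exact answer. The snippet below uses binary counting, one of the task types named above; the program text and variable names are illustrative, not the paper's actual prompts.

```python
# Program-of-Thought (PoT) offloads exact computation to an interpreter:
# the LLM writes a program, and the *executed* result is the answer,
# avoiding token-level arithmetic errors.

# Stand-in for model output on a binary-counting task (illustrative).
generated_program = """
def count_ones(bits):
    # Deterministic counting -- exact by construction.
    return sum(1 for b in bits if b == '1')

result = count_ones("1011001110001111")
"""

namespace = {}
exec(generated_program, namespace)  # execute the model-generated code
answer = namespace["result"]
print(answer)  # exact count of '1' characters: 10
```

Executing model-generated code is what makes PoT deterministic on these tasks: correctness depends only on the generated program, not on the model's ability to perform arithmetic in-context.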

Key facts

  • arXiv paper 2605.03227 evaluates prompting strategies for deterministic computation in LLMs.
  • Methods tested: Chain-of-Thought, Least-to-Most, Program-of-Thought, Self-Consistency.
  • Tasks: binary counting, longest substring detection, arithmetic evaluation.
  • A synthetic dataset with diverse natural language instructions was introduced.
  • Standard prompting methods achieve only moderate accuracy on sequence-based tasks.
  • CoT provides limited improvement; Least-to-Most suffers from error accumulation.
  • PoT achieves perfect accuracy by generating executable code.
  • The study underscores LLMs' limitations for exact computation.
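Self-Consistency, the fourth method in the list above, samples several reasoning paths and majority-votes their final answers. A minimal sketch, with a hypothetical sampler stub standing in for repeated LLM calls:

```python
# Self-Consistency (SC): sample multiple answers and take the mode.
# `sample_answer` is a hypothetical stand-in for one LLM reasoning run.
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """Return the most common final answer across n sampled runs."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Simulated noisy answers from five runs (illustrative data).
samples = iter([7, 7, 6, 7, 8])
consensus = self_consistency(lambda: next(samples))
print(consensus)  # majority answer: 7
```

Voting can average out occasional reasoning slips, but as the study's results suggest, it cannot fix a systematic inability to compute exactly, which is where execution-based methods like PoT differ.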

Entities

Institutions

  • arXiv

Sources