Local LLMs and Layout-Aware Parsing for Tabular PDF Extraction: A Reliability Study

other · 2026-05-25

This study assesses how reliably structured data can be extracted from tabular academic PDFs, focusing on Indonesian course registration documents (Kartu Rencana Studi, KRS). It compares three strategies: LLM only, a hybrid deterministic-LLM approach (regex plus LLM), and a Camelot-based pipeline with LLM fallback. The experiments use 140 documents for the LLM tests and 860 for the Camelot pipeline, spanning four study programs. Three LLMs (Gemma 3, Phi 4, Qwen 2.5), each with 12-14 billion parameters, run locally via Ollama on consumer-grade CPUs. The study addresses challenges such as documents that mix free text with tables, variation across programs, and Unicode artifacts.
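The hybrid deterministic-LLM strategy can be illustrated with a minimal sketch. This is not the study's implementation: the field names, the regex patterns, and the `llm_fallback` callable are all hypothetical, since the summary does not describe the actual KRS layout or prompts. The sketch only shows the general pattern of trying a deterministic rule first and calling a model for whatever the rule misses.

```python
import re
from typing import Callable, Optional

# Hypothetical metadata patterns; the real KRS fields are not given in the summary.
FIELD_PATTERNS = {
    "student_id": re.compile(r"NIM\s*:\s*(\d+)"),
    "semester": re.compile(r"Semester\s*:\s*(\d+)"),
}


def extract_fields(
    text: str,
    llm_fallback: Callable[[str, str], Optional[str]],
) -> dict:
    """Try one regex per field; call the LLM only for fields the regex misses."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Deterministic hit: no model call needed.
            result[name] = {"value": match.group(1), "source": "regex"}
        else:
            # Assumed fallback: any callable taking (field name, page text).
            result[name] = {"value": llm_fallback(name, text), "source": "llm"}
    return result
```

Recording the `source` of each value makes it possible to measure how often the deterministic path succeeds and how often the model is needed, which is the kind of comparison a reliability study depends on. In practice `llm_fallback` could wrap a local model served by Ollama, but that wiring is left out here.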

Key facts

Study evaluates tabular PDF extraction reliability using Indonesian KRS documents.
Three strategies compared: LLM only, hybrid deterministic-LLM, Camelot with LLM fallback.
140 documents used for LLM tests, 860 for Camelot pipeline evaluation.
Documents span four study programs whose tables and metadata vary.
Three 12-14B LLMs (Gemma 3, Phi 4, Qwen 2.5) run locally via Ollama on consumer-grade CPUs.
Challenges include mixed free text and tables, cross-program variation, and Unicode artifacts.
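
Unicode artifacts in text extracted from PDFs are commonly handled with a normalization pass before any parsing. The sketch below is an assumption about what such a pass might look like, not the study's method: the summary does not say which artifacts occurred, so the set of invisible characters removed here and the helper name `clean_cell` are illustrative choices.

```python
import re
import unicodedata

# Assumed set of invisible characters: zero-width space/joiners, word joiner,
# byte-order mark, and soft hyphen. NFKC alone does not remove these.
_INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u00ad]")


def clean_cell(text: str) -> str:
    """Normalize one extracted table cell or text line."""
    # NFKC folds compatibility forms, e.g. ligatures and non-breaking spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop invisible characters that survive normalization.
    text = _INVISIBLE.sub("", text)
    # Collapse whitespace runs so downstream matching sees single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Running this before regex matching or prompting keeps visually identical strings from comparing as different, which matters when the same course name appears in documents from several programs.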

Entities

Institutions

Ollama

Locations

Indonesia

Sources

arXiv cs.AI — 2026-05-25