LiteParse PDF Text Extraction Now Works in Browser
Simon Willison has ported LiteParse, LlamaIndex's PDF text extraction tool, to run entirely in the browser. LiteParse relies on traditional PDF parsing with a Tesseract OCR fallback rather than AI models. The browser version is built on PDF.js and Tesseract.js and is deployed via GitHub Pages at simonw.github.io/liteparse. Willison built it with Claude Code on his iPhone and laptop, with a total build time of 59 minutes. He describes it as a pure "vibe coding" project and has not reviewed any of the generated HTML or TypeScript code. The tool supports spatial text parsing for multi-column layouts, optional OCR, and page image display. Willison has opened an issue on the original LiteParse repository but has not submitted a pull request. The project was announced on 23 April 2026.
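To make the "parse first, fall back to OCR" idea concrete, here is a minimal TypeScript sketch. It is not taken from LiteParse or from Willison's port: the `PageSource` interface, the `needsOcr` heuristic and its `minChars` threshold are all assumptions made for illustration. In a browser, the two callbacks might wrap PDF.js text extraction and Tesseract.js recognition respectively.

```typescript
// Hypothetical page abstraction (an assumption, not a PDF.js or LiteParse type).
// embeddedText would read the PDF's own text layer; ocrText would run OCR
// on a rendered image of the page.
interface PageSource {
  embeddedText: () => Promise<string>;
  ocrText: () => Promise<string>;
}

// Assumed heuristic: treat a page as scanned when its text layer contains
// fewer than minChars non-whitespace characters.
function needsOcr(embedded: string, minChars = 16): boolean {
  return embedded.replace(/\s+/g, "").length < minChars;
}

// Prefer the embedded text layer; only pay for OCR when OCR is enabled
// and the text layer looks empty.
async function extractPageText(page: PageSource, ocrEnabled: boolean): Promise<string> {
  const embedded = await page.embeddedText();
  if (ocrEnabled && needsOcr(embedded)) {
    return page.ocrText();
  }
  return embedded;
}
```

Keeping OCR behind a flag mirrors the "optional OCR" behaviour described above: recognition is slow compared with reading an existing text layer, so a sketch like this only invokes it when the cheaper path yields almost nothing.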
Key facts
- LiteParse is an open-source PDF text extraction tool by LlamaIndex.
- The browser version runs entirely client-side using PDF.js and Tesseract.js.
- LiteParse uses traditional parsing and OCR, not AI models.
- The web app is deployed at simonw.github.io/liteparse via GitHub Pages.
- Simon Willison built the browser version using Claude Code in 59 minutes.
- Willison has not reviewed any of the generated code.
- The tool supports spatial text parsing for multi-column layouts.
- Willison opened an issue on the original LiteParse repository but has not submitted a pull request.
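Spatial text parsing for multi-column layouts can be illustrated with a short TypeScript sketch. This is one simple approach (detecting a vertical gutter, then reading each column top to bottom), not LiteParse's actual algorithm. The `TextFragment` shape, the function names, and the `minGutter` and `lineTolerance` defaults are assumptions chosen for the example.

```typescript
// Hypothetical positioned text fragment. Real PDF parsers report similar
// data, but this exact shape is an assumption for illustration.
interface TextFragment {
  text: string;
  x: number;     // left edge, in page units
  y: number;     // distance from the top of the page, in page units
  width: number; // horizontal extent, in page units
}

// Split fragments into columns wherever a horizontal gap wider than
// minGutter separates everything seen so far from the next fragment.
function splitIntoColumns(fragments: TextFragment[], minGutter: number): TextFragment[][] {
  const sorted = [...fragments].sort((a, b) => a.x - b.x);
  const columns: TextFragment[][] = [];
  let current: TextFragment[] = [];
  let rightEdge = -Infinity;
  for (const frag of sorted) {
    if (columns.length === 0 || frag.x - rightEdge > minGutter) {
      current = [];
      columns.push(current);
    }
    current.push(frag);
    rightEdge = Math.max(rightEdge, frag.x + frag.width);
  }
  return columns;
}

// Within one column, group fragments whose y values are within
// lineTolerance of the line's first fragment, then order each line by x.
function columnToText(column: TextFragment[], lineTolerance: number): string {
  const byY = [...column].sort((a, b) => a.y - b.y);
  const lines: TextFragment[][] = [];
  let current: TextFragment[] = [];
  let lineY = -Infinity;
  for (const frag of byY) {
    if (lines.length === 0 || Math.abs(frag.y - lineY) > lineTolerance) {
      current = [];
      lines.push(current);
      lineY = frag.y;
    }
    current.push(frag);
  }
  return lines
    .map((line) => [...line].sort((a, b) => a.x - b.x).map((f) => f.text).join(" "))
    .join("\n");
}

// Read columns left to right, separating them with a blank line.
function extractReadingOrder(
  fragments: TextFragment[],
  minGutter = 20,
  lineTolerance = 2
): string {
  return splitIntoColumns(fragments, minGutter)
    .map((column) => columnToText(column, lineTolerance))
    .join("\n\n");
}
```

A naive top-to-bottom sort would interleave the left and right columns line by line. Splitting on the gutter first keeps each column's text together, at the cost of assuming the gutter runs clear down the whole page.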
Entities
Institutions
- LlamaIndex
- GitHub
- Anthropic