LLMbench Workbench Enables Comparative Close Reading of Large Language Model Outputs
LLMbench is a web-based platform for the comparative close reading of outputs from large language models, distinguishing itself from quantitative evaluation tools such as Google PAIR's LLM Comparator. Oriented toward digital humanities hermeneutics, it displays two model responses side by side in annotatable panels. Four analytical overlays support close reading: Probabilities, which exposes token-level log-probabilities; Differences, which highlights word-level comparisons; Tone, which surfaces Hyland-style metadiscourse; and Structure, which parses sentences and highlights discourse connectives. Five analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence) make the probabilistic nature of generated text legible. Announced on arXiv under identifier 2604.15508v1, the tool emphasizes interpretive analysis over performance metrics, merging computational linguistics with humanistic inquiry.
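The source does not describe how the Differences overlay is implemented; a minimal sketch of the underlying idea, word-level comparison of two model responses, can be written with Python's standard `difflib` (an illustrative assumption, not the tool's actual machinery):

```python
import difflib

# Two hypothetical model responses to the same prompt, tokenized by whitespace.
response_a = "The cat sat on the mat".split()
response_b = "The cat lay on a mat".split()

# SequenceMatcher aligns the two word sequences; non-equal opcodes
# mark the spans a Differences-style overlay would highlight.
matcher = difflib.SequenceMatcher(a=response_a, b=response_b)
ops = [
    (tag, response_a[i1:i2], response_b[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(ops)  # each entry: (edit type, words in A, words in B)
```

A real overlay would map these spans back to character offsets for highlighting, but the alignment step is the core of any word-level comparison view.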
Key facts
- LLMbench is a browser-based workbench for comparative close reading of LLM outputs
- It contrasts with quantitative evaluation tools like Google PAIR's LLM Comparator
- The tool is oriented toward digital humanities hermeneutic practices
- It displays two model responses side-by-side in annotatable panels
- The four analytical overlays are Probabilities, Differences, Tone, and Structure
- Five analytical modes examine Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence
- The tool makes the probabilistic structure of generated text legible at the token level
- It was announced on arXiv under identifier 2604.15508v1
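To make token-level probabilistic structure legible, as the Probabilities overlay does, log-probabilities must be converted into something readable at a glance. A minimal sketch under assumed inputs (the token/log-probability pairs and the band thresholds are illustrative, not taken from the paper):

```python
import math

def confidence_band(logprob, high=0.9, low=0.5):
    """Map a token log-probability to a coarse confidence band.

    Thresholds are arbitrary illustrative choices; an overlay might
    instead map the probability to a continuous color scale.
    """
    p = math.exp(logprob)  # log-probability -> probability in [0, 1]
    if p >= high:
        return "high"
    if p >= low:
        return "medium"
    return "low"

# Hypothetical per-token log-probabilities, as a sampling API might report them.
tokens = [("The", -0.02), ("capital", -0.5), ("is", -1.9)]
bands = [(tok, confidence_band(lp)) for tok, lp in tokens]
print(bands)
```

Rendering each token in a color keyed to its band is one way such an overlay could make high- and low-confidence regions of a generation visible.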
Entities
Institutions
- Google PAIR