ARTFEED — Contemporary Art Intelligence

LLM-as-a-Judge Framework for Evaluating Sustainable City Trip Recommendations

ai-technology · 2026-04-29

A recent research paper, submitted to arXiv, introduces a three-phase calibration framework for using LLMs as judges of sustainable city trip recommendations along four dimensions: relevance, diversity, sustainability, and popularity balance. The work addresses the difficulty of evaluating complex conversational travel recommendations, where human annotation is expensive and conventional metrics overlook stakeholder-centric objectives. The framework proceeds from baseline judging with multiple LLMs, through expert review to identify systematic deviations, to dimension-specific calibration using rules and few-shot examples. Across two recommendation contexts, the researchers observed model-specific biases and substantial variance between dimensions. Calibration improved the clarity of per-dimension reasoning but exposed divergent interpretations of sustainability, underscoring the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility.
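To make the setup concrete, the dimension-specific judging step might look like the sketch below. This is an illustrative assumption, not the paper's released prompts: the function names (`build_judge_prompt`, `parse_score`), the example rules, and the few-shot pairs are all hypothetical, and a real model call would replace the string response shown here.

```python
import re

# Hypothetical sketch of dimension-specific calibration for ONE dimension.
# Rules and few-shot examples are invented for illustration; the paper's
# actual prompts are in its released code.

RULES = {
    "sustainability": [
        "Reward low-emission transport options (e.g. rail over short-haul flights).",
        "Penalize itineraries concentrated on overtouristed sites.",
    ],
}

FEW_SHOTS = {
    "sustainability": [
        ("3-day Amsterdam trip using trains and bikes only", "Score: 5"),
        ("Weekend in Venice at peak season, arriving by cruise ship", "Score: 2"),
    ],
}

def build_judge_prompt(dimension: str, recommendation: str) -> str:
    """Compose a judging prompt with calibration rules and few-shot examples."""
    lines = [f"You are evaluating a city-trip recommendation on: {dimension}."]
    lines += [f"Rule: {rule}" for rule in RULES[dimension]]
    for example, verdict in FEW_SHOTS[dimension]:
        lines.append(f"Example: {example}\n{verdict}")
    lines.append(f"Recommendation: {recommendation}")
    lines.append("Answer with 'Score: <1-5>' and one sentence of reasoning.")
    return "\n".join(lines)

def parse_score(response: str) -> int:
    """Extract the 1-5 score from a judge response; raise if none is found."""
    match = re.search(r"Score:\s*([1-5])", response)
    if match is None:
        raise ValueError("judge response lacks a parsable score")
    return int(match.group(1))
```

Under this reading, phase one (baseline judging) would run the same prompt without `RULES` and `FEW_SHOTS`, phase two would compare parsed scores against expert labels to surface systematic deviations, and phase three would add the calibration material shown above.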

Key facts

  • The paper studies LLMs-as-Judges for sustainable city-trip recommendations.
  • Four evaluation dimensions: relevance, diversity, sustainability, and popularity balance.
  • Three-phase calibration framework: baseline judging, expert evaluation, dimension-specific calibration.
  • Model-specific biases and high dimension-level variance were observed.
  • Calibration exposed divergent interpretations of sustainability.
  • Prompts and code are released for reproducibility.
  • The paper is submitted to arXiv under Computer Science > Artificial Intelligence.
  • The study uses multiple LLMs for baseline judging.

Entities

Institutions

  • arXiv

Sources