JASTIN: New Framework for Zero-Shot Audio Evaluation Using LLMs
Researchers have introduced JASTIN, a framework for generalizable, instruction-driven audio evaluation. It treats audio assessment as a self-instructed reasoning task, linking a frozen high-performance audio encoder to a fine-tuned LLM backbone through a trainable audio adapter. To support strong zero-shot generalization, the authors build an instruction-following data pipeline with four components: Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description. JASTIN reports state-of-the-art Pearson and Spearman correlations with human subjective ratings, addressing the limited domain generalization and instruction adaptability of existing objective metrics and multimodal LLMs.
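The encoder-adapter-LLM wiring described above can be illustrated with a toy, dependency-free sketch. It is not the paper's implementation: the dimensions, the framing used as a stand-in encoder, the single linear projection, and the choice to place audio embeddings before the instruction tokens are all assumptions made for illustration. Only the split between frozen and trainable components comes from the summary.

```python
import random

AUDIO_DIM = 4  # hypothetical encoder feature size
LLM_DIM = 6    # hypothetical LLM embedding size

# Which components receive gradient updates, per the summary above.
TRAINABLE = {"audio_encoder": False, "audio_adapter": True, "llm_backbone": True}


def frozen_audio_encoder(waveform):
    """Toy stand-in for a frozen encoder: split samples into AUDIO_DIM-sized frames."""
    usable = len(waveform) - len(waveform) % AUDIO_DIM  # drop the incomplete tail
    return [waveform[i:i + AUDIO_DIM] for i in range(0, usable, AUDIO_DIM)]


class AudioAdapter:
    """Linear projection from encoder features into the LLM embedding space (assumed form)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, frames):
        # One output vector of length out_dim per input frame.
        return [
            [sum(w * x for w, x in zip(row, frame)) + b
             for row, b in zip(self.weight, self.bias)]
            for frame in frames
        ]


def build_llm_input(instruction_embeddings, audio_embeddings):
    """Concatenate projected audio frames with instruction token embeddings.

    Putting audio first is an assumption, not something the summary states.
    """
    return audio_embeddings + instruction_embeddings


waveform = [0.01 * i for i in range(10)]  # 10 samples -> 2 full frames
frames = frozen_audio_encoder(waveform)
adapter = AudioAdapter(AUDIO_DIM, LLM_DIM)
audio_embeddings = adapter(frames)
instruction_embeddings = [[0.0] * LLM_DIM for _ in range(3)]  # 3 placeholder tokens
llm_input = build_llm_input(instruction_embeddings, audio_embeddings)
print(len(llm_input), len(llm_input[0]))  # prints "5 6"
```

The point of the sketch is the shape contract: whatever the encoder emits, the adapter must map it to vectors the LLM backbone can consume alongside the instruction text.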
Key facts
- JASTIN is a generalizable, instruction-driven audio evaluation framework
- It formulates audio assessment as a self-instructed reasoning task
- Bridges a frozen audio encoder with a fine-tuned LLM backbone via a trainable adapter
- Uses a Multi-Source, Multi-Task, Multi-Calibration, Multi-Description instruction-following data pipeline
- Achieves state-of-the-art correlations with human subjective ratings
- Targets domain generalization and instruction adaptability, where existing objective metrics and multimodal LLMs fall short
- Proposed in arXiv paper 2605.04505
- Focuses on generative audio model evaluation
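For readers unfamiliar with the correlation measures mentioned in the key facts, the following is a minimal sketch of how Pearson and Spearman correlations between model scores and human ratings are typically computed. The score lists are invented for illustration and are not results from the paper; the helper names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example data: human mean opinion scores vs. automatic scores.
human_ratings = [1.5, 2.0, 3.5, 4.0, 4.5]
model_scores = [1.0, 2.5, 3.0, 4.2, 4.1]

print(f"Pearson:  {pearson(human_ratings, model_scores):.3f}")
print(f"Spearman: {spearman(human_ratings, model_scores):.3f}")
```

Pearson rewards scores that track human ratings linearly, while Spearman only checks that the ordering of items agrees, which is why evaluation papers commonly report both.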
Entities
Institutions
- arXiv