HAI-Eval: Benchmarking Human-AI Synergy in Collaborative Coding
HAI-Eval is a new benchmark that measures the synergy of human-AI partnerships in collaborative coding. It addresses a gap left by traditional human tests and LLM benchmarks, both of which focus on well-defined algorithmic problems. HAI-Eval is built on 45 'Collaboration-Necessary' problem templates: problems that are intractable for standalone LLMs or unaided humans but solvable through effective collaboration. The benchmark dynamically creates tasks from these templates and provides a standardized IDE for human participants. Its aim is to capture the shift in development practice brought by LLM-powered coding agents, where success depends on combining human reasoning with AI efficiency.
Key facts
- HAI-Eval is a unified benchmark for human-AI synergy in coding.
- It uses 45 'Collaboration-Necessary' problem templates.
- These problems are intractable for standalone LLMs or unaided humans, but solvable through effective collaboration.
- The benchmark provides a standardized IDE for human participants.
- It addresses the shift in the development paradigm driven by LLM-powered coding agents.
- Existing evaluation systems, which focus on well-defined algorithmic problems, fail to capture collaborative ones.
- Collaborative problems require both human reasoning and AI efficiency.
- HAI-Eval dynamically creates tasks from templates.
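
As a rough illustration of what creating tasks dynamically from templates can look like, the Python sketch below fills a parameterized prompt with seeded random choices so that the same seed always reproduces the same task. Everything in it is an assumption made for illustration: the `ProblemTemplate` class, the `instantiate` function, and the example prompt and parameter values are hypothetical and are not taken from HAI-Eval's actual templates or tooling.

```python
import random
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class ProblemTemplate:
    """Hypothetical template: a prompt with placeholders plus candidate values."""
    template_id: str
    prompt: Template
    parameters: dict  # placeholder name -> list of candidate values


def instantiate(template: ProblemTemplate, seed: int) -> dict:
    """Create one concrete task from a template, reproducibly for a given seed."""
    rng = random.Random(seed)
    # Iterate in sorted key order so the same seed always yields the same task.
    chosen = {
        name: rng.choice(values)
        for name, values in sorted(template.parameters.items())
    }
    return {
        "task_id": f"{template.template_id}-{seed}",
        "prompt": template.prompt.substitute(chosen),
        "parameters": chosen,
    }


# Made-up placeholder content, not an actual HAI-Eval problem.
example = ProblemTemplate(
    template_id="demo-001",
    prompt=Template("Refactor the $component module so that it supports $requirement."),
    parameters={
        "component": ["billing", "search", "auth"],
        "requirement": ["pagination", "audit logging"],
    },
)

task_a = instantiate(example, seed=7)
task_b = instantiate(example, seed=7)
print(task_a["prompt"])
```

Seeding a local `random.Random` instance, rather than the global generator, keeps each task instance reproducible without affecting other randomness in the program.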
Entities
Institutions
- arXiv