CoREB: A Multitask Benchmark and Model for Code Search Beyond Retrieval
Researchers have introduced CoREB, a benchmark for code retrieval and reranking, together with a specialized code reranker. CoREB addresses shortcomings of existing code search evaluations by covering the full pipeline, including reranking, and by using queries styled after real developer needs. The benchmark is built from counterfactually rewritten LiveCodeBench problems across five programming languages, with timed releases and graded relevance judgments. The study evaluates eleven embedding models and five rerankers on three tasks: text-to-code, code-to-text, and code-to-code. Findings show that code-specialized embeddings dominate code-to-code retrieval, while reranking significantly improves performance across all tasks. The design aims to mitigate data contamination, label noise, and the limitations of binary relevance in code search evaluation.
Key facts
- CoREB is a contamination-limited multitask benchmark for code retrieval and reranking.
- The benchmark is built from counterfactually rewritten LiveCodeBench problems.
- It covers five programming languages.
- Timed releases with graded relevance judgments are used.
- Eleven embedding models and five rerankers were benchmarked.
- Three tasks: text-to-code, code-to-text, and code-to-code.
- Code-specialized embeddings dominate code-to-code retrieval.
- Reranking significantly improves performance across tasks.
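The protocol above can be illustrated with a minimal sketch: a two-stage retrieve-then-rerank pipeline scored with nDCG over graded relevance judgments. The `embed_score` and `rerank_score` functions below are hypothetical stand-ins for the embedding models and rerankers the benchmark evaluates; the paper does not specify this exact interface.

```python
import math

def dcg(gains):
    # Discounted cumulative gain over a ranked list of graded gains.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranking, qrels, k=10):
    # Graded relevance: qrels maps doc id -> integer grade (0 = irrelevant).
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    denom = dcg(ideal)
    return dcg(gains) / denom if denom else 0.0

def retrieve_then_rerank(query, corpus, embed_score, rerank_score,
                         depth=100, k=10):
    # Stage 1: cheap embedding similarity over the whole corpus.
    candidates = sorted(corpus, key=lambda d: embed_score(query, d),
                        reverse=True)[:depth]
    # Stage 2: an expensive reranker reorders only the candidate pool.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k]
```

A toy run: if the embedding stage ranks `d1` first but the (hypothetical) reranker prefers `d2`, the final top-k reflects the reranker's order, which is where the benchmark's reranking gains would show up in the nDCG score.

```python
corpus = ["d1", "d2", "d3"]
embed = lambda q, d: {"d1": 0.9, "d2": 0.5, "d3": 0.1}[d]
rerank = lambda q, d: {"d1": 0.2, "d2": 0.8, "d3": 0.1}[d]
final = retrieve_then_rerank("query", corpus, embed, rerank, depth=2, k=2)
# final == ["d2", "d1"]: the reranker promoted d2 within the candidate pool.
```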
Entities
Venue
- arXiv