OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

ai-technology · 2026-05-16

OpenDeepThink is a novel test-time compute framework that uses pairwise Bradley-Terry comparison to select the strongest LLM reasoning candidates. The method samples several candidates in parallel, has the LLM judge random pairs of them, and aggregates the votes into a global ranking. In each generation the top-ranked candidates are preserved, the top 75% are mutated using natural-language critiques produced during judging, and the bottom 25% are discarded. In experiments, OpenDeepThink raised the effective Codeforces Elo of Gemini 3.1 Pro by 405 points over eight sequential rounds of LLM calls, taking about 27 minutes. The approach targets the selection bottleneck: picking the best candidate without a ground-truth verifier, a setting in which pointwise LLM assessments tend to be noisy and biased. The paper is available on arXiv as 2605.15177.
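To make the aggregation step concrete, here is a minimal sketch of how pairwise votes can be turned into a global ranking under the Bradley-Terry model, in which candidate i beats candidate j with probability p_i / (p_i + p_j). This summary does not say how the paper fits the model, so the sketch uses the standard minorization-maximization iteration. The function names `bradley_terry` and `rank`, the `prior` smoothing term, and the iteration count are assumptions made for illustration.

```python
from typing import List, Tuple


def bradley_terry(n: int, comparisons: List[Tuple[int, int]],
                  iters: int = 200, prior: float = 0.1) -> List[float]:
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs."""
    if n < 2:
        return [1.0] * n

    # wins[i][j] = number of times candidate i beat candidate j.
    # `prior` (an assumption of this sketch) adds a fractional virtual win in
    # each direction so a candidate with no real wins keeps a positive strength.
    wins = [[0.0 if i == j else prior for j in range(n)] for i in range(n)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j),
            # where n_ij is the number of comparisons between i and j.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        # Strengths are only defined up to scale; normalise to mean 1.
        scale = n / sum(updated)
        p = [x * scale for x in updated]
    return p


def rank(strengths: List[float]) -> List[int]:
    """Return candidate indices ordered from strongest to weakest."""
    return sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
```

In use, each LLM judgment on a random pair would append one `(winner, loser)` tuple to `comparisons`, and `rank(bradley_terry(n, comparisons))` would give the ordering used for selection. Because the model pools evidence across all pairs, a candidate can be ranked against another it was never directly compared with.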

Key facts

OpenDeepThink uses pairwise Bradley-Terry comparison for candidate selection.
The LLM judges random pairs of candidates and aggregates votes into a global ranking.
Top-ranked candidates are preserved; top three-quarters are mutated using natural-language critiques.
Bottom quarter of candidates is discarded each generation.
Gemini 3.1 Pro's effective Codeforces Elo increased by 405 points in eight rounds (~27 minutes).
The method scales breadth by sampling multiple candidates in parallel.
It addresses the selection bottleneck without a ground-truth verifier.
Paper available on arXiv: 2605.15177.
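
The preserve, mutate, and discard rules listed above can be sketched as a single generation step. This is an illustration under stated assumptions, not the paper's implementation: `next_generation`, `n_elite`, `keep_fraction`, and the `mutate` callback (standing in for an LLM call that revises a candidate given its critique) are hypothetical names. The summary does not say how many top candidates are preserved or how the population is refilled, so the sketch keeps one elite by default and makes no attempt to restore the original population size.

```python
from typing import Callable, Dict, List


def next_generation(
    ranked: List[str],
    critiques: Dict[str, str],
    mutate: Callable[[str, str], str],
    n_elite: int = 1,
    keep_fraction: float = 0.75,
) -> List[str]:
    """Build the next population from candidates sorted best-first."""
    # Discard the bottom quarter: only the top `keep_fraction` survive.
    n_keep = max(1, int(len(ranked) * keep_fraction))
    survivors = ranked[:n_keep]

    # Preserve the very best candidates unchanged.
    elites = survivors[:n_elite]

    # Mutate every survivor using its natural-language critique
    # (an empty string if no critique was recorded for it).
    mutated = [mutate(c, critiques.get(c, "")) for c in survivors]

    return elites + mutated
```

With eight ranked candidates and the defaults, six survive, the best one is carried over unchanged, and six revised candidates are produced from the critiques.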
