CopT Reverses Chain-of-Thought for Efficient LLM Reasoning

ai-technology · 2026-05-20

A new paper introduces CopT (Contrastive On-Policy Thinking), a reasoning pipeline that reverses the traditional chain-of-thought (CoT) order. Instead of thinking before answering, CopT first generates a draft answer, then performs on-policy thinking conditioned on that draft for reflection and correction. It uses continuous embeddings as inference-time contrastive verifiers to assess draft answer trustworthiness. The approach aims to reduce token costs and avoid performative reasoning. The paper is published on arXiv under ID 2605.20075.

Key facts

CopT reverses the order of thinking and answering in LLM reasoning.
It first elicits a draft answer, then invokes on-policy thinking for reflection.
Continuous embeddings are used as contrastive verifiers at inference time.
The approach aims to reduce performative reasoning and unnecessary token costs.
The paper is available on arXiv with ID 2605.20075.
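
The draft-first flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the callables `draft_fn`, `embed_fn`, and `think_fn`, the reference vectors `accept_vec` and `reject_vec`, the cosine-difference scoring rule, and the choice to skip the thinking pass when the draft scores above a threshold are all hypothetical stand-ins for details this summary does not give.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contrastive_score(draft_vec: Vector, accept_vec: Vector, reject_vec: Vector) -> float:
    """Assumed scoring rule: similarity to a 'trust' direction minus
    similarity to a 'distrust' direction. Positive means the draft looks trustworthy."""
    return cosine(draft_vec, accept_vec) - cosine(draft_vec, reject_vec)


def answer_draft_first(
    question: str,
    draft_fn: Callable[[str], str],             # hypothetical: model call returning a draft answer
    embed_fn: Callable[[str, str], Vector],     # hypothetical: continuous embedding of (question, draft)
    think_fn: Callable[[str, str], str],        # hypothetical: thinking pass conditioned on the draft
    accept_vec: Vector,
    reject_vec: Vector,
    threshold: float = 0.0,
) -> Tuple[str, bool]:
    """Return (final_answer, used_thinking)."""
    # Step 1: answer first, with no reasoning tokens spent yet.
    draft = draft_fn(question)

    # Step 2: score the draft with the contrastive verifier.
    score = contrastive_score(embed_fn(question, draft), accept_vec, reject_vec)

    # Step 3 (assumption): only pay for thinking when the draft looks untrustworthy.
    if score >= threshold:
        return draft, False
    return think_fn(question, draft), True
```

In this sketch the token savings come from step 3: a draft that the verifier trusts is returned as-is, and the reflection pass runs only on drafts that score below the threshold.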
