Neuro-Symbolic Framework Teaches Transformers to Cube for SAT
Researchers have introduced a neuro-symbolic post-training framework that teaches transformer-based models cubing heuristics for Boolean Satisfiability (SAT) problems. It is reported as the first demonstration that such models can learn effective cubing strategies. The framework includes a data curation pipeline based on Monte Carlo tree search (MCTS) that uses symbolic heuristics to guide its decisions on SAT competition formulas. The pipeline produces preference data grounded in solver statistics and augmented with reasoning traces from a teacher model. After two-stage post-training, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), a 4-billion-parameter model achieved a pass@5 score of 53 on 100 SAT competition benchmarks, surpassing Claude-Sonnet-4 (50) and matching the best symbolic heuristic (53). The reported breakdown shows that SFT alone raised pass@5 from 46 to 51, and DPO added 2 more solved benchmarks.
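For readers unfamiliar with cubing: the idea is to split a hard formula into smaller sub-formulas by fixing the values of a few chosen variables, so that each piece can be handed to a solver separately. The Python sketch below illustrates only that general mechanism. The function names and the "most frequent variable" rule are assumptions made for illustration; they are not the heuristic learned by the model, nor any of the symbolic heuristics used in the framework.

```python
from collections import Counter
from typing import List, Optional, Tuple

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def assign(clauses: List[Clause], literal: int) -> Optional[List[Clause]]:
    """Simplify the formula under `literal = True`; None signals a conflict."""
    simplified = []
    for clause in clauses:
        if literal in clause:
            continue  # clause is already satisfied, so drop it
        reduced = [lit for lit in clause if lit != -literal]
        if not reduced:
            return None  # empty clause: this branch is unsatisfiable
        simplified.append(reduced)
    return simplified


def pick_variable(clauses: List[Clause]) -> int:
    """Hypothetical stand-in heuristic: the most frequently occurring variable."""
    counts = Counter(abs(lit) for clause in clauses for lit in clause)
    return counts.most_common(1)[0][0]


def make_cubes(
    clauses: List[Clause], depth: int, cube: Tuple[int, ...] = ()
) -> List[Tuple[Tuple[int, ...], List[Clause]]]:
    """Return (cube, sub-formula) pairs after splitting up to `depth` times."""
    if depth == 0 or not clauses:
        return [(cube, clauses)]
    var = pick_variable(clauses)
    result = []
    for literal in (var, -var):
        sub = assign(clauses, literal)
        if sub is not None:  # refuted branches need no further work
            result.extend(make_cubes(sub, depth - 1, cube + (literal,)))
    return result


# Toy formula: (x1 or x2) and (not x1 or x3) and (x1 or not x3) and (not x1 or not x2)
formula = [[1, 2], [-1, 3], [1, -3], [-1, -2]]
for cube, sub_formula in make_cubes(formula, depth=1):
    print(cube, sub_formula)
```

The quality of the split depends entirely on which variable `pick_variable` returns, which is the decision a cubing heuristic, learned or symbolic, is responsible for.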
Key facts
- First demonstration that transformer-based models can learn effective cubing heuristics for SAT.
- Neuro-symbolic post-training framework introduced.
- MCTS-based data curation pipeline uses symbolic heuristics.
- Preference data grounded in solver statistics and augmented with reasoning traces.
- Two-stage post-training: SFT followed by DPO.
- 4B-parameter model achieves pass@5 of 53 on 100 SAT competition benchmarks.
- Surpasses Claude-Sonnet-4 (50) and matches the best symbolic heuristic (53).
- SFT alone improves pass@5 from 46 to 51; DPO adds 2 more solved benchmarks, for a final 53.
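The DPO stage mentioned above trains on pairs of a preferred and a dispreferred response. As a minimal sketch, the standard DPO loss for a single pair can be computed as follows. The summary does not state the hyperparameters or how pairs are derived from solver statistics, so `beta=0.1` and the log-probability inputs here are placeholder assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not given in the summary
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is lower when the policy favors the chosen response more than
# the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

In the setting described here, the chosen and rejected responses would be alternative cubing decisions for the same formula, with the preference label coming from the curated data.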