Spurious Rewards: Rethinking Training Signals in RLVR

*Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer*

*Equal Contribution

Paper: https://arxiv.org/abs/2506.10947

Github: https://github.com/ruixin31/Rethink_RLVR

<aside> 💡

TL;DR

We show that you can do RLVR on Qwen2.5-Math models with completely random or incorrect rewards, and still get massive math benchmark gains.

All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR training Qwen2.5-Math-7B:

RLVR + format reward (reward responses with \\boxed{}) 🔲: +13.8%
RLVR + incorrect reward (only incorrect answers rewarded) 😈: +24.1%
RLVR + random reward 🎲: +21.4%
(as a reference) RLVR + ground-truth reward ✅: + 29.1%

🤯 How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?

Find the answers in our analysis below 👇

</aside>

Questioning Conventional Wisdom on RLVR

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard approach for enhancing reasoning capabilities in large language models (LLMs). The conventional wisdom suggests that high-quality supervision signals are essential for effective RLVR training. Recent works challenges this assumption, showing training on one example or unsupervised examples with RLVR can still lead to significant gains on Qwen-Math models.

Therefore, we wonder—where do the training signals in 1-shot or unsupervised RLVR come from? What is a minimum requirement for rewards to provide meaningful RLVR training signals?

Our findings shocked us.

Spurious Rewards, Even Random or Incorrect, Can Also Significantly Boosts Qwen-Math Performance

We discovered that RLVR can produce substantial improvements in mathematical reasoning using what we call "spurious rewards"—signals that provide minimal or even misleading guidance.

Here are a few fun rewards that we played with:

Format rewards: Rewarding responses simply for containing a \boxed{ } answer
Random rewards: Completely arbitrary feedback
Incorrect rewards: Purposely wrong supervision signals

We also compare with a few more weak rewards that have be studied in the literature:

Majority vote rewards: Take the majority-voted answer as the label.
One-shot RL: Conduct standard RLVR on a single example.