The Spurious Rewards team.
🎉 We are happy to share that the arXiv version of our Spurious Rewards paper is now available! We have added several interesting new findings beyond those covered in our previous blog post. Check it out here:
[Paper] [Code] [Models] [Wandb]
Figure. The spurious prompt we found that yields large gains on Qwen2.5-Math-7B (e.g., a 19.4% improvement on MATH-500 over the Qwen default prompt). “{}” is a placeholder for the evaluation question.
In our previous blog post, we discussed how Qwen models can easily benefit from Spurious Rewards, an effect that does not generalize to other models. We also presented our hypothesis and analysis of the training signals induced by spurious rewards. Our results advocate for rethinking the training signals in RLVR and for caution about claims of model improvements.
In the full preprint, we expand the experiments from our paper and show that our results are robust across multiple prompts as well. Additionally, we identify another spurious model behavior, which we name Spurious Prompts. We detail the new results and findings below.
<aside> 📎
TL;DR. We find that Qwen2.5-Math-7B's performance can be boosted out of the box by 19.4% simply by using the LaTeX placeholder text generated by \lipsum as the evaluation prompt, a setup we name Spurious Prompts. This prompt outperforms all existing prompts we tested, including Qwen's default prompt, SimpleRL-zoo, and Sober. Is it a feature or a bug? 👀
</aside>
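To make the setup concrete, here is a minimal sketch of how a question could be formatted into such a spurious prompt at evaluation time. The filler paragraph below is the standard lorem-ipsum text that the LaTeX \lipsum command produces, and the template is a simplified stand-in for the one shown in the figure above, so treat this snippet as illustrative rather than as our released evaluation code.

```python
# Illustrative sketch only -- not the released evaluation code.
# A "spurious prompt" built from lorem-ipsum filler text, with "{}" as the
# placeholder for the evaluation question (as in the figure above).

# Opening of the placeholder text produced by the LaTeX \lipsum command (truncated here).
LIPSUM = (
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. "
    "Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis."
)

# Hypothetical template: filler text followed by the question placeholder.
SPURIOUS_PROMPT_TEMPLATE = LIPSUM + "\n\n{}"


def build_eval_prompt(question: str) -> str:
    """Insert a benchmark question into the spurious prompt template."""
    return SPURIOUS_PROMPT_TEMPLATE.format(question)


if __name__ == "__main__":
    print(build_eval_prompt("What is the sum of the first 100 positive integers?"))
```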
It has been known for a while that LLM evaluation can be sensitive to the evaluation prompt (Sclar et al., 2023). This sensitivity causes practical problems: for example, different papers may report different baseline numbers, making absolute numbers less comparable across works. We therefore further investigate models' sensitivity to prompts and discuss its impact on RL findings.
In fact, we briefly touched on this topic in Section 4.3 of our original paper, showing that simply adding the sentence "Let's solve this using Python." to the user prompt yields gains of 24.2% and 15.0% on Qwen2.5-Math-1.5B and Qwen2.5-Math-7B, respectively.
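As a rough sketch of that earlier experiment (the chat template and system prompt are omitted, so the surrounding setup here is an assumption rather than a verbatim reproduction), the modification amounts to appending a single sentence to the user message:

```python
# Illustrative only: append the code-hinting sentence from Section 4.3
# of the original paper to an existing user prompt before evaluation.
def add_python_hint(user_prompt: str) -> str:
    return user_prompt.rstrip() + " Let's solve this using Python."


print(add_python_hint("Find the remainder when 2^10 is divided by 7."))
```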
In this blog post, we consider three existing prompts used in the literature: