Rate or Fate?
Does noisy reward change the rate of learning or its fate? We analyze when noise is a usable signal that merely slows progress, and when it becomes harmful and pushes learning in the wrong direction.
RLVR is simple but powerful: sample an answer, verify it, and update the model. In practice, however, the verifier is almost never clean: unit tests probe only finitely many corner cases, human and synthetic labels are imperfect, and LLM judges can be biased and prone to reward hacking. This noise only worsens as we push into harder domains.
We develop an analytical framework to study the evolution of algorithms such as GRPO under general noise levels.
Modeling each prompt as a multi-armed bandit over recurring reasoning modes, we derive a tractable
probability-simplex flow with a sharp noise threshold. The dynamics decouple into
inner competition among correct modes and an outer mean-field ODE for the total bad-mode mass \(p(t)\),
whose drift depends only on Youden's index \(J = \text{TPR} - \text{FPR}\).
A single scalar, Youden's \(J\), controls whether RLVR learns or anti-learns; the transition at \(J = 0\) is sharp and predictable.
When \(J > 0\), noise primarily affects the rate of convergence, not the ultimate outcome: learning still succeeds, just more slowly.
We model each prompt as a bandit over coarse-grained "reasoning modes," enabling tractable analysis of the full learning dynamics.
KL regularization can rescue learning even when \(J < 0\), preventing complete collapse by anchoring to the reference policy.
The dynamics live on the probability simplex with Shahshahani geometry, yielding replicator-style natural selection dynamics.
The core insight is that RLVR dynamics on the probability simplex exhibit a phase transition governed by the verifier's discriminative power. The total mass \(p(t)\) on incorrect ("bad") reasoning modes evolves according to a drift of the form \[ \dot p(t) = -J\,p(t)\bigl(1 - p(t)\bigr). \]
This is a mean-field ODE where the sign of \(J\) determines the direction of flow. When \(J > 0\), bad mass shrinks exponentially; when \(J < 0\), it grows to dominate.
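This behavior is easy to sanity-check numerically. A minimal sketch, assuming the drift takes the logistic form \(\dot p = -J\,p(1-p)\) (the simplest form consistent with the exponential shrink and growth described here):

```python
# Forward-Euler integration of the assumed logistic mean-field ODE
#   dp/dt = -J * p * (1 - p),  with J = TPR - FPR (Youden's index).
# The sign of J, not its magnitude, decides the fate of the bad mass p.

def simulate_bad_mass(J, p0=0.5, dt=0.01, steps=2000):
    """Integrate dp/dt = -J * p * (1 - p) from p(0) = p0."""
    p = p0
    for _ in range(steps):
        p += dt * (-J) * p * (1 - p)
    return p

for J in (0.2, 0.0, -0.2):
    print(f"J = {J:+.1f}: p(0) = 0.5 -> p(T) = {simulate_bad_mass(J):.4f}")
```

With \(J = +0.2\) the bad mass decays toward 0, with \(J = -0.2\) it grows toward 1, and at the threshold \(J = 0\) it does not move at all.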
We coarse-grain the LLM's response distribution into discrete "reasoning modes"—recurring patterns of problem-solving strategies. This transforms the intractable token-level dynamics into a multi-armed bandit where each arm represents a mode.
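The mode-level picture can be sketched as a toy simulation. Everything below is a simplification we introduce for illustration, not the paper's exact setup: a softmax policy over a handful of modes, a verifier that marks good modes correct with probability TPR and bad modes with probability FPR, and a GRPO-style update that subtracts the group-mean reward:

```python
import math
import random

# Toy mode-level bandit (our own simplification): modes 0..n_good-1 are
# "good", the rest "bad". A noisy verifier rewards good picks with
# probability TPR and bad picks with probability FPR; each group's mean
# reward serves as the baseline, as in GRPO's group normalization.

def run(tpr, fpr, n_good=2, n_bad=2, group=8, steps=4000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * (n_good + n_bad)
    for _ in range(steps):
        weights = [math.exp(l) for l in logits]
        picks = rng.choices(range(len(logits)), weights, k=group)
        rewards = [1.0 if rng.random() < (tpr if i < n_good else fpr) else 0.0
                   for i in picks]
        baseline = sum(rewards) / group          # group-mean baseline
        for i, r in zip(picks, rewards):
            logits[i] += lr * (r - baseline)     # advantage update
    weights = [math.exp(l) for l in logits]
    return sum(weights[:n_good]) / sum(weights)  # final good-mode mass

print("J = +0.4:", run(tpr=0.8, fpr=0.4))  # good mass grows
print("J = -0.4:", run(tpr=0.4, fpr=0.8))  # good mass collapses
```

Flipping TPR and FPR flips the sign of \(J\) and reverses the outcome, mirroring the learn/anti-learn transition.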
Even when \(J > 0\) and learning succeeds, GRPO exhibits winner-take-all dynamics among correct reasoning modes. The separatrices—basin boundaries shown as dashed lines—divide the simplex into regions of attraction. Whichever good mode has a slight initial advantage captures all the probability mass.
This is a feature, not a bug: GRPO amplifies small fitness differences into decisive outcomes. But it also means that diversity among correct solutions collapses—the model converges to a single "winning" strategy even when multiple valid approaches exist.
This winner-take-all behavior is governed by the inner dynamics on the good-arm subspace. The competition follows replicator dynamics where fitness is proportional to probability, creating a rich-get-richer effect that eliminates diversity over time.
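A quick way to see the rich-get-richer effect is to integrate replicator dynamics with fitness equal to probability, \(f_i = p_i\); a sketch under that assumption, where the mode with a slight initial edge absorbs all the mass:

```python
# Inner competition among good modes under replicator dynamics with
# fitness proportional to probability (f_i = p_i), i.e.
#   dp_i/dt = p_i * (p_i - ||p||^2),  where ||p||^2 is the mean fitness.

def inner_replicator(p, dt=0.01, steps=10000):
    for _ in range(steps):
        mean_fitness = sum(pi * pi for pi in p)
        p = [pi + dt * pi * (pi - mean_fitness) for pi in p]
        total = sum(p)
        p = [pi / total for pi in p]  # renormalize Euler drift back to simplex
    return p

# A slight initial edge (0.40 vs 0.35 vs 0.25) captures all the mass:
print(inner_replicator([0.40, 0.35, 0.25]))
```

The basin boundaries of this flow are exactly the separatrices described above: whichever side of them the initial condition falls on determines the winner.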
The natural metric for probability distributions is the Shahshahani metric (Fisher-Rao metric), under which the GRPO update becomes a natural gradient flow of replicator form: \[ \dot p_i = p_i\Bigl(f_i(p) - \sum_j p_j\,f_j(p)\Bigr), \] where \(f_i\) is the fitness of mode \(i\).
This geometry explains key phenomena: low-probability modes are "far away" and hard to eliminate, geodesics curve away from boundaries, and the dynamics naturally preserve the simplex structure.
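The identity between the Shahshahani natural gradient and the replicator field can be checked on a small example: scaling the Euclidean gradient of expected fitness \(F(p) = \sum_i p_i f_i\) by the inverse metric \(\mathrm{diag}(p)\) and projecting onto the simplex tangent space reproduces \(p_i(f_i - \bar f)\) term by term:

```python
# Numerical check: the Shahshahani natural gradient of expected fitness
# F(p) = sum_i p_i * f_i, restricted to the simplex, equals the
# replicator field p_i * (f_i - mean fitness).
p = [0.5, 0.3, 0.2]
f = [1.0, 0.4, 0.2]

mean_fitness = sum(pi * fi for pi, fi in zip(p, f))
replicator = [pi * (fi - mean_fitness) for pi, fi in zip(p, f)]

# Natural gradient: inverse Shahshahani metric diag(p) times the Euclidean
# gradient (which is just f), then projected so the components sum to zero.
raw = [pi * fi for pi, fi in zip(p, f)]
total = sum(raw)
natural = [ri - pi * total for ri, pi in zip(raw, p)]

print(replicator)
print(natural)
```

The two vector fields agree exactly, which is why the flow stays on the simplex without any explicit constraint.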
When the verifier is adversarial (\(J < 0\)), unregularized RLVR collapses to bad modes. However, KL regularization to a reference policy can rescue learning by creating interior fixed points that prevent complete collapse.
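One way the rescue can be sketched: assume the KL term enters the two-type drift as a log-odds penalty toward a reference distribution with bad mass \(q\) (our reconstruction for illustration, not necessarily the paper's exact form). For \(J < 0\), the penalty balances the adversarial drift at an interior fixed point:

```python
import math

# Two-type drift with a KL-style log-odds penalty toward a reference policy
# whose bad mass is q (an assumed form of the regularized dynamics):
#   dp/dt = p(1-p) * ( -J - beta * [log(p/(1-p)) - log(q/(1-q))] )

def bad_mass_with_kl(J, beta, q=0.1, p0=0.2, dt=0.01, steps=20000):
    p = p0
    for _ in range(steps):
        penalty = beta * (math.log(p / (1 - p)) - math.log(q / (1 - q)))
        p += dt * p * (1 - p) * (-J - penalty)
        p = min(max(p, 1e-9), 1 - 1e-9)  # stay inside the open simplex
    return p

print("no KL  :", bad_mass_with_kl(J=-0.3, beta=0.0))  # collapses toward 1
print("with KL:", bad_mass_with_kl(J=-0.3, beta=0.5))  # interior fixed point
```

Without regularization the adversarial verifier drives the bad mass to 1; with \(\beta > 0\) it settles at an interior equilibrium near the reference, preventing complete collapse.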
We validate our theoretical predictions with controlled GRPO fine-tuning experiments on programmatically verified coding tasks, injecting synthetic label noise at known false-positive and false-negative rates.
The experiments demonstrate the "rate, not fate" phenomenon: even substantial noise (up to 40% combined error rate) still allows learning when \(J\) remains positive, albeit at reduced speed.
A surprising finding: learning proceeds fastest when the bad mass is at intermediate values, not when starting from a near-correct policy. This is because group normalization creates the strongest selection pressure when good and bad modes are balanced.
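This balance effect matches simple algebra: if the drift on the bad mass takes the logistic form \(\dot p = -J\,p(1-p)\) (an assumption consistent with the dynamics described above), the selection pressure \(|J|\,p(1-p)\) is largest at \(p = 1/2\):

```python
# If the drift on the bad mass is logistic, dp/dt = -J * p * (1 - p),
# the selection pressure |J| * p * (1 - p) peaks at p = 1/2.
J = 0.3
pressure = [(p / 10, J * (p / 10) * (1 - p / 10)) for p in range(1, 10)]
best = max(pressure, key=lambda t: t[1])
print(best)  # -> (0.5, 0.075)
```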
@article{anonymous2025rateorfate,
  title   = {Rate or Fate? RLV$^\epsilon$R: Reinforcement Learning with Verifiable Noisy Rewards},
  author  = {Anonymous},
  journal = {arXiv preprint},
  year    = {2025}
}