Research Paper

Rate or Fate?

RLV^εR: Reinforcement Learning
with Verifiable Noisy Rewards

Does noisy reward change the rate of learning or its fate? We analyze when noise is a usable signal that merely slows progress, and when it becomes harmful and pushes learning in the wrong direction.

Ali Rad^1,✉ Khashayar Filom¹ Darioush Keivan¹ Peyman Mohajerin Esfahani² Ehsan Kamalinejad¹

¹Cognichip AI ²University of Toronto

✉ Corresponding: ali@cognichip.ai LinkedIn

Paper arXiv GitHub

Phase transition in RLVR under noisy rewards. The learning dynamics depend critically on the verifier's quality, characterized by Youden's index \(J = \text{TPR} - \text{FPR}\). This single scalar determines whether training improves or degrades the model.

RLVR is simple but powerful: sample an answer, verify it, and update the model. But in practice the verifier is almost never clean-unit tests probe only finitely many corner cases, human/synthetic labels are imperfect, and LLM judges can be biased and prone to reward hacking- and this noise only worsens as we push to harder domains.

We leverage a novel analytical framework to study the evolution of algorithms like GRPO under general noise levels. Modeling each prompt as a multi-armed bandit over recurring reasoning modes, we derive a tractable probability-simplex flow with a sharp noise threshold. The dynamics decouple into inner competition among correct modes and an outer mean-field ODE for the total bad-mode mass \(p(t)\), whose drift depends only on Youden's index \(J = \text{TPR} - \text{FPR}\).

The Critical Threshold

\[ J = \text{TPR} - \text{FPR} = (1 - \text{FN}) - \text{FP} \]

Youden's \(J\) index: a single scalar that determines the fate of learning

J > 0

✓ Learning

Bad mass \(p \to 0\)
Accuracy increases

J = 0

○ Neutral Drift

No net learning
Random walk dynamics

J < 0

✗ Anti-Learning

Bad mass \(p \to 1\)
Accuracy collapses

Key Takeaways

Sharp Phase Transition

A single scalar—Youden's \(J = \text{TPR} - \text{FPR}\)—controls whether RLVR learns or anti-learns. The transition at \(J=0\) is sharp and predictable.

Rate, Not Fate

When \(J > 0\), noise primarily affects the rate of convergence, not the ultimate outcome. Learning still succeeds—just slower.

LLM as Multi-Armed Bandit

We model each prompt as a bandit over coarse-grained "reasoning modes," enabling tractable analysis of the full learning dynamics.

KL Rescue

KL regularization can rescue learning even when \(J < 0\), preventing complete collapse by anchoring to the reference policy.

Simplex Geometry

The dynamics live on the probability simplex with Shahshahani geometry, yielding replicator-style natural selection dynamics.

Phase Transition Dynamics

The core insight is that RLVR dynamics on the probability simplex exhibit a phase transition governed by the verifier's discriminative power. The total mass on incorrect ("bad") reasoning modes evolves according to:

\[ \frac{dp}{dt} =- \eta J \cdot p^{3/2}(1-p)^{3/2} \]

This is a mean-field ODE where the sign of \(J\) determines the direction of flow. When \(J > 0\), bad mass shrinks exponentially; when \(J < 0\), it grows to dominate.

Phase transition animation — **Sweeping through Youden's index.** Watch how the dynamics change as \(J\) varies from negative (anti-learning) to positive (learning).

Phase transition heatmaps — **Phase transition across noise regimes.** Each panel shows how the equilibrium accuracy depends on false positive (FP) and false negative (FN) rates. The critical boundary \(J = 0\) (where \(\text{FP} + \text{FN} = 1\)) separates learning from anti-learning.

LLM as Multi-Armed Bandit

We coarse-grain the LLM's response distribution into discrete "reasoning modes"—recurring patterns of problem-solving strategies. This transforms the intractable token-level dynamics into a multi-armed bandit where each arm represents a mode.

Good arms: Modes that produce correct answers (true positives)
Bad arms: Modes that produce incorrect answers (false positives under noise)
Selection pressure: GRPO's group normalization creates relative fitness competition
Mutation: Sampling noise allows exploration between modes

Flowing particles on simplex — **Particle flow on the simplex.** Watch how probability mass flows between modes under GRPO dynamics.

Three scenarios on the simplex — **Simplex trajectories under different regimes.** The probability simplex shows how mass flows between good and bad modes under different verifier quality settings.

Diversity Collapse Among Good Arms

Even when \(J > 0\) and learning succeeds, GRPO exhibits winner-take-all dynamics among correct reasoning modes. The separatrices—basin boundaries shown as dashed lines—divide the simplex into regions of attraction. Whichever good mode has a slight initial advantage captures all the probability mass.

This is a feature, not a bug: GRPO amplifies small fitness differences into decisive outcomes. But it also means that diversity among correct solutions collapses—the model converges to a single "winning" strategy even when multiple valid approaches exist.

Separatrices and basin structure — **Basin boundaries determine the winner.** The glowing dashed lines are separatrices—trajectories starting in different regions converge to different vertices. Initial conditions determine which good arm dominates.

This winner-take-all behavior is governed by the inner dynamics on the good-arm subspace. The competition follows replicator dynamics where fitness is proportional to probability, creating a rich-get-richer effect that eliminates diversity over time.

Shahshahani Geometry

The natural metric for probability distributions is the Shahshahani metric (Fisher-Rao metric), under which the GRPO update becomes a natural gradient flow:

\[ g_{ij}^{\text{Shah}}(\mathbf{p}) = \frac{\delta_{ij}}{p_i} \]

This geometry explains key phenomena: low-probability modes are "far away" and hard to eliminate, geodesics curve away from boundaries, and the dynamics naturally preserve the simplex structure.

Mesh morphing animation — **Metric structure visualization.** The grid lines show how the Shahshahani metric warps near the simplex boundaries.

Shahshahani geometry visualization — **Shahshahani geometry on the simplex.** The metric structure shows how distances diverge near the boundary, explaining why rare modes are resilient and hard to eliminate through gradient updates.

KL Regularization Rescue

When the verifier is adversarial (\(J < 0\)), unregularized RLVR collapses to bad modes. However, KL regularization to a reference policy can rescue learning by creating interior fixed points that prevent complete collapse.

KL regularization phase diagram — **KL regularization prevents collapse.** The phase diagram shows how increasing the KL penalty \(\beta\) shifts the equilibrium away from the degenerate vertices, preserving diversity even when \(J < 0\).

Experimental Validation

We validate our theoretical predictions with controlled GRPO fine-tuning experiments on programmatically verified coding tasks, injecting synthetic noise at controlled FP/FN rates.

Experimental results — **Experimental validation of the phase transition.** Controlled experiments confirm the predicted \(J=0\) transition. When \(J > 0\) (below the diagonal), accuracy improves; when \(J < 0\) (above the diagonal), accuracy degrades—matching theory.

The experiments demonstrate the "rate, not fate" phenomenon: even substantial noise (up to 40% combined error rate) still allows learning when \(J\) remains positive, albeit at reduced speed.

Maximal Learnability

A surprising finding: the learning rate is maximized when the bad mass is at intermediate values, not when starting from near-correct policies. This is because group normalization creates the strongest selection pressure when good and bad modes are balanced.

Learnability analysis — **Maximal learnability at intermediate bad mass.** The derivative \(|dp/dt|\) peaks around \(p \approx 0.5\), suggesting that starting from a moderately capable policy may actually accelerate learning compared to starting from a very weak or very strong policy.

Citation

@article{anonymous2025rateorfate,
    title={Rate or Fate? RLV$^\epsilon$R: Reinforcement Learning 
           with Verifiable Noisy Rewards},
    author={Anonymous},
    journal={arXiv preprint},
    year={2025}
}

RLVεR: Reinforcement Learningwith Verifiable Noisy Rewards

Key Takeaways

Sharp Phase Transition

Rate, Not Fate

LLM as Multi-Armed Bandit

KL Rescue

Simplex Geometry

Phase Transition Dynamics

LLM as Multi-Armed Bandit

Diversity Collapse Among Good Arms

Shahshahani Geometry

KL Regularization Rescue

Experimental Validation

Maximal Learnability

Citation

RLV^εR: Reinforcement Learning
with Verifiable Noisy Rewards