GLM-Z1-9B-0414

model · active · provisional · glm-z1-9b-0414-479e4967 · 1 event · first seen 2d ago

Aliases: GLM-Z1-9B-0414

Recent events (1)

6 · arXiv · cs.LG · 2d ago

RiVER framework enables RL training of LLMs on tasks without ground-truth solutions

Researchers introduce RiVER (Ranking-induced VERifiable framework), a reinforcement learning approach that trains LLMs on score-based optimization tasks. It uses deterministic execution feedback as a continuous reward and requires no ground-truth answers. The method addresses two failure modes of group-relative RL with continuous rewards, scale dominance and frequency dominance, through calibrated, instance-wise reward shaping. Applied to Qwen3-8B and GLM-Z1-9B-0414 on competitive programming tasks, RiVER improves ALE rating rank by ~9%. It also transfers to exact-solution benchmarks (LiveCodeBench, USACO) with absolute gains of 2-4%, unlike raw-score baselines. The result suggests that score-based heuristic tasks can serve as general-purpose RL training environments for coding ability.
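
To make the group-relative setting concrete, here is a minimal Python sketch. It is not the RiVER method, and the summary above does not specify how the paper defines or fixes scale dominance and frequency dominance. The sketch only shows that standardizing raw continuous scores within a group lets one outlier dominate the advantages, and that a simple rank transform (an assumption swapped in for illustration) removes the dependence on score scale. The function names `group_relative_advantages` and `rank_rewards` and the sample scores are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standardize rewards within one group of samples for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def rank_rewards(scores):
    """Map raw scores to [0, 1] by average rank within the group.

    Illustrative assumption only; tied scores share the same average rank.
    """
    n = len(scores)
    if n == 1:
        return [0.5]
    shaped = []
    for s in scores:
        below = sum(1 for t in scores if t < s)
        equal = sum(1 for t in scores if t == s)
        avg_rank = below + (equal - 1) / 2
        shaped.append(avg_rank / (n - 1))
    return shaped


# Hypothetical execution scores for four sampled programs on one task.
raw_scores = [10.0, 12.0, 11.0, 5000.0]

raw_adv = group_relative_advantages(raw_scores)
rank_adv = group_relative_advantages(rank_rewards(raw_scores))

print("raw :", [round(a, 3) for a in raw_adv])
print("rank:", [round(a, 3) for a in rank_adv])
```

With raw scores, the three ordinary samples receive almost identical advantages because the outlier sets the scale of the group. After the rank transform, the ordering among all four samples is preserved with even spacing.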