Almanac
model

Qwen2.5-Math-PRM

modelactiveqwen2-5-math-prm-0e54452c·5 events·first seen 1mo ago

Aliases: Qwen2.5-Math-PRM, Qwen2.5-Math, Qwen2-Math, Qwen2.5-Math-1.5B

Co-occurring entities

More like this (12)

Recent events (5)

7Qwen Research·1mo ago·source ↗

Qwen2.5-Math: Open-Source Mathematical LLM Series Released

Alibaba's Qwen team has released Qwen2.5-Math, an upgraded series of open-source mathematical LLMs including base and instruction-tuned models at 1.5B, 7B, and 72B parameter scales, plus a mathematical reward model. The models support Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR) for English and Chinese math problem solving. This follows the Qwen2-Math release approximately one month prior and is claimed to be the leading open-source mathematical LLM series.

6Qwen Research·1mo ago·source ↗

Introducing Qwen2-Math: Math-Specialized LLMs from Alibaba's Qwen Team

Alibaba's Qwen team has released Qwen2-Math and Qwen2-Math-Instruct, a series of math-specialized large language models built on the Qwen2 architecture. The models are designed to enhance arithmetic and mathematical reasoning capabilities in LLMs. The initial release supports English only, with bilingual English/Chinese versions announced as forthcoming.

6Qwen Research·1mo ago·source ↗

Qwen2.5-Math Process Reward Model for Mathematical Reasoning Supervision

Alibaba's Qwen team introduces a process reward model (PRM) aimed at improving the reliability of mathematical reasoning in LLMs by supervising intermediate reasoning steps rather than only final answers. The work addresses the problem of models producing plausible but flawed intermediate derivations even when reaching correct conclusions. The release includes model weights on HuggingFace and ModelScope alongside a GitHub repository.

6arXiv · cs.CL·21d ago·source ↗

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs) — a mechanistic interpretability tool — to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

7arXiv · cs.CL·26d ago·source ↗

RELEX: Extrapolating LLM RLVR Training via Rank-1 Parameter Trajectories

This paper demonstrates that RLVR weight update trajectories are extremely low-rank and near-linearly predictable, with a rank-1 approximation capturing most downstream performance gains. The authors propose RELEX, a compute-efficient method that observes a short training window, estimates the rank-1 subspace, and extrapolates future checkpoints via linear regression—requiring no additional training. Evaluated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX matches or exceeds full RLVR performance using as few as 15% of training steps, and can extrapolate up to 10–20× beyond the observed prefix. The authors attribute the method's effectiveness to a denoising effect from rank-1 projection that discards stochastic optimization noise.