Entity · model

Qwen3-1.7B

model · active · qwen3-1-7b-5b2ffb2a · 8 events · first seen May 19, 2026

Aliases: Qwen3-1.7B

More like this (12)

Qwen3-1.7B-Base · Qwen3.5-0.8B · Qwen-0.5B · Qwen1.5-110B · Qwen2.5-1.5B · Qwen1.5-MoE-A2.7B · Qwen1.5-7B · Qwen2.5-0.5B · Qwen1.5-72B · Qwen 2.5-7B · Qwen3-4B · Qwen 3.7

Recent events (8)

5 · arXiv · cs.AI · 3d ago

Relay-OPD addresses prefix failure in on-policy distillation via teacher-student trajectory handoff

Researchers introduce Relay On-Policy Distillation (Relay-OPD), a training method that addresses 'prefix failure' in on-policy knowledge distillation, where a student model's early reasoning errors compound throughout a trajectory. The approach detects points where teacher and student continuations diverge asymmetrically, then briefly hands generation to the teacher to produce a corrective 'relay leg' before the student resumes. Evaluated on eight mathematical reasoning benchmarks using Qwen3-4B-Instruct-2507 as teacher and Qwen3-0.6B/1.7B as students, Relay-OPD outperforms standard OPD by +5.73% and the strongest baseline, FastOPD, by +1.49% on average for the 1.7B model, while also reducing training trajectory length by over 50%.

Open Weights Progress · Alignment and RLHF · on-policy distillation · Qwen3.5-0.8B · Pass the Baton: Trajectory-Relayed On-Policy Distillation · +3 more

6 · arXiv · cs.CL · Jul 9, 2026

AdaPrefix-GRPO: Adaptive prefix control doubles GRPO accuracy on hard math reasoning

A new arXiv preprint introduces AdaPrefix-GRPO, a method that addresses GRPO's failure to learn from problems where no rollout succeeds. It prepends correct solution prefixes and dynamically adjusts prefix length to maintain a ~50% success rate per problem throughout training. The prefix assistance is gradually withdrawn so the final model solves problems unaided. On hard math benchmarks, the method more than doubles GRPO accuracy for a 0.6B model (2.1x) and achieves a 1.7x improvement on AIME while halving trace length, with larger gains on smaller models. The implementation requires only data preparation changes and a loss mask, leaving the trainer unchanged.

Evaluation and Benchmarking · Alignment and RLHF · AIME · Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems · Qwen3-1.7B · +2 more
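
The adaptive prefix control described in the AdaPrefix-GRPO summary above can be sketched as a small per-problem controller. This is a minimal illustration, not the paper's implementation: the function names, the fixed step size, and the up/down update rule are all assumptions, as is the choice to mask the supplied prefix tokens out of the loss. Only the ~50% target success rate and the existence of a loss mask come from the summary.

```python
def update_prefix_fraction(prefix_fraction, success_rate, target=0.5, step=0.1):
    """Hypothetical controller (assumed rule, not from the paper).

    Lengthens the solution prefix when rollouts succeed less often than the
    target, shortens it when they succeed more often, and clamps to [0, 1].
    """
    if success_rate < target:
        prefix_fraction += step
    elif success_rate > target:
        prefix_fraction -= step
    return min(1.0, max(0.0, prefix_fraction))


def build_loss_mask(prefix_len, total_len):
    """Return a per-token mask: 0 for supplied prefix tokens, 1 for the rest.

    Assumption: the loss mask excludes the prepended prefix so that only the
    model's own continuation contributes to the training loss.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    return [0] * prefix_len + [1] * (total_len - prefix_len)
```

Under this sketch, a problem whose rollouts all fail gets a longer prefix on the next step, and the prefix fraction drifts toward zero once the model succeeds reliably, which matches the "gradually withdrawn" behavior in the summary.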

6 · arXiv · cs.CL · Jul 7, 2026

Direct On-Policy Distillation transfers RL policy shifts from weak to strong models

Researchers propose Direct-OPD (Direct On-Policy Distillation), a method for transferring the policy shift induced by reinforcement learning on a small model to a larger target model, bypassing the need to run expensive RL rollouts on the stronger model. The approach uses the log-ratio between a post-RL teacher and its pre-RL reference as a dense implicit reward signal applied to the student's own on-policy states. Empirically, Direct-OPD improves Qwen3-1.7B from 48.3% to 62.4% on AIME 2024 in 4 hours on 8 A100 GPUs, outperforming step-matched direct RL. The method addresses a key scalability bottleneck in post-training as frontier models grow larger.

Training Infrastructure · Frontier Model Releases · on-policy distillation · AIME 2026 · Weak-to-Strong Generalization via Direct On-Policy Distillation · +5 more
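
The dense implicit reward mentioned in the Direct-OPD summary above, the log-ratio between a post-RL teacher and its pre-RL reference, can be illustrated per token. This is a sketch under stated assumptions: the function name is hypothetical, and the inputs are assumed to be the log-probabilities that each model assigns to the tokens the student actually sampled. How the paper uses these rewards in its training objective is not covered here.

```python
def implicit_rewards(teacher_logprobs, reference_logprobs):
    """Per-token implicit reward: log p_teacher(token) - log p_reference(token).

    Both inputs are assumed to be log-probabilities of the same student-sampled
    tokens, scored by the post-RL teacher and the pre-RL reference respectively.
    """
    if len(teacher_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    return [t - r for t, r in zip(teacher_logprobs, reference_logprobs)]
```

A positive value marks a token that RL made more likely in the teacher, and a negative value marks one it made less likely.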

6 · arXiv · cs.LG · Jun 30, 2026

High offline conservatism in DPO amplifies reward hacking during online adaptation, study finds

A new arXiv paper challenges the conventional wisdom that conservative offline training (via DPO with high β) provides a safer foundation for online RL adaptation. Experiments with Qwen3-14B show that higher offline conservatism monotonically increases reward hacking damage (Goodhart gap) during online adaptation, with Spearman ρ=1.0 across conditions. The mechanistic explanation is a three-link chain: high-β DPO compresses policy entropy, reducing response diversity and concentrating outputs in a narrow reward-model region, while paradoxically increasing ensemble disagreement that gets exploited during online optimization. The authors identify a practical optimal conservatism level β* and argue the field needs calibrated rather than maximal conservatism.

Evaluation and Benchmarking · AI Safety Research · Qwen3-14B · Direct Preference Optimization (DPO) · Qwen3-1.7B · +3 more
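
For readers unfamiliar with the β parameter discussed in the DPO summary above, the standard per-pair DPO loss is sketched below. The function and argument names are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities. The sketch shows only where β enters the loss. It does not reproduce the study's experimental setup.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin).

    margin is the policy's log-ratio advantage of the chosen response over the
    rejected one, measured relative to the reference model.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

A larger β scales the same margin more strongly, which penalizes deviation from the reference policy more heavily. This is the sense in which high-β training is called conservative.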

6 · arXiv · cs.AI · Jun 12, 2026

RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy

Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, suggesting that reasoning-aware retrieval is orthogonal to improvements in reward design and training curriculum.

Evaluation and Benchmarking · Alignment and RLHF · RA-RFT · Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning · GRPO · +3 more

6 · arXiv · cs.AI · May 28, 2026

Skill-Conditioned Gated Self-Distillation (SGSD) for LLM Reasoning

SGSD is a new on-policy self-distillation method for LLM reasoning that replaces trusted privileged information (e.g., reference answers) with an experience-derived skill bank of skill-mistake pairs. It constructs a multi-teacher pool, validates each teacher's contribution via a verifier, and applies a gated objective to distill informative disagreements while suppressing noisy signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and answer-conditioned OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. The method relaxes the assumption of trusted privileged information, making self-distillation more practical under weaker supervision.

Frontier Model Releases · Evaluation and Benchmarking · OPSD · AIME24 · SGSD · +7 more

5 · arXiv · cs.CL · May 21, 2026

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

Frontier Model Releases · Evaluation and Benchmarking · RLVR · ROUGE-L · AIME24 · +10 more
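
To make the contrast in the LamPO summary above concrete, the sketch below places a GRPO-style group-relative advantage next to a generic pairwise aggregation of reward gaps. It is an illustration only: `pairwise_advantages` and its optional `weights` matrix are hypothetical stand-ins, the use of the population standard deviation is an assumption, and the paper's confidence-aware weighting from log-probability differences is not reproduced.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: each reward standardized within its group.

    Assumption: population standard deviation with a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def pairwise_advantages(rewards, weights=None):
    """Hypothetical pairwise aggregation: average weighted reward gap between
    response i and every other response j in the same group.

    weights[i][j] is an optional per-pair weight; uniform weights are used
    when it is omitted.
    """
    n = len(rewards)
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue
            w = 1.0 if weights is None else weights[i][j]
            total += w * (rewards[i] - rewards[j])
        advantages.append(total / (n - 1) if n > 1 else 0.0)
    return advantages
```

With uniform weights the pairwise form is just a rescaled mean-centered reward, so any difference from the scalar form has to come from how individual comparisons are weighted.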

6 · arXiv · cs.CL · May 19, 2026

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

This paper introduces IH-GRPO, a reinforcement learning algorithm that decouples tool invocation from immediate execution during LLM reasoning, addressing the coherence disruption caused by tight coupling in existing tool-integrated reasoning (TIR) approaches. The authors propose a hierarchical control framework and derive a surrogate loss enabling an implicitly hierarchical policy to match the behavior of an explicit hierarchical policy. Experiments on Qwen3 models (1.7B, 4B, 8B) show absolute improvements of 1.87–2.53% across six out-of-domain mathematical reasoning benchmarks over the strongest baseline. Code is publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GRPO · Tool-Integrated Reasoning · Qwen3-4B · +3 more