Entity · model

Qwen3-8B-Base

model · active · provisional · qwen3-8b-base-82d7f8b0 · 4 events · first seen May 21, 2026

Aliases: Qwen3-8B-Base, Qwen3.5-0.8B-Base

Co-occurring entities

Qwen, Reinforcement Learning with Verifiable Rewards, AdamW, Isospectral Optimization, ISO: An RLVR-Native Optimization Stack, Muon, Hugging Face, DelTA, policy gradient, token credit assignment, Qwen3-14B-Base, RLVR, Qwen3-4B-Base, Qwen2.5-Math-PRM, rank-1 approximation, Wei Zhepei, Alibaba Qwen Team, RELEX

More like this (12)

Qwen3-4B-Base, Qwen3-14B-Base, Qwen3-30B-A3B-Base, Qwen3.5-2B-Base, Qwen3.5-35B-A3B-Base, Qwen3-4B, Qwen3-1.7B-Base, Qwen3-235B, Qwen3-30B-A3B, Qwen1.5-32B, Qwen1.5-72B, Qwen2.5-1.5B-Base

Recent events (4)

6 · arXiv · cs.LG · Jul 22, 2026

ISO: Isospectral Optimization framework for RLVR training efficiency and model merging

Researchers introduce Isospectral Optimization (ISO), a framework that exploits 'spectral inheritance' in RLVR-trained language models: the observation that reward-driven adaptation changes the singular frames while preserving the base model's weight spectra. ISO has two instantiations: ISO-Merger, a data-free method for combining specialist models without gradient updates or on-policy distillation, and ISO-Optimizer, which applies standard optimizers (AdamW, Muon) only to the frame variables and reaches equivalent accuracy in roughly 2.7x fewer training steps on Qwen3-8B-Base. The work targets the underexplored optimization layer between reward signals and weight updates in RLVR pipelines.

Frontier Model Releases, Alignment and RLHF, Qwen3-8B-Base, AdamW, Isospectral Optimization, +2 more
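The idea of an update that changes singular frames while leaving the spectrum untouched can be illustrated with a toy 2x2 weight matrix. This is only a sketch under stated assumptions, not the ISO paper's algorithm: the rotation-angle parameterisation and the helper names (`rotation`, `compose`, `singular_values`) are hypothetical, chosen so the example runs with the standard library alone.

```python
import math

def rotation(theta):
    """2x2 rotation matrix; stands in for one orthogonal 'frame' (assumption)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def compose(theta_u, theta_v, spectrum):
    """Build W = U diag(spectrum) V^T from two frame angles and fixed singular values."""
    diag = [[spectrum[0], 0.0], [0.0, spectrum[1]]]
    return matmul(matmul(rotation(theta_u), diag), transpose(rotation(theta_v)))

def singular_values(w):
    """Closed-form singular values of a 2x2 matrix.

    Uses s1^2 + s2^2 = ||W||_F^2 and s1 * s2 = |det W|.
    """
    frob = sum(x * x for row in w for x in row)
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    disc = math.sqrt(max(frob * frob - 4.0 * det * det, 0.0))
    return (math.sqrt((frob + disc) / 2.0),
            math.sqrt(max((frob - disc) / 2.0, 0.0)))

# "Base" weights, then an "adapted" copy in which only the frame angles moved.
spectrum = (3.0, 1.0)
base = compose(0.2, -0.5, spectrum)
adapted = compose(0.2 + 0.7, -0.5 - 0.3, spectrum)
```

In this toy setting an optimizer would update only `theta_u` and `theta_v`; the matrix entries change, but `singular_values(adapted)` matches `singular_values(base)` up to floating-point error.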

5 · Qwen · Jun 5, 2026

Qwen releases Qwen3.5-0.8B-Base multimodal model on Hugging Face

Qwen has released Qwen3.5-0.8B-Base, a small 0.8B parameter image-text-to-text base model on Hugging Face. The model supports conversational use and is compatible with Hugging Face endpoints. With nearly 200K downloads, it signals meaningful community uptake for a compact multimodal base model.

Open Weights Progress, Multimodal Progress, Qwen3-8B-Base, Qwen, Hugging Face

6 · arXiv · cs.CL · May 21, 2026

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.

Frontier Model Releases, Evaluation and Benchmarking, DelTA, Qwen3-8B-Base, policy gradient, +5 more

7 · arXiv · cs.CL · May 21, 2026

RELEX: Extrapolating LLM RLVR Training via Rank-1 Parameter Trajectories

This paper demonstrates that RLVR weight update trajectories are extremely low-rank and near-linearly predictable, with a rank-1 approximation capturing most downstream performance gains. The authors propose RELEX, a compute-efficient method that observes a short training window, estimates the rank-1 subspace, and extrapolates future checkpoints via linear regression—requiring no additional training. Evaluated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX matches or exceeds full RLVR performance using as few as 15% of training steps, and can extrapolate up to 10–20× beyond the observed prefix. The authors attribute the method's effectiveness to a denoising effect from rank-1 projection that discards stochastic optimization noise.

Training Infrastructure, Frontier Model Releases, RLVR, Qwen3-8B-Base, Qwen3-4B-Base, +8 more
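The observe, project, extrapolate recipe described in the RELEX summary can be sketched on flattened weight vectors. This is a minimal illustration, not the authors' implementation. It assumes that checkpoints are plain lists of floats, that the rank-1 direction is found by power iteration on the deltas from the base weights, and that the coefficient along that direction is regressed linearly on the step index. All function names are hypothetical.

```python
import math

def rank1_fit(deltas, iters=200):
    """Power iteration for the top right-singular direction of `deltas` (T x D).

    Returns (coeffs, direction) such that deltas[t] is approximately
    coeffs[t] * direction. Assumes the deltas are not all zero and that the
    uniform start vector is not orthogonal to the dominant direction.
    """
    dim = len(deltas[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(dim)) for row in deltas]
        w = [sum(deltas[t][j] * u[t] for t in range(len(deltas)))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    coeffs = [sum(row[j] * v[j] for j in range(dim)) for row in deltas]
    return coeffs, v

def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def extrapolate(base, checkpoints, steps, target_step):
    """Predict the weights at `target_step` from a short observed prefix."""
    deltas = [[w - b for w, b in zip(ckpt, base)] for ckpt in checkpoints]
    coeffs, direction = rank1_fit(deltas)
    slope, intercept = linear_fit(steps, coeffs)
    coeff = slope * target_step + intercept
    return [b + coeff * d for b, d in zip(base, direction)]
```

Calling `extrapolate(base, checkpoints, steps, target_step)` with a few early checkpoints returns a predicted weight vector for a later step without any further training. Components of the deltas outside the fitted direction are simply dropped, which is the projection step the summary credits with a denoising effect.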