Entity · model

Qwen3-4B-Instruct

model · active · qwen3-4b-instruct-71517384 · 9 events · first seen May 25, 2026

Aliases: Qwen3-4B-Instruct, Qwen3-VL-4B-Instruct, Qwen3-4B-Instruct-2507, Qwen3-VL-8B-Instruct, Qwen3-8B-Instruct

More like this (12)

Qwen3-30B-A3B-Instruct, Qwen2-Audio-7B-Instruct, Qwen2.5-7B-Instruct-1M, Qwen3-14B, Qwen3-4B, Qwen 2.5 32B Instruct, Qwen2.5-3B, Qwen2.5-VL-32B-Instruct, Qwen2.5-Coder-32B-Instruct, Qwen3-30B, Qwen3-4B-Thinking-2507, Qwen 3.5 27B

Recent events (9)

5 · arXiv · cs.AI · 3d ago

Relay-OPD addresses prefix failure in on-policy distillation via teacher-student trajectory handoff

Researchers introduce Relay On-Policy Distillation (Relay-OPD), a training method that addresses 'prefix failure' in on-policy knowledge distillation, where a student model's early reasoning errors compound over the rest of a trajectory. The approach detects points where teacher and student continuations diverge asymmetrically, then briefly hands generation to the teacher to produce a corrective 'relay leg' before the student resumes. Evaluated on eight mathematical reasoning benchmarks using Qwen3-4B-Instruct-2507 as teacher and Qwen3-0.6B/1.7B as students, Relay-OPD outperforms standard OPD by +5.73% and the strongest baseline, FastOPD, by +1.49% on average for the 1.7B model, while also reducing training trajectory length by over 50%.
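The handoff described above can be sketched as a toy rollout loop. This is only an illustration under stated assumptions, not the paper's algorithm: `relay_rollout`, `student_next`, and `teacher_next` are hypothetical names, tokens are plain integers, and the divergence test used here (student and teacher propose different next tokens) is a stand-in for the paper's asymmetric-divergence detector, which the summary does not specify.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def relay_rollout(
    student_next: NextToken,
    teacher_next: NextToken,
    prompt: List[int],
    max_len: int,
    relay_len: int = 1,
) -> Tuple[List[int], List[str]]:
    """Generate a trajectory, handing off to the teacher for a short relay leg.

    Assumption: a divergence point is any step where the student's proposed
    token differs from the teacher's. The real detector is not described here.
    """
    tokens = list(prompt)
    owners: List[str] = []  # who produced each generated token
    while len(tokens) < max_len:
        proposal = student_next(tokens)
        if proposal != teacher_next(tokens):
            # Relay leg: the teacher generates a few corrective tokens.
            for _ in range(relay_len):
                if len(tokens) >= max_len:
                    break
                tokens.append(teacher_next(tokens))
                owners.append("teacher")
        else:
            # No divergence: the student keeps generating on-policy.
            tokens.append(proposal)
            owners.append("student")
    return tokens, owners


def toy_teacher(tokens: List[int]) -> int:
    # Toy teacher: always counts upward.
    return tokens[-1] + 1


def toy_student(tokens: List[int]) -> int:
    # Toy student: counts upward but makes one error at the fourth position.
    return -1 if len(tokens) == 3 else tokens[-1] + 1


tokens, owners = relay_rollout(toy_student, toy_teacher, [0], max_len=6)
```

In this toy run the teacher produces exactly one token, at the position where the student would have erred, and the student resumes from the corrected prefix.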

Open Weights Progress · Alignment and RLHF · on-policy distillation · Qwen3.5-0.8B · Pass the Baton: Trajectory-Relayed On-Policy Distillation · +3 more

6 · arXiv · cs.CL · Jul 14, 2026

SCOPE-RL framework densifies reward signals in RLVR to improve reasoning accuracy and token efficiency

SCOPE-RL introduces a two-stage reinforcement learning framework that addresses sparse reward limitations in RLVR by adding prefix-decomposed sub-question rewards before success and correctness-gated process-shape rewards after success. Applied to Qwen3-8B-Instruct on math reasoning datasets, the method improves average accuracy by up to 11.2 percentage points and reduces reasoning tokens by up to 27.1% over outcome-only GRPO. The gains generalize across GSPO and a smaller Qwen3-0.6B model, suggesting reward-signal densification is broadly complementary to existing RLVR advances. Code and data are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · Qwen3.5-0.8B · GRPO · DAPO-Math · +4 more

5 · The Batch · Jul 3, 2026

RoboReward: Vision-Language Reward Models for Robot Training via RL

Researchers at Stanford and UC Berkeley developed RoboReward, a family of 4B and 8B vision-language reward models designed to provide reward signals for robot reinforcement learning across diverse robot types and tasks. The team built a novel dataset by augmenting successful robot demonstrations with synthetically generated failure examples using GPT-5 mini and Qwen3-4B, then fine-tuned Qwen3-VL models to predict task progress scores. RoboReward 8B outperformed GPT-5, GPT-5 mini, and Gemini Robotics-ER 1.5 on the new RoboRewardBench evaluation, and in real-world robot trials substantially exceeded prior reward model baselines while still falling short of human-assigned rewards. The authors also release RoboRewardBench as a community benchmark for reward model evaluation.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeepLearning.AI · Stanford University · UC Berkeley · +12 more

5 · arXiv · cs.CL · Jul 1, 2026

VLMs overestimate common ground in asymmetric dialogue, conflating potential with established shared understanding

A new arXiv paper investigates whether vision-language models can distinguish between what could be shared and what has actually been established as shared between dialogue participants. Using 13,077 annotated reference expressions from HCRC MapTask dialogues, the authors find that VLMs systematically over-predict alignment when given task-relevant map content, whether presented visually or as text. This suggests the bias stems from static referential cues rather than from tracking grounding through dialogue history. The effect is observed most strongly in Qwen3-VL-8B-Instruct and replicated across four additional models from two architecture families, revealing a fundamental limitation in how current VLMs model collaborative dialogue.

Evaluation and Benchmarking · Multimodal Progress · Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue · HCRC MapTask · Qwen3-4B-Instruct

6 · The Batch · Jun 19, 2026

POPE Training Method Uses Partial Solution Hints to Improve RL Exploration in LLMs

Researchers from Carnegie Mellon University introduced Privileged On-Policy Exploration (POPE), a training method that pairs GRPO reinforcement learning with hint-augmented datasets to help LLMs solve hard problems they would otherwise fail to explore. During training, the model receives partial solution prefixes alongside full problems, enabling it to discover complete solutions; it is then trained on both hinted and unhinted versions so it learns to solve problems without hints at inference time. On competition math benchmarks AIME 2025 and HMMT 2025, POPE outperforms standard GRPO and supervised fine-tuning, with HMMT pass@1 improving from 31.0% to 37.8%. The method addresses a core bottleneck in RL training, sparse-reward exploration, by decomposing hard problem-solving into finding a good starting state and completing the solution.
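The pairing of hinted and unhinted versions of a problem can be illustrated with a small data-preparation sketch. The function name `build_prompts`, the prompt template, and the `hint_fraction` parameter are assumptions made for illustration; the summary does not say how hint prefixes are chosen or formatted, and the GRPO update itself is omitted.

```python
from typing import List, Tuple


def build_prompts(
    problem: str,
    solution_steps: List[str],
    hint_fraction: float = 0.5,
) -> List[Tuple[str, bool]]:
    """Return (prompt, is_hinted) pairs for one problem.

    Assumption: the hint is the first `hint_fraction` of a reference
    solution, appended to the problem text as a partial solution prefix.
    """
    n_hint = int(len(solution_steps) * hint_fraction)
    hint = "\n".join(solution_steps[:n_hint])
    unhinted = problem
    hinted = f"{problem}\nPartial solution:\n{hint}"
    # Both versions go into the training mix so the policy also learns
    # to solve the problem when no hint is available at inference time.
    return [(unhinted, False), (hinted, True)]


prompts = build_prompts("Q", ["s1", "s2", "s3", "s4"])
```

Here `prompts` holds the bare problem plus a version that reveals the first two of four solution steps.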

Evaluation and Benchmarking · Alignment and RLHF · Virginia Smith · Carnegie Mellon University · Aviral Kumar · +8 more

6 · arXiv · cs.CL · Jun 12, 2026

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

Agent and Tool Ecosystem · Multimodal Progress · LabVLA · LabUtopia · Qwen3-4B-Instruct · +3 more

5 · arXiv · cs.AI · Jun 10, 2026

EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents

EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ACE · DeepSeek V4 · Qwen3-4B-Instruct · +2 more

7 · arXiv · cs.CL · May 26, 2026

MobileGym: Verifiable Parallel Simulation Platform for Mobile GUI Agent Training

MobileGym is a browser-hosted simulation environment for mobile GUI agent research that enables deterministic outcome verification via structured JSON state and scalable online RL through hundreds of parallel instances (~400 MB/instance, ~3s cold start). The accompanying MobileGym-Bench provides 416 parameterized task templates across 28 apps with deterministic judges. A sim-to-real case study using GRPO on Qwen3-VL-4B-Instruct achieves +12.8 percentage points on the 256-task test set, with real-device execution retaining 95.1% of simulation-side training gains.
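Deterministic outcome verification over structured JSON state can be pictured as a pure function from a state snapshot to pass/fail. The sketch below is hypothetical: the `judge` and `lookup` helpers, the dotted-path convention, and the example state are assumptions for illustration and are not taken from MobileGym's actual judge format.

```python
import json
from typing import Any, Dict


def lookup(state: Dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as 'alarm.hour' into nested dicts."""
    node: Any = state
    for key in dotted_path.split("."):
        node = node[key]
    return node


def judge(state_json: str, expected: Dict[str, Any]) -> bool:
    """Pass only if every expected path holds exactly the expected value.

    No model call and no randomness: the same state always yields the
    same verdict, which is what makes the reward signal verifiable.
    """
    state = json.loads(state_json)
    try:
        return all(lookup(state, path) == value for path, value in expected.items())
    except (KeyError, TypeError):
        # A missing or malformed path counts as task failure.
        return False


final_state = '{"alarm": {"hour": 7, "enabled": true}}'
passed = judge(final_state, {"alarm.hour": 7, "alarm.enabled": True})
```

A judge of this shape can be parameterized per task template by swapping the `expected` mapping.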

Evaluation and Benchmarking · Inference Economics · GRPO · MobileGym-Bench · Qwen3-4B-Instruct · +4 more

6 · arXiv · cs.LG · May 25, 2026

Training-Free Looped Transformers: Inference-Time Recurrence via ODE-Motivated Layer Reapplication

The paper introduces a method to retrofit recurrence onto frozen pretrained transformer checkpoints at inference time by looping a contiguous mid-stack block of layers without any fine-tuning or architectural changes. Naive block reapplication degrades performance, so the authors motivate their approach by treating pre-norm transformer blocks as forward Euler ODE steps and replacing one large update with smaller damped sub-steps. Evaluated across seven model families including dense, sparse MoE, and MLA+MoE architectures, the method yields consistent benchmark improvements (e.g., +2.64 pp on MMLU-Pro for Qwen3-4B-Instruct) at no training cost.
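The forward Euler reading can be made concrete with a scalar toy. A pre-norm residual block computes x + f(x), which is one Euler step of size 1 for dx/dt = f(x); splitting that into n sub-steps of size 1/n gives a smaller, damped update per application. The names `euler_step` and `damped_substeps`, the uniform 1/n step size, and the toy `f` are assumptions for illustration; the summary does not give the paper's actual damping schedule or which layers are looped.

```python
from typing import Callable


def euler_step(f: Callable[[float], float], x: float, step: float = 1.0) -> float:
    """One forward Euler step; step=1.0 matches a residual update x + f(x)."""
    return x + step * f(x)


def damped_substeps(f: Callable[[float], float], x: float, n_substeps: int) -> float:
    """Replace one full-size update with n smaller steps of size 1/n (assumed)."""
    step = 1.0 / n_substeps
    for _ in range(n_substeps):
        x = euler_step(f, x, step)
    return x


def toy_block(x: float) -> float:
    # Toy stand-in for a block's residual branch: a stiff linear pull to 0.
    return -2.0 * x


full = euler_step(toy_block, 1.0)            # single large update
damped = damped_substeps(toy_block, 1.0, 4)  # four damped sub-steps
```

With this toy branch the single large step overshoots the fixed point at 0 and flips sign (1.0 becomes -1.0), while the four sub-steps shrink toward it (1.0 becomes 0.0625), which is the intuition behind preferring damped sub-steps to naive reapplication.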

Frontier Model Releases · Inference Economics · CommonsenseQA · OpenBookQA · Forward Euler ODE · +6 more