
Qwen3-4B
qwen3-4b-7738899d · 27 events · first seen 1mo ago
Aliases: Qwen3-4B, Qwen3-8B, Qwen3-VL-8B, Qwen3.5-4B, Qwen3.5-9B, Qwen3-VL, Qwen3.5-VL
Merged from: Qwen3-8B
Recent events (27)
Qwen releases Qwen3.5-9B multimodal model on Hugging Face
Qwen has released Qwen3.5-9B, a 9-billion parameter image-text-to-text model, on Hugging Face. The model supports conversational use cases and is compatible with Azure deployment endpoints. With over 9 million downloads and 1,500+ likes, it has seen substantial community uptake.
Qwen releases Qwen3.5-4B multimodal model on Hugging Face
Qwen has released Qwen3.5-4B, a 4-billion parameter image-text-to-text model, on Hugging Face. The model supports conversational use and is compatible with Azure deployment endpoints. With over 10 million downloads and 604 likes, it has seen substantial community uptake.
Language models linearly encode a 'value axis' tracking expected goal success, study finds
Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.
Accelerating Qwen3-8B Agent on Intel Core Ultra with Depth-Pruned Draft Models
Hugging Face and Intel demonstrate speculative decoding acceleration for the Qwen3-8B model on Intel Core Ultra client hardware using depth-pruned draft models. The approach applies structured pruning to create a smaller draft model that enables speculative decoding, targeting on-device agent workloads. This work addresses inference efficiency for mid-size open-weight models on consumer-grade x86 silicon.
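As a rough illustration of the speculative decoding loop this item refers to, the sketch below uses two toy next-token functions in place of real models. It is a minimal greedy variant under stated assumptions, not the Hugging Face/Intel implementation: `toy_target`, `toy_draft`, and the block size `k` are hypothetical, and a real system would have the target model verify all drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

# A "model" here is just a function mapping a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes up to k tokens,
    the target keeps the longest agreeing prefix and fixes the first mismatch."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks each proposed token in order.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                # Keep the agreeing prefix, then the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token matched the target.
            tokens.extend(proposal)
    return tokens


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the full model."""
    return (3 * tokens[-1] + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical stand-in for a pruned draft: sometimes disagrees with the target."""
    guess = toy_target(tokens)
    return guess if tokens[-1] % 3 else (guess + 1) % 7
```

With greedy verification the output is identical to decoding with the target alone; the speedup comes only from how many drafted tokens are accepted per verification step.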
Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find
This paper distinguishes two protocols for measuring transformer layer redundancy—replacement (can one layer substitute for another in place?) and interchange (do two layers approximately commute when swapped?)—and shows they can disagree substantially. Experiments on Pythia (410M, 1.4B) and 8B-scale models (Qwen3-8B, Llama-3.1-8B) reveal that the protocol gap grows during training and can change which layers appear safe to prune by several-fold. Notably, Qwen3-8B shows interchange-guided removal is far safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols despite lower interchange KL. The authors recommend scoring both swap-KL metrics before any layer removal or merging, requiring only unlabeled forward passes.
Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning
This paper introduces IH-GRPO, a reinforcement learning algorithm that decouples tool invocation from immediate execution during LLM reasoning, addressing the coherence disruption caused by tight coupling in existing tool-integrated reasoning (TIR) approaches. The authors propose a hierarchical control framework and derive a surrogate loss enabling an implicitly hierarchical policy to match the behavior of an explicit hierarchical policy. Experiments on Qwen3 models (1.7B, 4B, 8B) show absolute improvements of 1.87 to 2.53 percentage points across six out-of-domain mathematical reasoning benchmarks over the strongest baseline. Code is publicly released.
ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs
ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.
AdaCodec: Predictive Visual Coding for Efficient Video MLLMs
AdaCodec introduces a predictive visual code interface for video multimodal large language models that exploits temporal redundancy in video. Instead of encoding every sampled frame as an independent RGB image, it sends full visual tokens only for reference frames with high conditional predictive cost, and encodes inter-frame changes as compact P-tokens. Evaluated against a Qwen3-VL-8B per-frame baseline across eleven benchmarks, AdaCodec at 1/7 the token budget (32k vs 224k tokens) surpasses the baseline on all long-video benchmarks while reducing time-to-first-token from 9.26s to 1.62s.
Alibaba releases Qwen3.5 open-weights vision-language model family with MoE architecture across eight sizes
Alibaba released the Qwen3.5 family of eight open-weights vision-language models ranging from 0.8B to 397B parameters, built on a mixture-of-experts architecture with mixed attention and Gated DeltaNet layers. The flagship Qwen3.5-397B-A17B outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, while the 9B model surpasses OpenAI's gpt-oss-120B on most language tasks. Open weights are available under Apache 2.0, with hosted agentic variants (Qwen3.5-Plus, Qwen3.5-Flash) available via Alibaba Cloud. The release is notable for strong small-model efficiency and comes amid reported team departures following the Qwen3 rollout.
Qwen3.5 Small tops mobile-sized open models; GPT-5.3 Instant, Gemini 3.1 Flash-Lite, Claude memory import, and LLM deanonymization research
Alibaba released the Qwen3.5 Small model series (0.8B–9B parameters) with a hybrid Gated Delta Networks + sparse MoE architecture, with the 9B model outperforming OpenAI's gpt-oss-120B on GPQA Diamond despite being 13.5x smaller; all weights are Apache 2.0 licensed. Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model at $0.25/M input tokens with 2.5x faster TTFT than Gemini 2.5 Flash. OpenAI released GPT-5.3 Instant targeting conversational quality improvements and hallucination reduction, while Anthropic added memory import/export functionality across all Claude tiers. Separately, researchers from MATS, Anthropic, and ETH Zurich demonstrated that LLM-based pipelines can deanonymize pseudonymous online users at 68% recall/90% precision for $1–4 per profile.
FADA: Unified vision-language model for fetal ultrasound interpretation deployable on consumer smartphones
FADA is a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation of fetal ultrasound images through a single pipeline without requiring external labels at inference. The system distills knowledge from four domain-specific foundation models using selective distillation, achieving 0.8820 mean Dice for segmentation and 0.7671 mAP@0.50 for detection, with expert validation confirming clinically acceptable outputs. Notably, the compressed 0.8B model runs entirely offline on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1) in approximately 60 seconds, targeting diagnostic access gaps in low- and middle-income countries where trained sonographers are scarce. Code, models, and data are publicly released.
RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy
Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.
Qwen3 Release: Flagship 235B MoE and Full Model Family Announced
Alibaba's Qwen team has released Qwen3, a new family of large language models including the flagship Qwen3-235B-A22B mixture-of-experts model. The flagship model claims competitive benchmark performance against DeepSeek-R1, OpenAI o1/o3-mini, Grok-3, and Gemini-2.5-Pro on coding, math, and general capabilities. A smaller MoE variant, Qwen3-30B-A3B, reportedly outperforms QwQ-32B despite using only one-tenth the activated parameters, and the 4B model is said to match Qwen2.5's larger models. Models are available across Hugging Face, ModelScope, and Kaggle.
LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs
LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.
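To make the contrast in this item concrete, the sketch below computes a GRPO-style group-relative advantage next to a simple pairwise aggregation of reward gaps. It is only an illustration under assumptions: the population standard deviation in `grpo_advantages` is one common choice, and the optional `weights` matrix is a hypothetical placeholder for LamPO's confidence-aware weighting, whose exact form is not given in the summary.

```python
from statistics import mean, pstdev
from typing import List, Optional


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Scalar group-relative advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def pairwise_advantages(rewards: List[float],
                        weights: Optional[List[List[float]]] = None) -> List[float]:
    """Aggregate pairwise reward gaps r_i - r_j over the other responses in the group.

    weights[i][j] is a hypothetical per-comparison weight; with weights=None
    every comparison counts equally. Requires at least two responses.
    """
    g = len(rewards)
    advantages = []
    for i in range(g):
        total = 0.0
        for j in range(g):
            if i == j:
                continue
            w = 1.0 if weights is None else weights[i][j]
            total += w * (rewards[i] - rewards[j])
        advantages.append(total / (g - 1))
    return advantages
```

With uniform weights the pairwise form is just a rescaled mean-centered reward, so any difference from GRPO in practice has to come from how the individual comparisons are weighted.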
Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference
PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.
Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals
This paper investigates uncertainty quantification (UQ) for activation oracles—systems that make LLM internal activations human-legible—by evaluating 6 confidence estimation methods across 6,000 samples per oracle. The authors find that bootstrap mode frequency achieves the best calibration (ECE 5.7% vs. 25.5% for log-probability baseline on Qwen3-8B), while the log-prob baseline remains useful as a cheap triage signal. Experiments vary verbalizer and context prompts across two Qwen3 model sizes. Code and a patched trainer are released publicly.
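For readers unfamiliar with the ECE figures quoted here, the following is a minimal sketch of expected calibration error with equal-width confidence bins. The bin count of 10 and the function name are assumptions for illustration; the paper's exact binning scheme is not stated in the summary.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE: weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A lower value means stated confidence tracks observed accuracy more closely, which is the sense in which 5.7% is better calibrated than 25.5%.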
Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA
This paper investigates why NLI-based claim checkers used as process rewards in RL-trained medical RAG agents succeed or fail during training. The authors find that a checker's output distribution during training—not its held-out accuracy—determines whether it provides useful gradient signal, with LLM log-probability scoring causing near-total signal collapse (97%+ neutral labels) while a calibrated MedNLI classifier avoids this. A key finding is that stronger checkers can trigger reward hacking cascades (ultra-short answers, search avoidance, language collapse), while moderate-signal local classifiers yield better final model quality (+12% BERTScore over zero-shot). The work frames these as boundary conditions for verifier-as-reward systems in RLVR pipelines.
AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning
This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.
Recursive Language Models Offer Path To Dramatically Expand Beyond the Context Window
MIT researchers Alex L. Zhang, Tim Kraska, and Omar Khattab propose Recursive Language Models (RLMs), a framework that offloads long-context processing to an external Python REPL environment, allowing models to programmatically fetch and manage text chunks via code generation. The root model spawns submodel instances to handle subtasks, aggregating their outputs recursively. Evaluated on benchmarks requiring reasoning over documents up to 11 million tokens, RLMs substantially outperform both base models and competing agentic strategies such as retrieval and summarization agents. For example, RLM-GPT-5 achieved 91.3% on BrowseComp+ versus GPT-5's inability to produce an answer, and ~50% accuracy on OOLONG-PAIRS at 1 million tokens versus near-zero for baseline approaches.
PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards
Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.
Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'
A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.
SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs
Researchers introduce SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning), a framework addressing catastrophic forgetting in LLMs via adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.
Location metadata causes systematic geographic bias leakage in LLMs, even with 'Unknown' placeholders
Researchers evaluate 'location leakage' — the phenomenon where LLMs generate geographically biased outputs when exposed to location metadata in user profiles, even when prompts are geographically neutral. Across creative writing and Q&A tasks, leakage spikes up to 793x above baseline for models including Llama 3.1-8B, Qwen3-8B, and Claude Sonnet 4.6. A novel structural finding shows that replacing location with 'Unknown' still elevates leakage by up to 72x, indicating the user profile frame itself acts as a conditioning signal independent of geographic content. This has direct implications for AI systems that use user metadata for localization.
Dual-adapter routing system improves knowledge editing precision in LLMs
A new arXiv paper introduces a route-specialized dual-adapter architecture for knowledge editing in LLMs, separating the concerns of writing edits (edit adapter) and suppressing them when irrelevant (locality adapter). A relevance router gates which adapter is applied, addressing the locality problem in memory-assisted editing. Evaluated on CounterFact, zsRE, and MQuAKE benchmarks using Llama-3.1-8B-Instruct and Qwen3-8B, the method achieves best-in-class probability-preference accuracy across all three datasets. Ablations show the gain comes from the architectural separation rather than increased parameter capacity.
HyperTool: Unified executable MCP-style interface reduces step-wise tool call overhead for LLM agents
HyperTool introduces a unified executable interface that allows LLM agents to invoke multiple tool calls within a single code block, hiding intermediate dataflow from the main reasoning trace. This addresses an 'execution-granularity mismatch' where step-wise atomic tool calls waste context and force models to manage low-level operations. On the MCP-Universe benchmark, HyperTool more than doubles accuracy for Qwen3-32B (15.69% → 35.29%) and Qwen3-8B (9.93% → 33.33%), outperforming GPT-OSS and Kimi-k2.5.
BRANE: Natural Language Query-to-Configuration Selection for Retrieval Agents
BRANE is a system that dynamically selects retrieval agent pipeline configurations (LLM, retriever, number of hops, synthesis strategy) at inference time based on per-query characteristics and a cost-quality target. It uses an LLM to extract workload features from each query, then applies lightweight per-configuration predictors to estimate correctness, selecting the configuration that maximizes predicted accuracy penalized by cost. Evaluated on MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE matches best-fixed-configuration accuracy at up to 89% lower cost and outperforms LLM-routing and fine-tuned Qwen3-4B baselines. The work frames per-query pipeline configuration as a practical alternative to static workload-level tuning.
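The selection rule described in this item can be sketched as an argmax over cost-penalized predicted accuracy. The names `Config` and `select_config`, the linear penalty, and the example numbers are all hypothetical; the summary does not specify BRANE's actual scoring function or predictors.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Config:
    """One retrieval-agent pipeline configuration (hypothetical fields)."""
    name: str
    cost: float  # e.g. estimated dollars or latency per query


def select_config(predicted_accuracy: Dict[str, float],
                  configs: Sequence[Config],
                  cost_weight: float) -> Config:
    """Pick the configuration with the best predicted accuracy minus a cost penalty.

    predicted_accuracy maps config name to the per-query estimate produced by
    that configuration's predictor; cost_weight sets the cost-quality target.
    """
    return max(configs,
               key=lambda c: predicted_accuracy[c.name] - cost_weight * c.cost)


# Illustrative values only, not results from the paper.
EXAMPLE_CONFIGS = [Config("small-llm-1hop", 1.0), Config("large-llm-3hop", 10.0)]
EXAMPLE_PREDICTIONS = {"small-llm-1hop": 0.70, "large-llm-3hop": 0.85}
```

Raising `cost_weight` shifts selection toward cheaper pipelines for queries where the predicted accuracy gap is small, which is how a per-query selector can approach best-fixed-configuration accuracy at lower average cost.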