Entity · model

Qwen3-4B

model · active · qwen3-4b-7738899d · 52 events · first seen May 18, 2026

Aliases: Qwen3-4B, Qwen3-8B, Qwen3-VL-8B, Qwen3.5-4B, Qwen3.5-9B, Qwen3-VL, Qwen3.5-VL, Qwen3-VL 8B, Qwen3-VL 4B, Qwen3.5 4B, Qwen3.5 9B

Merged from

Qwen3-8B

Co-occurring entities

More like this (12)

Qwen3 Qwen3-235B Qwen3-4B-Base Qwen3-30B-A3B Qwen1.5-7B Qwen 2.5-7B Qwen3-14B Qwen3-8B-Base Qwen3-14B-Base Qwen-7B Qwen 3.7 Qwen 3.5

Guides (1)

Qwen3-4B

Qwen3-4B: Alibaba's Compact Open-Weight Model That Punches Above Its Size

Recent events (50)

6arXiv · cs.AI·18h ago·source ↗

Empirical study finds inference-time scaling yields diminishing returns for local computer-use agents

Researchers present a systematic empirical study of inference-time scaling across four dimensions (contextual, temporal, structural, parallel) for locally-deployed computer-use agents under hardware constraints. Evaluating Qwen3-VL-8B/30B-A3B, UI-TARS-1.5-7B, and OpenCUA-7B on OSWorld, they find that additional compute often shifts rather than eliminates failure modes—contextual scaling saturates, temporal scaling extends erroneous trajectories, and structural decomposition adds overhead. The findings argue for selective compute allocation and failure-aware control mechanisms tailored to local model capabilities.

Evaluation and Benchmarking Inference Economics Qwen3-30B-A3B Qwen3-4B OpenCUA-7B +4 more

5arXiv · cs.CL·41h ago·source ↗

Multilingual study finds LLMs are not uniformly robust to non-canonical tokenizations, with up to 23.7% performance drops

A new arXiv paper investigates how language models behave when given alternative (non-canonical) tokenizations of the same input string across 27 languages and six downstream tasks. While prior work showed English models are largely invariant to such perturbations, the study finds this does not generalize: Llama-3.1-8B drops 23.7% on average, Qwen3-8B 11.4%, and Gemma-3-12B 9.9% in relative performance. Languages with higher token fragmentation are systematically more sensitive, and the authors show LoRA fine-tuning on multi-tokenization data—including English-only data—provides meaningful mitigation.

Evaluation and Benchmarking LoRA Qwen3-4B Gemma 3 12B Instruct +1 more

4arXiv · cs.CL·41h ago·source ↗

Linear readouts of LLM hidden states decode causal reasoning about diagnostic evidence

Researchers introduce a paired-prompt benchmark testing whether language models can correctly match diagnostic evidence to causal claims that vary by population, estimand, or identifying assumptions — a task where surface-level cues can mislead. Using linear probes on final-token hidden states from Qwen2.5-7B, Qwen3-8B, and Llama-3.1-8B, they find balanced accuracy of 0.654–0.659 on a 49-pair benchmark spanning nine diagnostic families, exceeding permutation nulls and text-only baselines. The key finding is that hidden states contain linearly decodable information about causal relevance that is not fully captured by output logits or surface features.

Evaluation and Benchmarking Qwen2.5-7B-Instruct-1M Llama3-8B-Instruct Same Evidence, Different Target: Decoding How Diagnostic Evidence Bears on Causal Questions from Language-Model States +1 more

5arXiv · cs.CL·2d ago·source ↗

MemSFT: Plug-and-play parametric memory module mitigates alignment tax in domain-specialized LLMs

Researchers propose MemSFT, a method that decouples domain specialization from backbone parameter updates by training a plug-and-play parametric memory module to imitate a non-parametric retriever over domain data. A learned router dynamically fuses the memory and backbone output distributions at each decoding step, allowing selective invocation of domain expertise. Evaluated across biology, geoscience, and law on models from Qwen3-8B to Qwen3-235B-A22B, MemSFT consistently improves domain performance with negligible general-task degradation, whereas full SFT causes severe catastrophic forgetting. The memory module is reusable across LLM sizes, offering a practical path to modular domain specialization.

Enterprise Deployment Patterns Alignment and RLHF MemSFT Qwen3-4B Qwen3-235B
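The per-step fusion of memory and backbone distributions described in the MemSFT summary can be pictured with a small sketch. The summary does not state the fusion rule, so the linear interpolation below, the function name `fuse_distributions`, and the scalar `gate` standing in for the learned router's output are all assumptions made for illustration, not the paper's method.

```python
def fuse_distributions(p_backbone, p_memory, gate):
    """Mix two next-token distributions with a router weight in [0, 1].

    ASSUMPTION: linear interpolation; the paper's actual fusion rule
    is not given in the summary above.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(p_backbone) != len(p_memory):
        raise ValueError("distributions must share a vocabulary")
    return [(1.0 - gate) * pb + gate * pm
            for pb, pm in zip(p_backbone, p_memory)]


# Toy three-token vocabulary (hypothetical numbers).
p_backbone = [0.5, 0.25, 0.25]   # general-purpose model
p_memory = [0.0, 1.0, 0.0]       # domain memory module

fused = fuse_distributions(p_backbone, p_memory, gate=0.5)
backbone_only = fuse_distributions(p_backbone, p_memory, gate=0.0)
```

With `gate=0.0` the backbone distribution passes through unchanged, which is one way to read the "selective invocation of domain expertise" in the summary: the router can switch the memory off on general-task tokens.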

6arXiv · cs.CL·4d ago·source ↗

Nanbeige4.2-3B: A 3B-parameter agentic model outperforming 9B–12B models on agentic benchmarks

Researchers present Nanbeige4.2-3B, a model with 3 billion non-embedding parameters designed for agentic tasks including code-agent, office-agent, and complex tool use. The model is pretrained on 28T tokens using a Looped Transformer architecture that reuses layer stacks to increase capacity without adding parameters, and trained with a multi-stage RL pipeline combining mixed-mode RLHF, length-controlled reasoning RL, and agentic RL with outcome and process rewards. Evaluations claim Nanbeige4.2-3B outperforms larger models including Qwen3.5-9B and Gemma4-12B on diverse agentic benchmarks while remaining competitive on reasoning tasks. The result contributes to the ongoing question of how much agentic capability can be packed into compact, locally-deployable models.

Open Weights Progress Inference Economics Gemma-4 E4B-it Looped Transformer OpenClaw +3 more

5arXiv · cs.AI·Jul 24, 2026·source ↗

Visual Contrastive Self-Distillation (VCSD) improves multimodal LLM training without external teachers

Researchers propose Visual Contrastive Self-Distillation (VCSD), a training method for vision-language models that creates an on-policy self-distillation signal purely through input conditioning — specifically by contrasting token distributions from an EMA teacher conditioned on the original image versus a content-erased control. The approach eliminates the need for external teachers, privileged answers, visual evidence signals, or reasoning traces. Evaluated on Qwen3-VL and Qwen3.5 models using the ViRL39K dataset, VCSD consistently outperforms matched on-policy self-distillation baselines, with aggregate benchmark gains of up to ~3.75 percentage points at the 8B scale.

Alignment and RLHF Multimodal Progress Visual Contrastive Self-Distillation ViRL39K Qwen3-4B +1 more

5arXiv · cs.CL·Jul 23, 2026·source ↗

SelectBench and DAPO post-training for selective evidence adoption in RAG contexts

Researchers introduce SelectBench, a benchmark and training set for evaluating whether retrieval-augmented LLMs can selectively adopt valid evidence while rejecting misleading or injected content. They post-train Qwen3.5-4B using DAPO with rule-based and semantic judge rewards, achieving modest but directional improvements on SelectBench-v2 (22.46% to 26.46% strict success). Gains do not survive Holm multiple-comparison correction, and prompt-injection resistance shows no improvement, leaving statistical robustness and injection resistance as open challenges. General capabilities on MMLU and HotpotQA are preserved.

Evaluation and Benchmarking AI Safety Research SelectBench DAPO DeepSeek V4 +3 more
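The remark that the gains "do not survive Holm multiple-comparison correction" refers to Holm's step-down procedure, sketched below. The p-values are hypothetical and are not taken from the paper; the function name `holm_reject` is likewise illustrative.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans, True where the hypothesis at that
    position is rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values for three benchmark comparisons.
p_values = [0.03, 0.04, 0.20]
uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_reject(p_values)
```

In this toy case two comparisons look significant on their own, but none passes after correction because the smallest p-value already exceeds 0.05 / 3.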

4arXiv · cs.CL·Jul 22, 2026·source ↗

Reinforcement learning with verifiable rewards improves small-model legal machine translation

Researchers evaluate multiple training paradigms for legal machine translation, comparing supervised fine-tuning and reinforcement learning with verifiable rewards (RLVR) on small models (Qwen3.5 4B/9B, Gemma 3 12B) against frontier reasoning models. Using the Swiss multilingual legal system as a testbed, they find RLVR outperforms SFT for legal NMT and brings small models close to frontier performance, though a gap remains. The study also observes diminishing returns from re-training as model size increases. Code and models are publicly released.

Evaluation and Benchmarking Alignment and RLHF Reasoning Before Translation: Enhancing Legal Machine Translation with Structured Reasoning Qwen3-4B Gemma 3 12B Instruct

5arXiv · cs.CL·Jul 21, 2026·source ↗

Soft prefixes can override correct logical judgments in LLMs, revealing model-specific stability limits

Researchers use learned soft prefixes (opaque continuous vectors) to probe the logical stability of LLMs on syllogistic reasoning benchmarks without modifying model weights. Across Qwen3.6-35B-A3B MoE, Qwen3-8B, and Gemma 4 31B, successful prefixes redirect correct answers at flip rates of 72–90% for Qwen3.6 MoE and 54–56% for Gemma, far exceeding random controls by 37–99 percentage points. The dominant effect is a broad answer-preference bias rather than symbol-level forcing, and the bias generalizes across unseen logical forms and prompt interfaces. Model-specific differences in how this bias manifests suggest substantial variation in logical robustness across architectures.

Evaluation and Benchmarking AI Safety Research Qwen3.6-35B-A3B soft prefix Google +4 more

6arXiv · cs.AI·Jul 20, 2026·source ↗

Model merging matches joint multi-task RL training on AppWorld benchmark, explained by near-orthogonal task vectors

A new arXiv paper provides the first direct comparison of model merging versus joint multi-task reinforcement learning training, using Qwen3-8B specialists trained on the AppWorld agent benchmark with the LOOP algorithm. Merging methods (TIES, RAM+) statistically match jointly trained models on task-goal completion. The authors explain this via task vector geometry: specialist task vectors are near-orthogonal (cosine similarity 0.06–0.10) despite ~65% parameter support overlap, causing sign- and support-based merging methods to collapse to near-uniform averaging.

Evaluation and Benchmarking Agent and Tool Ecosystem RAM+ When Model Merging Rivals Joint Multi-Task Reinforcement Learning: A Task-Vector Geometry Analysis TIES +4 more
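The task-vector geometry in the model-merging summary can be made concrete with a toy example. A task vector is the difference between fine-tuned and base weights; the sketch below computes the cosine similarity of two such vectors and merges them by uniform averaging. The four-element weight lists are invented for illustration, and the helper names are hypothetical; real task vectors span billions of parameters.

```python
import math


def task_vector(finetuned, base):
    """Element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_uniform(base, task_vectors):
    """Add the mean of the task vectors back onto the base weights."""
    n = len(task_vectors)
    return [b + sum(tv[i] for tv in task_vectors) / n
            for i, b in enumerate(base)]


# Toy weights (hypothetical): both specialists modify coordinates 0 and 1,
# so their supports overlap, yet the resulting task vectors are orthogonal.
base = [1.0, 1.0, 1.0, 1.0]
specialist_a = [2.0, 2.0, 1.0, 1.5]
specialist_b = [2.0, 0.0, 1.5, 1.0]

tv_a = task_vector(specialist_a, base)
tv_b = task_vector(specialist_b, base)
similarity = cosine(tv_a, tv_b)
merged = merge_uniform(base, [tv_a, tv_b])
```

The toy shows that shared support does not imply aligned directions, which is the pattern the summary reports (cosine similarity 0.06–0.10 despite roughly 65% support overlap).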

6arXiv · cs.CL·Jul 15, 2026·source ↗

Function-Aware Fill-in-the-Middle Mid-Training Improves Coding Agent Foundation Models

Researchers propose a self-supervised mid-training objective called function-aware fill-in-the-middle (FIM) that exploits the structural isomorphism between a coding agent's action-observation-continuation loop and function call sites in ordinary code. Applied to Qwen2.5-Coder-Instruct (7B/14B) and Qwen3-8B on a 2.6B-token GitHub corpus, the method yields +2.8 to +5.4 point gains on SWE-Bench-Verified and SWE-Bench-Lite across multiple post-training pipelines. Notably, the technique also mitigates capability erosion on non-agent coding and tool-use benchmarks, suggesting the function-call inductive bias generalizes beyond the training domain.

Frontier Model Releases Evaluation and Benchmarking SWE-Smith SWE-Bench Lite Qwen2.5-Coder-32B-Instruct +8 more

5arXiv · cs.CL·Jul 14, 2026·source ↗

MET: Theory-grounded multilingual moral reasoning with cultural adaptation and self-distillation

Researchers introduce three contributions to address gaps in multilingual moral reasoning for LLMs: MCLASH, a culturally situated multilingual moral decision-making benchmark; MET, a two-step prompting method grounded in psychological and philosophical moral theory; and MET-D, a self-distillation training variant requiring no external supervision. MET-D improves macro-F1 by an average of 3.71 points on MCLASH and 4.23 on MMoralExceptQA across Qwen3-4B, Qwen3-8B, and Gemma3-4B, with a peak gain of 12.94 points for Malay on Qwen3-8B. The work also finds that native-language reasoning increases by 62 points on average and that beneficial moral grounds differ systematically across cultures.

Evaluation and Benchmarking Alignment and RLHF MET: Theory-Grounded and Culture-Aware Multilingual Moral Reasoning Gemma-3-4B-IT Qwen3-4B +2 more

5arXiv · cs.CL·Jul 13, 2026·source ↗

Test-time scaling for small VLMs on multilingual visual MCQ: conditions matter more than methods

A new arXiv paper examines whether test-time scaling (TTS) transfers to small open vision-language models using EXAMS-V, a multilingual visual multiple-choice benchmark. The study compares self-consistency, describe-then-reason with PRM-guided beam search, and post-hoc selectors across Qwen2.5-VL-7B-Instruct and Qwen3.5-4B. Key findings: prompt parseability and decoding budget (token limit) dominate gains, while elaborate search/verification methods like PRM-guided beam search underperform plain majority vote at 8x the cost. The best configuration achieves 84.1% on ImageCLEF 2026 test split, ranking first on the Visual MCQ leaderboard.

Evaluation and Benchmarking Inference Economics ImageCLEF 2026 Test-Time Scaling for Small VLMs on Multilingual Visual MCQ Qwen2.5-7B-Instruct-1M +3 more
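The "plain majority vote" baseline in the test-time scaling summary is self-consistency: sample several answers and keep the most frequent one. A minimal sketch follows, assuming each sampled output has already been parsed to an option letter, with `None` marking an unparseable sample; the sample data and the function name `majority_vote` are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent parsed answer, or None if nothing parsed.

    Unparseable samples (None) are dropped before counting, which is why
    prompt parseability directly limits what voting can recover.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Eight hypothetical samples for one multiple-choice question.
samples = ["B", "A", "B", None, "B", "C", "A", "B"]
final_answer = majority_vote(samples)
```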

6The Batch·Jul 10, 2026·source ↗

Brain2Qwerty v2 translates MEG brain waves to text with 39% word error rate

Researchers from Meta and several French and Spanish institutions released Brain2Qwerty v2, a non-invasive brain-computer interface system that decodes magnetoencephalography (MEG) signals into text using a CNN/conformer encoder, a word-aligner, and a fine-tuned Qwen3-4B language model with per-subject LoRA adapters. The system achieves a 39% word error rate on 9 subjects, down from 43% in v1, trained on 90 hours of MEG recordings. A notable finding is that cross-subject training substantially outperforms single-subject training, suggesting a data-scaling dynamic analogous to LLM pretraining. Training code and v1 data have been open-sourced.

Evaluation and Benchmarking Multimodal Progress French National Centre for Scientific Research Basque Center on Cognition, Brain, and Language Qwen3-4B +4 more

5arXiv · cs.CL·Jul 10, 2026·source ↗

DominoTree: Training-free conditional tree drafting for speculative decoding achieves 6.6x LLM inference speedup

DominoTree is a new training-free speculative decoding method that constructs best-first draft trees scored by Domino's conditional, non-factorized GRU-based correction along each root-to-node path. On Qwen3-4B across eight benchmarks, it achieves up to 6.6x speedup over autoregressive decoding and a mean accept length of up to 10.7 tokens per round, outperforming prior methods including DDTree, CaDDTree, DFlash, and the base Domino decoder. A GPU-native CUDA-graph tree builder provides 9-10% throughput gains over Domino overall, with up to +22% on Alpaca, while maintaining bit-identical acceptance behavior.

Inference Economics DOMINO DDTree speculative decoding +3 more

5The Batch·Jul 3, 2026·source ↗

RoboReward: Vision-Language Reward Models for Robot Training via RL

Researchers at Stanford and UC Berkeley developed RoboReward, a family of 4B and 8B vision-language reward models designed to provide reward signals for robot reinforcement learning across diverse robot types and tasks. The team built a novel dataset by augmenting successful robot demonstrations with synthetically generated failure examples using GPT-5 mini and Qwen3-4B, then fine-tuned Qwen3-VL models to predict task progress scores. RoboReward 8B outperformed GPT-5, GPT-5 mini, and Gemini Robotics-ER 1.5 on the new RoboRewardBench evaluation, and in real-world robot trials substantially exceeded prior reward model baselines while still falling short of human-assigned rewards. The authors also release RoboRewardBench as a community benchmark for reward model evaluation.

Evaluation and Benchmarking Agent and Tool Ecosystem DeepLearning.AI Stanford University UC Berkeley +12 more

5arXiv · cs.AI·Jul 3, 2026·source ↗

ReContext: Training-free recursive evidence replay improves LLM long-context reasoning

Researchers introduce RECONTEXT, a training-free inference-time method for improving long-context reasoning in LLMs. The approach uses model-internal relevance signals to build a query-conditioned evidence pool that is replayed before final generation, without modifying the original context, external memory, or context pruning. Experiments across eight long-context datasets at 128K context length show consistent improvements on Qwen3-4B, Qwen3-8B, and Llama3-8B. The authors provide a theoretical grounding via associative memory theory, framing attention as cue-trace association and replay as trace reactivation.

Long Context Evolution Agent and Tool Ecosystem Llama3-8B Yanjun Zhao Qwen3-4B +1 more

5arXiv · cs.CL·Jun 29, 2026·source ↗

NLL-guided training-free method selects optimal full-attention layers for efficient long-context inference

Researchers propose NLL-guided layer selection, a training-free technique for hybrid attention models that identifies which layers should use full versus sliding-window attention by measuring negative log-likelihood degradation on answer tokens. On LongMemEval with Qwen3-4B, the method achieves 64.6% accuracy using only 1/4 full-attention layers, matching a 1/2-FA periodic baseline while halving compute, and outperforming a periodic 1/4-FA baseline by 10.4 percentage points. The calibration procedure requires approximately 15 minutes of one-time compute, making it practical for deployment. The work advances the efficiency-accuracy tradeoff for long-context LLM inference without requiring any retraining.

Long Context Evolution Inference Economics LongMemEval Qwen3-4B NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation +1 more

3Deepseek·Jun 28, 2026·source ↗

DeepSeek releases EAGLE3 speculative decoding draft model for Qwen3-4B

DeepSeek published eagle3_qwen3_4b_ttt7 on Hugging Face, a draft model for EAGLE3 speculative decoding targeting the Qwen3-4B base model. EAGLE3 is the third generation of the EAGLE speculative decoding framework, which accelerates inference by having a lightweight draft model propose future tokens that the target model then verifies. The release is a narrow inference-optimization artifact with zero downloads and likes at time of indexing, suggesting it is very fresh or experimental.

Open Weights Progress Inference Economics Eagle3 DeepSeek V4 Qwen3-4B +2 more

3Deepseek·Jun 28, 2026·source ↗

DeepSeek releases EAGLE3 speculative decoding draft model for Qwen3-8B

DeepSeek published eagle3_qwen3_8b_ttt7 on Hugging Face, a draft model for EAGLE3 speculative decoding targeting the Qwen3-8B base model. EAGLE3 is the third generation of the EAGLE speculative decoding framework, which accelerates inference by having a lightweight draft head propose future tokens that the target model then verifies. The release is a narrow inference-optimization artifact with minimal engagement at time of indexing.

Inference Economics Eagle3 DeepSeek V4 Qwen3-4B +2 more

6arXiv · cs.LG·Jun 26, 2026·source ↗

RiVER framework enables RL training of LLMs on tasks without ground-truth solutions

Researchers introduce RiVER (Ranking-induced VERifiable framework), a reinforcement learning approach that trains LLMs on score-based optimization tasks using deterministic execution feedback as continuous rewards, without requiring ground-truth answers. The method addresses two failure modes in group-relative RL with continuous rewards—scale dominance and frequency dominance—via calibrated, instance-wise reward shaping. Applied to Qwen3-8B and GLM-Z1-9B-0414 on competitive programming tasks, RiVER improves ALE rating rank by ~9% and also transfers to exact-solution benchmarks (LiveCodeBench, USACO) with 2-4% absolute gains, unlike raw-score baselines. The result suggests score-based heuristic tasks can serve as general-purpose RL training environments for coding ability.

Evaluation and Benchmarking Alignment and RLHF USACO Qwen3-4B LiveCodeBench +3 more

6arXiv · cs.CL·Jun 25, 2026·source ↗

OPERA: Perplexity-based RL alignment for open-ended reasoning tasks

OPERA (Objective Perplexity-based Reflective Alignment) proposes replacing LLM-as-a-judge reward models with intrinsic rewards derived from perplexity dynamics to stabilize RL training on open-ended tasks like creative writing. The method includes a cold-start data synthesis pipeline generating 20,000 reasoning trajectories using perplexity-prioritized rollouts. Applied to Qwen3-8B, OPERA claims state-of-the-art among open-source models on open-ended tasks, reportedly matching or exceeding Gemini 2.5 and MiniMax-M2.5 on some benchmarks.

Open Weights Progress Alignment and RLHF OPERA Gemini 2.5 MiniMax +1 more

4arXiv · cs.CL·Jun 19, 2026·source ↗

STAGE pipeline generates source-grounded training data for text-to-JSON extraction

Researchers introduce STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline that uses LLMs to synthesize training data for structured extraction from long unstructured documents, validating outputs against underlying spreadsheets. Evaluated on STAGE-Eval, an 851-example benchmark, the pipeline substantially improves Qwen3-4B performance, raising exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%. The work targets a practical bottleneck in enterprise document processing: reliably converting financial filings and clinical records into machine-readable JSON.

Evaluation and Benchmarking Enterprise Deployment Patterns STAGE Qwen3-4B STAGE-Eval

4arXiv · cs.CL·Jun 18, 2026·source ↗

G-IdiomAlign: Gloss-pivoted benchmark for cross-lingual idiom alignment in LLMs

Researchers introduce G-IdiomAlign, a benchmark anchoring idioms via English glosses from Wiktionary to evaluate cross-lingual idiom equivalence in LLMs. The benchmark supports two evaluation protocols: a multiple-choice task with typed distractors and a gloss-contrastive generation task isolating the effect of explicit semantic pivots. Experiments across diverse LLMs find that literal translation bias is the dominant failure mode, especially for low-resource languages, and that gloss conditioning improves performance but leaves substantial headroom. Mechanistic analysis on Qwen3-8B suggests cross-condition differences are concentrated in attention heads rather than layers.

Evaluation and Benchmarking Qwen3-4B G-IdiomAlign Wiktionary

6arXiv · cs.CL·Jun 18, 2026·source ↗

DreamReasoner-8B: Block-size curriculum learning enables long-CoT reasoning in diffusion language models

Researchers introduce DreamReasoner-8B, an open-source block diffusion language model trained with a block-size curriculum learning strategy that gradually transitions from fine-grained to coarse-grained block sizes during training. The work identifies a critical failure mode: training with large block sizes severely degrades reasoning, while small block sizes preserve it. The proposed curriculum bridges this gap, achieving math and code reasoning performance competitive with Qwen3-8B while retaining the parallel decoding efficiency of block diffusion models. The model and code are publicly released.

Frontier Model Releases Open Weights Progress Qwen3-4B Block-Size Curriculum Learning for Diffusion Reasoning Models Block-Size Curriculum Learning +3 more

6arXiv · cs.CL·Jun 17, 2026·source ↗

Location metadata causes systematic geographic bias leakage in LLMs, even with 'Unknown' placeholders

Researchers evaluate 'location leakage' — the phenomenon where LLMs generate geographically biased outputs when exposed to location metadata in user profiles, even when prompts are geographically neutral. Across creative writing and Q&A tasks, leakage spikes up to 793x above baseline for models including Llama 3.1-8B, Qwen3-8B, and Claude Sonnet 4.6. A novel structural finding shows that replacing location with 'Unknown' still elevates leakage by up to 72x, indicating the user profile frame itself acts as a conditioning signal independent of geographic content. This has direct implications for AI systems that use user metadata for localization.

Evaluation and Benchmarking AI Safety Research Claude Sonnet 4 Alibaba Qwen3-4B +4 more

7arXiv · cs.CL·Jun 16, 2026·source ↗

Language models linearly encode a 'value axis' tracking expected goal success, study finds

Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.

AI Safety Research Alignment and RLHF The Value Axis: Language Models Encode Whether They're on the Right Track Direct Preference Optimization (DPO) Qwen3-4B

4arXiv · cs.LG·Jun 15, 2026·source ↗

Dual-adapter routing system improves knowledge editing precision in LLMs

A new arXiv paper introduces a route-specialized dual-adapter architecture for knowledge editing in LLMs, separating the concerns of writing edits (edit adapter) and suppressing them when irrelevant (locality adapter). A relevance router gates which adapter is applied, addressing the locality problem in memory-assisted editing. Evaluated on CounterFact, zsRE, and MQuAKE benchmarks using Llama-3.1-8B-Instruct and Qwen3-8B, the method achieves best-in-class probability-preference accuracy across all three datasets. Ablations show the gain comes from the architectural separation rather than increased parameter capacity.

Evaluation and Benchmarking Alignment and RLHF BGE Llama3-8B-Instruct Qwen3-4B +4 more

6arXiv · cs.CL·Jun 12, 2026·source ↗

HyperTool: Unified executable MCP-style interface reduces step-wise tool call overhead for LLM agents

HyperTool introduces a unified executable interface that allows LLM agents to invoke multiple tool calls within a single code block, hiding intermediate dataflow from the main reasoning trace. This addresses an 'execution-granularity mismatch' where step-wise atomic tool calls waste context and force models to manage low-level operations. On the MCP-Universe benchmark, HyperTool more than doubles accuracy for Qwen3-32B (15.69% → 35.29%) and Qwen3-8B (9.93% → 33.33%), outperforming GPT-OSS and Kimi-k2.5.

Inference Economics Agent and Tool Ecosystem GPT-OSS MCP-Universe HyperTool +4 more

6arXiv · cs.AI·Jun 12, 2026·source ↗

RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy

Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.

Evaluation and Benchmarking Alignment and RLHF RA-RFT Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning GRPO +3 more

6arXiv · cs.AI·Jun 10, 2026·source ↗

FADA: Unified vision-language model for fetal ultrasound interpretation deployable on consumer smartphones

FADA is a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation of fetal ultrasound images through a single pipeline without requiring external labels at inference. The system distills knowledge from four domain-specific foundation models using selective distillation, achieving 0.8820 mean Dice for segmentation and 0.7671 mAP@0.50 for detection, with expert validation confirming clinically acceptable outputs. Notably, the compressed 0.8B model runs entirely offline on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1) in approximately 60 seconds, targeting diagnostic access gaps in low- and middle-income countries where trained sonographers are scarce. Code, models, and data are publicly released.

Inference Economics Multimodal Progress USF-MAE FetalCLIP Qwen3-4B +4 more

4arXiv · cs.LG·Jun 8, 2026·source ↗

SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs

Researchers introduce SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning), a framework addressing catastrophic forgetting in LLMs via adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.

Evaluation and Benchmarking Open Weights Progress LLaMA-7B Qwen3-4B Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning +1 more

6Qwen·Jun 5, 2026·source ↗

Qwen releases Qwen3.5-9B multimodal model on Hugging Face

Qwen has released Qwen3.5-9B, a 9-billion parameter image-text-to-text model, on Hugging Face. The model supports conversational use cases and is compatible with Azure deployment endpoints. With over 9 million downloads and 1,500+ likes, it has seen substantial community uptake.

Frontier Model Releases Open Weights Progress Microsoft Azure Qwen3-4B Qwen +2 more

6Qwen·Jun 5, 2026·source ↗

Qwen releases Qwen3.5-4B multimodal model on Hugging Face

Qwen has released Qwen3.5-4B, a 4-billion parameter image-text-to-text model, on Hugging Face. The model supports conversational use and is compatible with Azure deployment endpoints. With over 10 million downloads and 604 likes, it has seen substantial community uptake.

Open Weights Progress Multimodal Progress Microsoft Azure Qwen3-4B Qwen +1 more

6arXiv · cs.CL·Jun 3, 2026·source ↗

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking AI Safety Research Qwen3-4B Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models +1 more

6The Batch·Jun 3, 2026·source ↗

Qwen3.5 Small tops mobile-sized open models; GPT-5.3 Instant, Gemini 3.1 Flash-Lite, Claude memory import, and LLM deanonymization research

Alibaba released the Qwen3.5 Small model series (0.8B–9B parameters) with a hybrid Gated Delta Networks + sparse MoE architecture, with the 9B model outperforming OpenAI's gpt-oss-120B on GPQA Diamond despite being 13.5x smaller; all weights are Apache 2.0 licensed. Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model at $0.25/M input tokens with 2.5x faster TTFT than Gemini 2.5 Flash. OpenAI released GPT-5.3 Instant targeting conversational quality improvements and hallucination reduction, while Anthropic added memory import/export functionality across all Claude tiers. Separately, researchers from MATS, Anthropic, and ETH Zurich demonstrated that LLM-based pipelines can deanonymize pseudonymous online users at 68% recall/90% precision for $1–4 per profile.

Frontier Model Releases Open Weights Progress Claude Google Alibaba +14 more

7arXiv · cs.CL·Jun 3, 2026·source ↗

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen2.5-7B GRPO Qwen3-4B +7 more

7The Batch·Jun 2, 2026·source ↗

Alibaba releases Qwen3.5 open-weights vision-language model family with MoE architecture across eight sizes

Alibaba released the Qwen3.5 family of eight open-weights vision-language models ranging from 0.8B to 397B parameters, built on a mixture-of-experts architecture with mixed attention and Gated DeltaNet layers. The flagship Qwen3.5-397B-A17B outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, while the 9B model surpasses OpenAI's gpt-oss-120B on most language tasks. Open weights are available under Apache 2.0, with hosted agentic variants (Qwen3.5-Plus, Qwen3.5-Flash) available via Alibaba Cloud. The release is notable for strong small-model efficiency and comes amid reported team departures following the Qwen3 rollout.

Frontier Model Releases Open Weights Progress GPT-5.2 Alibaba Cloud Model Studio Claude Opus 4.6 +10 more

7The Batch·Jun 2, 2026·source ↗

Recursive Language Models Offer Path To Dramatically Expand Beyond the Context Window

MIT researchers Alex L. Zhang, Tim Kraska, and Omar Khattab propose Recursive Language Models (RLMs), a framework that offloads long-context processing to an external Python REPL environment, allowing models to programmatically fetch and manage text chunks via code generation. The root model spawns submodel instances to handle subtasks, aggregating their outputs recursively. Evaluated on benchmarks requiring reasoning over documents up to 11 million tokens, RLMs substantially outperform both base models and competing agentic strategies such as retrieval and summarization agents. For example, RLM-GPT-5 achieved 91.3% on BrowseComp+ versus GPT-5's inability to produce an answer, and ~50% accuracy on OOLONG-PAIRS at 1 million tokens versus near-zero for baseline approaches.

Long Context Evolution Evaluation and Benchmarking MIT OOLONG-PAIRS Tim Kraska +9 more

7arXiv · cs.CL·Jun 2, 2026·source ↗

AdaCodec: Predictive Visual Coding for Efficient Video MLLMs

AdaCodec introduces a predictive visual code interface for video multimodal large language models that exploits temporal redundancy in video. Instead of encoding every sampled frame as an independent RGB image, it sends full visual tokens only for reference frames with high conditional predictive cost, and encodes inter-frame changes as compact P-tokens. Evaluated against a Qwen3-VL-8B per-frame baseline across eleven benchmarks, AdaCodec at 1/7 the token budget (32k vs 224k tokens) surpasses the baseline on all long-video benchmarks while reducing time-to-first-token from 9.26s to 1.62s.

Long Context Evolution Frontier Model Releases Multimodal Large Language Models Qwen3-4B predictive visual code +4 more

6arXiv · cs.CL·May 29, 2026·source ↗

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking Alignment and RLHF LoMo LLaVA-OneVision-1.5-8B Qwen3-4B +3 more

7arXiv · cs.CL·May 28, 2026·source ↗

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.

Frontier Model Releases Agent and Tool Ecosystem AXPO GRPO Thinking-Acting Gap +4 more

6arXiv · cs.CL·May 27, 2026·source ↗

Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference

PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.

Long Context Evolution Inference Economics On-Policy Distillation (OPD) Multi-Token Prediction (MTP) speculative decoding +7 more

6arXiv · cs.AI·May 27, 2026·source ↗

BRANE: Natural Language Query-to-Configuration Selection for Retrieval Agents

BRANE is a system that dynamically selects retrieval agent pipeline configurations (LLM, retriever, number of hops, synthesis strategy) at inference time based on per-query characteristics and a cost-quality target. It uses an LLM to extract workload features from each query, then applies lightweight per-configuration predictors to estimate correctness, selecting the configuration that maximizes predicted accuracy penalized by cost. Evaluated on MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE matches best-fixed-configuration accuracy at up to 89% lower cost and outperforms LLM-routing and fine-tuned Qwen3-4B baselines. The work frames per-query pipeline configuration as a practical alternative to static workload-level tuning.

Evaluation and Benchmarking Inference Economics BrowseComp-Plus MuSiQue Qwen3-4B +4 more
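
The BRANE summary above describes choosing the configuration that maximizes predicted accuracy penalized by cost. A minimal sketch of that kind of selection rule follows. The linear penalty, the `Config` class, the `select_config` function, and the `cost_weight` parameter are all illustrative assumptions, not taken from the paper, which may define its objective differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical pipeline configuration (names and fields are illustrative)."""
    name: str
    cost: float  # assumed per-query cost, in arbitrary units


def select_config(predicted_accuracy, configs, cost_weight):
    """Pick the config with the highest cost-penalized predicted accuracy.

    predicted_accuracy maps a config name to an estimated probability of a
    correct answer for the current query. The linear penalty
    (accuracy - cost_weight * cost) is an assumption made for this sketch.
    """
    return max(
        configs,
        key=lambda c: predicted_accuracy[c.name] - cost_weight * c.cost,
    )


configs = [Config("small-1hop", cost=0.1), Config("large-3hop", cost=1.0)]
predicted = {"small-1hop": 0.6, "large-3hop": 0.8}

# A low cost weight favors accuracy; a high one favors the cheaper pipeline.
quality_first = select_config(predicted, configs, cost_weight=0.1)
budget_first = select_config(predicted, configs, cost_weight=0.5)
```

With these made-up numbers, raising `cost_weight` flips the choice from the larger pipeline to the cheaper one, which is the per-query cost-quality trade-off the summary refers to.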

6arXiv · cs.CL·May 26, 2026·source ↗

Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

This paper investigates why NLI-based claim checkers used as process rewards in RL-trained medical RAG agents succeed or fail during training. The authors find that a checker's output distribution during training—not its held-out accuracy—determines whether it provides useful gradient signal, with LLM log-probability scoring causing near-total signal collapse (97%+ neutral labels) while a calibrated MedNLI classifier avoids this. A key finding is that stronger checkers can trigger reward hacking cascades (ultra-short answers, search avoidance, language collapse), while moderate-signal local classifiers yield better final model quality (+12% BERTScore over zero-shot). The work frames these as boundary conditions for verifier-as-reward systems in RLVR pipelines.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen2.5-7B GRPO reward hacking +8 more

5arXiv · cs.CL·May 26, 2026·source ↗

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

This paper investigates uncertainty quantification (UQ) for activation oracles—systems that make LLM internal activations human-legible—by evaluating 6 confidence estimation methods across 6,000 samples per oracle. The authors find that bootstrap mode frequency achieves the best calibration (ECE 5.7% vs. 25.5% for log-probability baseline on Qwen3-8B), while the log-prob baseline remains useful as a cheap triage signal. Experiments vary verbalizer and context prompts across two Qwen3 model sizes. Code and a patched trainer are released publicly.

Evaluation and Benchmarking AI Safety Research Expected Calibration Error Activation Oracles Qwen3-4B +4 more
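
The calibration figures in the summary above are Expected Calibration Error (ECE) values. As a rough illustration of the metric, the sketch below computes the standard equal-width-bin ECE: the sample-weighted average gap between mean confidence and accuracy within each bin. The choice of 10 bins and the function name are assumptions; the paper's exact binning scheme is not stated in the summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1] and boolean outcomes."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each sample to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four answers given at 90% confidence, three of them right: the gap is 0.15.
example_ece = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A lower value means stated confidence tracks observed accuracy more closely, which is the sense in which the summary calls one method better calibrated than another.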

6arXiv · cs.AI·May 25, 2026·source ↗

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

Agent and Tool Ecosystem Alignment and RLHF Reasoning Enhancement Qwen3-4B ETCHR +5 more

5arXiv · cs.CL·May 21, 2026·source ↗

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

Frontier Model Releases Evaluation and Benchmarking RLVR ROUGE-L AIME24 +10 more
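
For context on what LamPO replaces, the sketch below shows the scalar group-relative advantage as GRPO is commonly described: each sampled response's reward is normalized by the mean and standard deviation of its group. This illustrates the baseline only, not LamPO's Pairwise Decomposed Advantage. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread.

    eps keeps the division defined when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes give positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward yields all-zero
# advantages, so that group contributes no learning signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0])
```

The uniform case is one reason sparse binary rewards can stall this style of training, and it matches the sparsity concern the summary mentions as motivation for the dense auxiliary reward.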

4Hugging Face Blog·May 19, 2026·source ↗

Accelerating Qwen3-8B Agent on Intel Core Ultra with Depth-Pruned Draft Models

Hugging Face and Intel demonstrate speculative decoding acceleration for the Qwen3-8B model on Intel Core Ultra client hardware using depth-pruned draft models. The approach applies structured pruning to create a smaller draft model that enables speculative decoding, targeting on-device agent workloads. This work addresses inference efficiency for mid-size open-weight models on consumer-grade x86 silicon.

Open Weights Progress Inference Economics speculative decoding Qwen3-4B Hugging Face +4 more

6arXiv · cs.CL·May 19, 2026·source ↗

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

This paper introduces IH-GRPO, a reinforcement learning algorithm that decouples tool invocation from immediate execution during LLM reasoning, addressing the coherence disruption caused by tight coupling in existing tool-integrated reasoning (TIR) approaches. The authors propose a hierarchical control framework and derive a surrogate loss enabling an implicitly hierarchical policy to match the behavior of an explicit hierarchical policy. Experiments on Qwen3 models (1.7B, 4B, 8B) show absolute improvements of 1.87–2.53% across six out-of-domain mathematical reasoning benchmarks over the strongest baseline. Code is publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem GRPO Tool-Integrated Reasoning Qwen3-4B +3 more