Entity · model

Qwen3 32B

model · active · qwen3-32b-d905d69c · 7 events · first seen Jun 1, 2026

Aliases: Qwen3 32B, Qwen3-VL-32B, Qwen3-32B

More like this (12)

Qwen 2.5 32B Instruct, Qwen 3.5 27B, Qwen32, Qwen3-30B, Qwen1.5-32B, Qwen3-14B, Qwen3.5 Small, Qwen3, Qwen2.5-3B, Qwen 3.7 Max, Qwen3.5-122B, Qwen3-4B

Recent events (7)

6 · arXiv · cs.CL · 3d ago

Byte-Prefix Marginalization enables cross-tokenizer on-policy distillation between heterogeneous LLM families

A new arXiv preprint introduces Byte-Prefix Marginalization (BPM), a technique for distilling knowledge from teacher LLMs into a student model when the two use different tokenizers. BPM re-expresses the teacher's next-token distribution in a shared byte space, preserving probability mass and producing a vocabulary-complete alignment target. Evaluated with Qwen3-32B, GLM-Z1-9B-0414, and MiniMax-M2.7 as teachers, BPM improves six-benchmark average scores by 3.7–6.6 points over the strongest cross-tokenizer baselines on math and programming tasks. The method addresses a practical bottleneck in consolidating complementary open-weight models into compact students.

Evaluation and Benchmarking · Open Weights Progress · Byte-Prefix Marginalization · MiniMax · Qwen3 32B · +1 more
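The BPM summary above says the teacher's next-token distribution is re-expressed in a shared byte space while preserving probability mass. The toy sketch below illustrates only the simplest version of that idea: collapsing a token distribution into a distribution over the first byte of the next token. The function name `first_byte_marginal` and the four-token vocabulary are hypothetical, and the preprint's actual procedure (longer prefixes, token boundaries, the distillation loss) is not reproduced here.

```python
from collections import defaultdict


def first_byte_marginal(token_probs):
    """Collapse a next-token distribution into a next-byte distribution.

    token_probs maps token strings to probabilities. Every token that starts
    with the same UTF-8 byte contributes its mass to that byte, so the total
    probability mass is unchanged.
    """
    byte_probs = defaultdict(float)
    for token, p in token_probs.items():
        data = token.encode("utf-8")
        if not data:
            # An empty token has no first byte; refuse it rather than drop mass.
            raise ValueError("empty token has no byte prefix")
        byte_probs[data[0]] += p
    return dict(byte_probs)


# Hypothetical teacher distribution over a tiny vocabulary.
teacher = {"the": 0.5, "then": 0.2, "a": 0.2, "an": 0.1}
marginal = first_byte_marginal(teacher)

# Byte 116 ('t') collects 0.5 + 0.2; byte 97 ('a') collects 0.2 + 0.1.
for byte, p in sorted(marginal.items()):
    print(chr(byte), round(p, 3))
```

Because bytes are common to every tokenizer, a student with a different vocabulary can be marginalized the same way and compared against this target, which is the intuition behind a tokenizer-independent alignment space.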

5 · arXiv · cs.CL · Jul 22, 2026

MedDDC-Eval: Diagnosis-Decoupled Evaluation Framework for Multi-Turn Medical Consultation Agents

Researchers introduce MedDDC-Eval, a benchmark framework that decouples history-elicitation quality from diagnosis generation in multi-turn medical consultation agents, using a shared frozen reader to hold the history-to-diagnosis mapping constant. The paper demonstrates that varying only the diagnostic reader shifts diagnosis F1 by 2.2–19.0 points and reverses 18–36% of pairwise policy orderings, exposing confounds in coupled evaluation. The authors also apply Group Relative Policy Optimization (GRPO) to post-train Qwen3-32B using diagnosis-result and trajectory feedback, achieving 9.7 and 4.6 total-score point improvements on two evaluation splits. The work addresses a methodological gap in evaluating medical dialogue agents by enabling controlled attribution of policy quality.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen3 32B · GRPO (Group Relative Policy Optimization) · MedDDC-Eval
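The MedDDC-Eval summary mentions post-training with Group Relative Policy Optimization. As a rough illustration of the "group relative" part, the sketch below standardizes the rewards of several sampled responses to the same prompt so that each response is scored against its own group. The reward values are made up, the helper name `group_relative_advantages` is hypothetical, and the use of population standard deviation is an assumption; the paper's reward design (diagnosis-result and trajectory feedback) is not modeled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded only for beating its siblings, not in absolute terms.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total scores for four consultations sampled from one case.
rewards = [1.0, 0.0, 0.5, 0.5]
for r, a in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

The `eps` term keeps the division defined when every response in a group receives the same reward, in which case all advantages come out as zero.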

7 · arXiv · cs.LG · Jul 3, 2026

Program-as-Weights: compiling natural-language specs into compact neural adapters for local execution

Researchers introduce 'fuzzy-function programming' and its instantiation Program-as-Weights (PAW), a paradigm where a 4B compiler model converts natural-language function specifications into parameter-efficient adapters for a frozen lightweight interpreter. A 0.6B Qwen3 interpreter running PAW programs matches the performance of direct prompting with Qwen3-32B while using ~1/50th the inference memory and running at 30 tokens/s on a MacBook M3. The approach reframes large foundation models as one-time 'tool builders' rather than per-input solvers, targeting tasks like log triage, JSON repair, and intent-based ranking that resist rule-based implementation. The authors also release FuzzyBench, a 10M-example training dataset.

Open Weights Progress · Inference Economics · Qwen3.5-0.8B · Program-as-Weights · FuzzyBench · +4 more
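The "~1/50th the inference memory" figure in the PAW summary is roughly what a back-of-the-envelope weight count gives. The sketch below assumes nominal parameter counts of 32B and 0.6B, 16-bit weights, and weights-only memory; it ignores the KV cache, activations, and the adapter itself, so it is a plausibility check and not the paper's measurement.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed nominal sizes; real checkpoints differ slightly.
large = weight_memory_gb(32e9)   # Qwen3-32B, prompted directly
small = weight_memory_gb(0.6e9)  # 0.6B interpreter running a PAW program

print(f"32B weights:  {large:.1f} GB")
print(f"0.6B weights: {small:.1f} GB")
print(f"ratio:        {large / small:.1f}x")
```

Under these assumptions the ratio is simply 32 / 0.6, a little over 53, which is in line with the reported figure of about 50.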

6 · arXiv · cs.AI · Jun 24, 2026

OpenThoughts-Agent: Open data curation pipeline for broadly capable agentic models

The OpenThoughts-Agent (OT-Agent) project releases a fully open data curation pipeline for training agentic language models, addressing the gap left by prior efforts (SWE-Smith, SERA, Nemotron-Terminal) that target single benchmarks. The team conducts over 100 controlled ablation experiments and assembles a 100K-example training set, fine-tuning Qwen3-32B to achieve 44.8% average accuracy across seven agentic benchmarks, a 3.9-percentage-point improvement over the strongest existing open agentic model (Nemotron-Terminal-32B at 40.9%). Training data, pipeline, experimental data, and models are publicly released at openthoughts.ai.

Evaluation and Benchmarking · Open Weights Progress · Nemotron-Terminal-32B · SWE-Smith · SERA · +4 more

6 · arXiv · cs.CL · Jun 12, 2026

HyperTool: Unified executable MCP-style interface reduces step-wise tool call overhead for LLM agents

HyperTool introduces a unified executable interface that allows LLM agents to invoke multiple tool calls within a single code block, hiding intermediate dataflow from the main reasoning trace. This addresses an 'execution-granularity mismatch' where step-wise atomic tool calls waste context and force models to manage low-level operations. On the MCP-Universe benchmark, HyperTool more than doubles accuracy for Qwen3-32B (15.69% → 35.29%) and more than triples it for Qwen3-8B (9.93% → 33.33%), outperforming GPT-OSS and Kimi-k2.5.

Inference Economics · Agent and Tool Ecosystem · GPT-OSS · MCP-Universe · HyperTool · +4 more
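The contrast the HyperTool summary draws, step-wise tool calls versus several calls composed in one code block, can be sketched with a few toy tools. Everything below is hypothetical (the tools `list_files`, `read_file`, `count_matches`, the in-memory `FILES`, and the two runner functions); it is not HyperTool's interface, only an illustration of why composing calls keeps intermediate results out of the agent's trace.

```python
# Hypothetical in-memory "file system" standing in for real tool backends.
FILES = {
    "a.log": "ok\nerror: disk\nok",
    "b.log": "error: net\nerror: net",
    "notes.txt": "todo",
}


def list_files(suffix):
    return sorted(name for name in FILES if name.endswith(suffix))


def read_file(name):
    return FILES[name]


def count_matches(text, needle):
    return sum(1 for line in text.splitlines() if needle in line)


def run_stepwise():
    """One tool call per step: every intermediate value lands in the trace."""
    trace = []
    names = list_files(".log")
    trace.append(names)
    total = 0
    for name in names:
        text = read_file(name)
        trace.append(text)
        n = count_matches(text, "error")
        trace.append(n)
        total += n
    trace.append(total)
    return total, trace


def run_code_block(code):
    """Run one model-written block; only `result` is surfaced to the trace.

    Note: exec is not a sandbox. A real system would need proper isolation.
    """
    namespace = {
        "list_files": list_files,
        "read_file": read_file,
        "count_matches": count_matches,
    }
    exec(code, namespace)
    return namespace["result"], [namespace["result"]]


BLOCK = """
result = sum(
    count_matches(read_file(name), "error") for name in list_files(".log")
)
"""

stepwise_total, stepwise_trace = run_stepwise()
block_total, block_trace = run_code_block(BLOCK)
print(stepwise_total, len(stepwise_trace))
print(block_total, len(block_trace))
```

Both paths compute the same total, but the step-wise trace carries every file listing, file body, and partial count, while the single block contributes one entry.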

6 · arXiv · cs.CL · Jun 10, 2026

HiViG: History-aware visually grounded critic improves computer use agents across GUI benchmarks

Researchers introduce HiViG, a test-time framework for Computer Use Agents that addresses two weaknesses in existing critic models: short-sighted decision loops and lack of visual grounding. The system trains a multimodal critic on real GUI trajectories to maintain a compact macro-action history and verify execution coordinates against live screenshots before action execution. Evaluated on web, mobile, and desktop benchmarks, HiViG improves average success rates by 5.8% over the strongest baseline with Qwen3-VL-32B and 9.0% with Gemini-3-Flash, with both history and grounding components shown to be independently necessary.

Evaluation and Benchmarking · Agent and Tool Ecosystem · HiViG · A History-Aware Visually Grounded Critic for Computer Use Agents · Gemini 3 Flash · +2 more
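The two checks the HiViG summary describes, keeping a compact action history and verifying coordinates against the current screen before acting, can be mimicked with a small rule-based stand-in. HiViG itself trains a multimodal critic; the `Box` and `Critic` classes, the repeated-action rule, and the bounding boxes below are all hypothetical and only show the control flow of a pre-execution gate.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Box:
    """A UI element assumed to have been detected on the live screenshot."""
    label: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


class Critic:
    """Rule-based stand-in for a learned critic (not the paper's model)."""

    def __init__(self, max_history=5):
        # Bounded history keeps only the most recent approved actions.
        self.history = deque(maxlen=max_history)

    def review(self, action, target_label, x, y, boxes):
        # Grounding check: the click must land inside the intended element.
        grounded = any(
            b.label == target_label and b.contains(x, y) for b in boxes
        )
        # History check (assumed rule): reject an immediate repeat.
        repeated = bool(self.history) and self.history[-1] == action
        approved = grounded and not repeated
        if approved:
            self.history.append(action)
        return approved


critic = Critic()
screen = [Box("Submit", 100, 200, 180, 230)]
print(critic.review("click Submit", "Submit", 120, 210, screen))
print(critic.review("click Submit", "Submit", 120, 210, screen))
print(critic.review("click Submit again", "Submit", 10, 10, screen))
```

The first call passes both checks, the second is rejected as an immediate repeat, and the third is rejected because the coordinates fall outside the target element.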

6 · The Batch · Jun 1, 2026

Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks

Researchers from MATS, Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They developed 'activation capping,' an inference-time method that corrects deviations from this axis when similarity falls below a threshold. Testing on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B showed harmful response rates to jailbreak prompts dropped by roughly half (e.g., 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks that bypass system prompts by manipulating a model's internal representational state.

Evaluation and Benchmarking · AI Safety Research · Gemma 2 9B · assistant axis · Llama 3.1 70B · +12 more
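The activation-capping summary describes correcting an activation when its alignment with the assistant axis falls below a threshold. One simple way to express that idea is to measure the activation's projection onto the axis and, if it is too low, shift the activation along the axis until it reaches the threshold. The sketch below does exactly that on plain Python lists. Using the projection as the "similarity" measure, the function name `cap_activation`, and the example numbers are assumptions; the researchers' actual formulation may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cap_activation(hidden, axis, threshold):
    """Raise the projection of `hidden` onto `axis` to at least `threshold`.

    Components orthogonal to the axis are left untouched; activations that
    already meet the threshold are returned unchanged (as a copy).
    """
    norm = math.sqrt(dot(axis, axis))
    unit = [a / norm for a in axis]
    projection = dot(hidden, unit)
    if projection >= threshold:
        return list(hidden)
    shift = threshold - projection
    return [h + shift * u for h, u in zip(hidden, unit)]


# Hypothetical 2-D example: the axis points along the first coordinate.
axis = [2.0, 0.0]
drifted = [0.2, 3.0]    # weak alignment with the axis
on_track = [1.5, -1.0]  # already above the threshold

print(cap_activation(drifted, axis, threshold=1.0))
print(cap_activation(on_track, axis, threshold=1.0))
```

Only the drifted vector is modified, and only in its first coordinate, which reflects the appeal of an inference-time fix: it intervenes when the persona signal weakens and otherwise leaves the model's computation alone.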