Entity · model

Qwen2.5-7B-Instruct-1M

model · active · qwen2-5-7b-instruct-1m-be09a203 · 13 events · first seen May 18, 2026

Aliases: Qwen2.5-7B-Instruct-1M, Qwen2.5-14B-Instruct-1M, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, qwen2.5-14B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2-VL 7B Instruct, Qwen2.5-72B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen2.5-0.5B-Instruct, Qwen2-VL-2B-Instruct

More like this (12)

Qwen2.5-VL-32B-Instruct, Qwen2.5-Coder-32B-Instruct, Qwen3-4B-Instruct, Qwen3-30B-A3B-Instruct, Qwen3-Coder-480B-A35B-Instruct, Qwen2-Audio-7B-Instruct, Qwen2-Math-Instruct, Qwen 2.5-7B, Qwen1.5-32B, Qwen2.5-7B, Qwen1.5-MoE-A2.7B, Qwen2.5-8B

Recent events (13)

4 · arXiv · cs.CL · 44h ago · source ↗

Linear readouts of LLM hidden states decode causal reasoning about diagnostic evidence

Researchers introduce a paired-prompt benchmark testing whether language models can correctly match diagnostic evidence to causal claims that vary by population, estimand, or identifying assumptions — a task where surface-level cues can mislead. Using linear probes on final-token hidden states from Qwen2.5-7B, Qwen3-8B, and Llama-3.1-8B, they find balanced accuracy of 0.654–0.659 on a 49-pair benchmark spanning nine diagnostic families, exceeding permutation nulls and text-only baselines. The key finding is that hidden states contain linearly decodable information about causal relevance that is not fully captured by output logits or surface features.

Evaluation and Benchmarking · Qwen2.5-7B-Instruct-1M · Llama3-8B-Instruct · Same Evidence, Different Target: Decoding How Diagnostic Evidence Bears on Causal Questions from Language-Model States · +1 more
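
The summary above reports balanced accuracy rather than plain accuracy. As a rough illustration of why that matters for a probe, the sketch below computes both on hypothetical toy labels (not data from the paper); the function name `balanced_accuracy` and the labels are assumptions made for this example. Balanced accuracy is the mean of per-class recall, so a probe that always predicts the majority class scores 0.5 however skewed the classes are.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; chance level is 0.5 for two classes."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of all examples whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 8 positives, 2 negatives,
# and a degenerate probe that always predicts class 1.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(plain_acc, bal_acc)
```

On this toy input plain accuracy looks respectable while balanced accuracy sits at chance, which is why scores such as 0.654–0.659 are read against a 0.5 baseline.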

5 · arXiv · cs.CL · 4d ago · source ↗

Small VLMs have usable internal confidence signals that their verbalized outputs fail to express

A new arXiv paper evaluates two small open-weight vision-language models (Qwen2-VL-2B-Instruct and SmolVLM-Instruct) under six realistic image degradations, comparing verbalized confidence against internal token probability as uncertainty signals. Verbalized confidence is found to be nearly constant and near-chance at error detection (AUROC ~0.50), while internal token probability reliably separates correct from incorrect answers (AUROC 0.92–0.99). Both signals fail under severe underexposure, where accuracy collapses but confidence barely moves. The authors conclude that internal probability is the superior deferral signal for constrained deployments, and that small VLMs encode self-knowledge they cannot articulate.

Evaluation and Benchmarking · AI Safety Research · SmolVLM-Instruct · Qwen2.5-7B-Instruct-1M · Small Vision-Language Models Know When They Are Wrong But Cannot Say So
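
To make the AUROC comparison in this item concrete, here is a minimal sketch of AUROC used as an error-detection metric: the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one, with ties counted as half. The scores below are hypothetical toy values, not results from the paper, and the helper name `auroc` is an assumption for illustration.

```python
def auroc(scores, correct):
    """Pairwise (Mann-Whitney) AUROC: P(score_correct > score_incorrect)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    # Each (correct, incorrect) pair scores 1 for a win, 0.5 for a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: whether each answer was right, plus two signals.
correct = [True, True, True, False, False, True]
verbalized = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9]         # constant stated confidence
token_prob = [0.95, 0.90, 0.85, 0.40, 0.55, 0.60]   # internal probability

print(auroc(verbalized, correct), auroc(token_prob, correct))
```

A constant signal produces only ties and therefore lands at exactly 0.5, which matches the "nearly constant and near-chance" description of verbalized confidence.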

5 · arXiv · cs.AI · Jul 20, 2026 · source ↗

Muon optimizer shows large gains over AdamW in sparse-reward agentic RL on ALFWorld

A new arXiv preprint investigates the Muon optimizer for reinforcement learning post-training of language model agents, comparing it to AdamW on the ALFWorld benchmark using Qwen2.5-0.5B-Instruct. Under Group-in-Group Policy Optimization (GiGPO), applying Muon to hidden weight matrices raises validation success from 0.290 to 0.546 (+88%), with further gains at lower learning rates reaching 0.901 success. The results are exploratory (single-seed, single-task) but suggest that optimizer choice, advantage estimator, and learning rate interact significantly in agentic RL settings.

Agent and Tool Ecosystem · Alignment and RLHF · ALFWorld · GRPO · Qwen2.5-7B-Instruct-1M · +4 more
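
The "+88%" in this item is a relative gain, not a change in percentage points. A short check of the arithmetic, using only the two success rates quoted in the summary (the variable names are illustrative):

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


adamw, muon = 0.290, 0.546  # validation success rates quoted in the summary
gain = relative_gain(adamw, muon)
print(f"absolute: {muon - adamw:+.3f}, relative: {gain:+.0%}")
# prints "absolute: +0.256, relative: +88%"
```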

4 · arXiv · cs.CL · Jul 14, 2026 · source ↗

Token probability measurements reveal production-perception asymmetry in LLMs

A new arXiv preprint investigates whether LLMs exhibit a functional analog to the psycholinguistic production-perception distinction, using direct token probability measurements rather than metalinguistic prompting. Using Llama-3.1-8B and four other open-weight models, the authors find that production-perception prompt distances consistently exceed production-production distances by a ratio of ~1.8, with near-ceiling correlations in the production-production control confirming the effect is specific to communicative framing. The effect replicates across five models spanning base and instruction-tuned variants, and temporal analysis shows perception prompts exert strongest influence at sequence beginnings. The findings suggest prompt framing alone induces a production-perception distinction in decoder-only architectures.

Evaluation and Benchmarking · Gemma 2 9B · Qwen2.5-7B-Instruct-1M · Mistral 7B Instruct v0.2 · +2 more

5 · arXiv · cs.CL · Jul 13, 2026 · source ↗

Test-time scaling for small VLMs on multilingual visual MCQ: conditions matter more than methods

A new arXiv paper examines whether test-time scaling (TTS) transfers to small open vision-language models using EXAMS-V, a multilingual visual multiple-choice benchmark. The study compares self-consistency, describe-then-reason with PRM-guided beam search, and post-hoc selectors across Qwen2.5-VL-7B-Instruct and Qwen3.5-4B. Key findings: prompt parseability and decoding budget (token limit) dominate gains, while elaborate search/verification methods like PRM-guided beam search underperform plain majority vote at 8x the cost. The best configuration achieves 84.1% on ImageCLEF 2026 test split, ranking first on the Visual MCQ leaderboard.

Evaluation and Benchmarking · Inference Economics · ImageCLEF 2026 · Test-Time Scaling for Small VLMs on Multilingual Visual MCQ · Qwen2.5-7B-Instruct-1M · +3 more
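
The "plain majority vote" baseline in this item (self-consistency) can be sketched in a few lines. This is an illustrative version rather than the paper's implementation: the sampled answers are hypothetical, and `None` stands in for a generation whose answer letter could not be parsed, which is one way parseability can limit the effective number of votes.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent parsed answer, or None if nothing parsed."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns [(answer, count)] for the top answer.
    return Counter(valid).most_common(1)[0][0]


# Hypothetical sample of 8 generations for one multiple-choice question;
# two of them failed to yield a parseable option letter.
samples = ["B", "A", "B", None, "B", "C", None, "B"]
final = majority_vote(samples)
print(final)
```

Here only six of the eight samples contribute votes, so improving output formatting raises the effective sample count without any extra decoding cost.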

5 · arXiv · cs.CL · Jul 8, 2026 · source ↗

DynaKRAG: Learnable state-conditioned control policy for multi-hop RAG evidence acquisition

DynaKRAG is a new framework that formulates multi-hop retrieval-augmented generation as a state-conditioned control problem over atomic evidence operations (iterative retrieval, query reformulation, sufficiency judging, etc.), using a learned controller to select among valid operations at each step. Evaluated with Qwen2.5-7B-Instruct, it achieves F1 scores of 0.5998 on HotpotQA, 0.5340 on 2WikiMultiHopQA, and 0.3061 on MuSiQue, outperforming the strongest baselines on all three benchmarks. Ablations show that replacing the learned controller with a uniform policy costs 3.96–5.78 F1 points, and that additional retrieval is not uniformly beneficial.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MuSiQue · 2WikiMultiHopQA · Qwen2.5-7B-Instruct-1M · +2 more

3 · arXiv · cs.CL · Jul 7, 2026 · source ↗

Schwartz-Geometry Decoding improves coherence in human value detection without sacrificing F1

A new arXiv preprint proposes injecting the circular Schwartz value continuum as an output-space geometry into multi-label classifiers for human value detection. The authors compare training-time geometry-aware objectives against a post-hoc energy decoder on a DeBERTa-v3-base model, finding that the decoder improves label-set coherence with the theoretical continuum without degrading Macro-F1 or Micro-F1. Training-time geometry injection yields only marginal gains, no better than a random ordering. A Qwen2.5-72B-Instruct diagnostic shows that supplying the continuum at inference shifts behavior but does not match supervised structured prediction.

Evaluation and Benchmarking · Qwen2.5-7B-Instruct-1M · Beyond Independent Labels: Schwartz-Geometry Decoding for Human Value Detection · DeBERTa-v3

5 · arXiv · cs.CL · Jun 26, 2026 · source ↗

Training framework reduces calibration error 60%+ in Medical VQA multimodal LLMs

A new arXiv preprint proposes a finetuning framework to improve verbalized uncertainty calibration in multimodal LLMs applied to Medical Visual Question Answering. The composite loss function combines Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL-based stabilization, evaluated on MedGemma 4B IT and Qwen2-VL 7B Instruct across three medical VQA benchmarks. The method reduces calibration error by 60% or more and improves discrimination by 26% or more while preserving predictive accuracy, outperforming prompting-, sampling-, and training-based baselines.

Evaluation and Benchmarking · AI Safety Research · Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA · Qwen2.5-7B-Instruct-1M · MedGemma 4B IT · +1 more

6 · arXiv · cs.CL · Jun 23, 2026 · source ↗

BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories

BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BabelJudge · Qwen2.5-7B-Instruct-1M · Shreyaskc

5 · arXiv · cs.CL · Jun 23, 2026 · source ↗

LIHA reveals first-token broadcaster heads as mechanistic source of language identity in transformers

Researchers introduce Language Identity Head Ablation (LIHA), a causal intervention that zeros individual attention heads to measure language-switching behavior across 2,700 prompt-language pairs in seven languages. Applied to GPT-2, LIHA identifies a small set of 'first-token broadcaster' heads that propagate language identity signals throughout generation, with compensatory redistribution following a hierarchical, feedforward pattern. A controlled comparison between Qwen2.5-1.5B-Base and Qwen2.5-1.5B-Instruct provides direct causal evidence that instruction tuning reorganizes language identity circuits toward early-layer localization. The findings offer mechanistic grounding for why multilingual models generate in the wrong language and why this is difficult to correct.

Evaluation and Benchmarking · Alignment and RLHF · First-Token Broadcasters: Mechanistic Origins of Language Identity and Distributed Robustness in Transformers · Language Identity Head Ablation · Qwen2.5-7B-Instruct-1M · +2 more

6 · arXiv · cs.CL · May 26, 2026 · source ↗

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency 19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across the GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and its pre-registered replication.

Evaluation and Benchmarking · AI Safety Research · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence · +5 more

6 · arXiv · cs.CL · May 26, 2026 · source ↗

Peak-Then-Collapse: RLVR Tool-Use Failures on Knowledge-Graph APIs

This paper investigates RLVR-based tool-use training (GRPO on Qwen2.5-7B-Instruct) on a minimal knowledge-graph API (Freebase over Complex WebQuestions) and documents a 'peak-then-collapse' pattern where tool-grounded answer rates rise then fall to zero within 50 steps, replicated across four seeds and seven reward designs. The authors identify a key structural difference between knowledge-graph APIs and other tool types (Python, web search, JSON): sparse, non-natural-language feedback signals (e.g., empty brackets '[]') prevent the model from recovering via pretraining-familiar error signals. A direct oracle ablation shows relation selection is not the bottleneck—95.4% of errors are retrieval-composition failures—and self-distillation reaches 40% EM at 7B, with capacity scaling to 14B yielding only marginal gains, suggesting an interface-bound ceiling.

Evaluation and Benchmarking · Agent and Tool Ecosystem · RLVR · Self-Distillation · GRPO · +4 more

7 · Qwen Research · May 18, 2026 · source ↗

Qwen2.5-1M: Open-Source Models with 1M Token Context Window Released

Alibaba's Qwen team has released two open-source models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, extending context length to 1 million tokens. The release follows the upgrade of the proprietary Qwen2.5-Turbo to 1M context two months earlier. It includes inference framework support for deployment and marks the first time Qwen's open-weight models have reached this context length.

Long Context Evolution · Open Weights Progress · Qwen2.5-7B-Instruct-1M · Alibaba Qwen · +2 more