Qwen2.5-7B-Instruct-1M
qwen2-5-7b-instruct-1m-be09a203·3 events·first seen 1mo agoAliases: Qwen2.5-7B-Instruct-1M, Qwen2.5-14B-Instruct-1M, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, qwen2.5-14B-Instruct
Co-occurring entities
More like this (12)
Recent events (3)
Qwen2.5-1M: Open-Source Models with 1M Token Context Window Released
Alibaba's Qwen team has released two open-source models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, extending context length to 1 million tokens. This follows the earlier upgrade of the proprietary Qwen2.5-Turbo to 1M context two months prior. The release includes inference framework support for deployment, marking the first time Qwen's open-weight models have reached this context length.
Peak-Then-Collapse: RLVR Tool-Use Failures on Knowledge-Graph APIs
This paper investigates RLVR-based tool-use training (GRPO on Qwen2.5-7B-Instruct) on a minimal knowledge-graph API (Freebase over Complex WebQuestions) and documents a 'peak-then-collapse' pattern where tool-grounded answer rates rise then fall to zero within 50 steps, replicated across four seeds and seven reward designs. The authors identify a key structural difference between knowledge-graph APIs and other tool types (Python, web search, JSON): sparse, non-natural-language feedback signals (e.g., empty brackets '[]') prevent the model from recovering via pretraining-familiar error signals. A direct oracle ablation shows relation selection is not the bottleneck—95.4% of errors are retrieval-composition failures—and self-distillation reaches 40% EM at 7B, with capacity scaling to 14B yielding only marginal gains, suggesting an interface-bound ceiling.
Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation
This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture where semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.