Almanac
Guide · In-depth

Alibaba's Qwen Program: A Tier-1 Open-Weight AI Lab from China

TL;DR: Alibaba's Qwen team has built one of the most prolific open-weight model programs in the world, shipping dense and MoE language models, multimodal systems, reasoning specialists, and coding agents across multiple generations in rapid succession. The program began with a clear open-source mandate but has recently pivoted its top-tier models toward closed weights and a revenue focus, even as it continues to release smaller open-weight variants under permissive licenses. Qwen now competes directly with frontier U.S. labs on benchmarks while serving as the backbone of a growing ecosystem of third-party fine-tunes and deployments.

Key takeaways

  • The flagship Qwen3-Coder-480B-A35B-Instruct is a 480B-parameter MoE model with 35B active parameters, 256K native context (1M via extrapolation), claiming performance comparable to Claude Sonnet 4 on agentic coding benchmarks.
  • Qwen3.5's 397B flagship outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks; its 9B model beats OpenAI's gpt-oss-120B on most language tasks despite being roughly 13x smaller.
  • Qwen3.7-Max ranks fifth to seventh on the Artificial Analysis Intelligence Index and claims the lowest hallucination rate among frontier models tested, partly by declining to answer over half of prompts.
  • The program spans dense LLMs (0.5B–72B), MoE systems (up to 480B total / 35B active), vision-language models, audio models, math specialists, a translation model covering 92 languages, and a real-time omni model.
  • A strategic shift is underway: top-tier models (Qwen3.7-Max) are now closed-weights with a revenue focus, while open-weight releases continue at smaller scales under Apache 2.0.
  • A 2026 study found Qwen 2.5's post-training introduced the most extreme geopolitical bias shift of seven labs tested — an 18x increase in China-favourability log-odds — raising auditing concerns for enterprise deployments.

What Alibaba's Qwen program is

Alibaba's Qwen team is the AI research and model-development arm of Alibaba Group, responsible for one of the most prolific open-weight model programs to emerge from a Chinese technology company. Since its public launch, which the team documented in a retrospective introduction published in early 2024, the program has shipped dense language models, Mixture-of-Experts (MoE) systems, vision-language models, audio models, math specialists, a real-time omni model, a 92-language translation model, and a family of agentic coding models. Weights for smaller and mid-tier models are released under Apache 2.0 on Hugging Face and ModelScope; top-tier models are increasingly served through a closed API on Alibaba Cloud's DashScope platform.

Model lineage and architecture choices

The Qwen series has iterated through several named generations, each expanding both scale and modality coverage.

Dense LLMs (Qwen1.5 → Qwen2 → Qwen2.5): The Qwen1.5 series established the open-weight baseline, culminating in Qwen1.5-110B — the team's first model exceeding 100 billion parameters, claiming comparable performance to Meta's Llama-3-70B on base benchmarks. Qwen2.5 followed three months after Qwen2 and was described at launch as potentially the largest open-source model release in history, spanning seven dense variants from 0.5B to 72B parameters. The 7B and 14B instruct variants were later extended to 1M-token context windows, with the proprietary Qwen2.5-Turbo reaching 1M tokens first.

MoE scaling (Qwen2.5-Max → Qwen3-Coder → Qwen3.5): Qwen2.5-Max was the team's first publicly acknowledged frontier-scale MoE, developed concurrently with Qwen2 research and citing DeepSeek V3's disclosures as a reference point for MoE scaling insights. Qwen3-Coder-480B-A35B-Instruct, released in July 2025, is the current open-weight coding flagship: 480B total parameters, 35B active per token, 256K native context with 1M-token extrapolation, and benchmark performance described as comparable to Claude Sonnet 4 on agentic coding, browser-use, and tool-use tasks. The Qwen3.5 family then extended MoE to vision-language models, releasing eight open-weight models from 0.8B to 397B parameters, with the flagship Qwen3.5-397B-A17B outperforming GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks. The 9B model in the same family surpasses OpenAI's gpt-oss-120B on most language tasks despite being roughly 13x smaller, a notable small-model efficiency result.
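The total and active parameter counts quoted above can be turned into a per-token active share with simple arithmetic. The sketch below is illustrative only: the 480B/35B figures come from this guide, while reading the "A17B" suffix of Qwen3.5-397B-A17B as 17B active parameters is an assumption based on the naming pattern.

```python
# Parameter counts in billions: (total, active per token).
# 480/35 is stated in the guide; 397/17 assumes "A17B" means 17B active.
MOE_MODELS = {
    "Qwen3-Coder-480B-A35B": (480, 35),
    "Qwen3.5-397B-A17B": (397, 17),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that participate in each token."""
    return active_b / total_b


for name, (total_b, active_b) in MOE_MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")

# Size ratio behind the small-model comparison (nominal 120B vs. 9B).
print(f"gpt-oss-120B / 9B = {120 / 9:.1f}x")
```

Under these assumptions both flagships touch well under a tenth of their weights for any single token, which is where the inference-cost advantage over an equally large dense model comes from.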

Reasoning specialists (QwQ series): QwQ-32B-Preview (November 2024) introduced uncertainty and iterative questioning as explicit design principles for a reasoning-focused model. QwQ-32B (March 2025) followed with scaled reinforcement learning training, drawing explicit comparison to DeepSeek R1's cold-start and multi-stage RL approach. The Qwen team has also published Skill-RM, a reward modeling framework that treats evaluation as an agentic skill, enabling a single model to orchestrate heterogeneous evaluation criteria.

Multimodal coverage: Qwen2-VL (August 2024) extended vision-language capabilities to videos exceeding 20 minutes. Qwen2-Audio added audio-language understanding. QVQ-72B-Preview integrated visual understanding with advanced reasoning at 72B scale. Qwen2.5-Omni (March 2025) unified text, image, audio, and video processing in a 7B end-to-end model with real-time streaming output in both text and speech. HappyHorse-1.0 (a video generation model) was noted as the closest competitor to ByteDance's Seedance 2.0 on video leaderboards.

The agentic pivot and long-context strategy

Agentic capability has become the organizing theme of recent Qwen releases. Qwen3.7-Max — the current closed-weights flagship — is explicitly positioned for long-running agentic tasks including coding and scientific discovery, with a 1M-token context window and 208 tokens/second output speed. Its training approach separates task, agentic harness, and verifier components to prevent overfitting to specific task setups. On the Artificial Analysis Intelligence Index it ranks fifth to seventh, trailing leading U.S. models, but claims the lowest hallucination rate among frontier models tested — partly by declining to answer over half of prompts.

Long-context handling has been pursued through three parallel strategies. The first is architectural training to native 1M-token windows (Qwen2.5-Turbo and the open-weight Qwen2.5-1M models). The second is context extrapolation beyond the native training length (Qwen3-Coder reaches 1M tokens by extrapolating from 256K). The third is agent-based decomposition: Qwen-Agent processed 1M-token documents with an 8K-native Qwen2 model by splitting the work into retrieval and reasoning steps, and the agent was then used to generate synthetic training data for fine-tuning new long-context models, forming a self-improvement loop.
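The agent-based decomposition idea can be sketched in a few lines: split a long document into chunks, keep only the chunks relevant to the question, and hand that reduced context to a short-context model. This is a minimal sketch under stated assumptions, not Qwen-Agent's actual implementation. The function names are hypothetical, whitespace-separated words stand in for tokens, and keyword overlap stands in for whatever retrieval method the real agent uses.

```python
def split_into_chunks(words, chunk_size):
    """Cut a word list into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def relevance(chunk, query_terms):
    """Count how many words in the chunk match a query term (toy scoring)."""
    return sum(1 for word in chunk if word.lower() in query_terms)


def select_context(document, query, chunk_size=50, budget=100):
    """Return the most relevant chunks, in document order, within a word budget."""
    words = document.split()
    query_terms = {term.lower() for term in query.split()}
    chunks = split_into_chunks(words, chunk_size)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: relevance(chunks[i], query_terms),
        reverse=True,
    )
    picked, used = [], 0
    for i in ranked:
        if relevance(chunks[i], query_terms) == 0:
            break  # remaining chunks share no terms with the query
        if used + len(chunks[i]) > budget:
            continue  # chunk would overflow the short model's window
        picked.append(i)
        used += len(chunks[i])
    return [" ".join(chunks[i]) for i in sorted(picked)]


# A ~1,000-word document with one relevant sentence buried in the middle.
document = "filler " * 500 + "the launch code is 7421 " + "filler " * 500
context = select_context(document, "launch code")
print(len(context), "chunk(s) selected")
```

The selected chunks would then be passed, together with the question, to the short-context model for the reasoning step.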

Infrastructure and research contributions

Beyond model releases, the Qwen team has published infrastructure work relevant to practitioners. OFASys (2022) provided a multimodal multitask training framework. A global-batch load balancing technique for MoE training was published in January 2025, addressing expert load imbalance as a near-free efficiency improvement. The CLP (Collocation-Length Predictor) paper demonstrated 1.14x–1.29x inference speedup on Qwen2.5 models via multi-token prediction with near-zero quality degradation, using a ~4.6K–7.7K parameter prediction layer. Qwen2.5-Math introduced a process reward model supervising intermediate reasoning steps rather than final answers only.
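To make the global-batch load balancing idea concrete, the sketch below compares an auxiliary balance loss computed per micro-batch with the same loss computed over the pooled batch. It assumes the commonly used form of the loss, N_E · Σ f_i · p_i, where f_i is the fraction of tokens routed to expert i and p_i is the mean gate probability for expert i. The guide does not give the paper's exact formulation, so the formula and the toy data are assumptions.

```python
def load_balance_loss(assignments, gate_probs, num_experts):
    """Auxiliary loss N_E * sum_i f_i * p_i over one batch of tokens.

    assignments: expert index chosen for each token (top-1 routing).
    gate_probs:  per-token list of gate probabilities, one per expert.
    """
    n = len(assignments)
    f = [assignments.count(e) / n for e in range(num_experts)]
    p = [sum(row[e] for row in gate_probs) / n for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Two micro-batches from different domains, each favouring one expert.
micro_a = ([0, 0, 0, 0], [[0.9, 0.1]] * 4)
micro_b = ([1, 1, 1, 1], [[0.1, 0.9]] * 4)

# Micro-batch balancing: each micro-batch is penalised on its own.
micro_loss = (load_balance_loss(*micro_a, 2) + load_balance_loss(*micro_b, 2)) / 2

# Global-batch balancing: routing statistics are pooled before the loss.
global_loss = load_balance_loss(
    micro_a[0] + micro_b[0], micro_a[1] + micro_b[1], 2
)

print(f"micro-batch loss: {micro_loss:.2f}")
print(f"global-batch loss: {global_loss:.2f}")
```

In this toy case the pooled statistics are perfectly balanced, so the global loss sits at its minimum while the per-micro-batch loss still penalises each domain-specific batch. That is the intuition for why balancing at the global level leaves experts freer to specialise.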

Strategic shift: open to closed at the frontier

The most significant recent development is a strategic pivot at the top of the model stack. Qwen3.7-Max is closed-weights, and reporting indicates leadership changes in the Qwen team consistent with a revenue-focused direction. This mirrors a pattern seen at other Chinese labs: open-weight releases continue for smaller models (driving ecosystem adoption and developer goodwill), while the most capable systems are reserved for API monetization. The Qwen3.5 small-model series (0.8B–9B, Apache 2.0) and Qwen3-Coder-480B (open-weight) suggest the open-weight commitment persists below the frontier tier.

Risks and considerations for practitioners

A 2026 study testing seven open-weight LLM pairs found that geopolitical bias is introduced during post-training rather than inherited from pre-training data. Qwen 2.5 showed the most extreme shift of the seven labs tested — an 18x increase in China-favourability log-odds — attributed to alignment and RLHF processes. The effect is also language-dependent. For practitioners deploying Qwen models in politically sensitive contexts or multilingual applications, this finding warrants explicit evaluation and auditing of post-training behavior.
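For readers less familiar with log-odds, the sketch below shows how a change on that scale maps back to a probability. The log-odds of a probability p is ln(p / (1 − p)). The baseline value here is hypothetical, and reading "18x" as a multiplier on the log-odds is an assumption, since the study's exact metric is not described in this guide.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def inv_logit(z: float) -> float:
    """Probability corresponding to log-odds z."""
    return 1 / (1 + math.exp(-z))


# Hypothetical base-model preference: a near-neutral 51%.
base_p = 0.51
base_z = logit(base_p)

# Assumed reading of the finding: post-training multiplies the log-odds by 18.
shifted_z = 18 * base_z
shifted_p = inv_logit(shifted_z)

print(f"base log-odds {base_z:.3f} -> shifted log-odds {shifted_z:.3f}")
print(f"base probability {base_p:.2f} -> shifted probability {shifted_p:.2f}")
```

The point of the exercise is that a large multiplier on a small log-odds value can still correspond to a moderate probability, so an audit should report both the baseline and the shifted figures.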

Separately, research on Qwen3's thinking mode (chain-of-thought reasoning on/off) found that enabling thinking improves performance on "Planning" constraints (global counting, structure) but consistently worsens "Precision" constraints (exact local form), with 10–20% of prompts switching outcomes. This has practical implications for when to enable reasoning modes in instruction-following pipelines.
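One way to act on this finding is to choose the reasoning mode per request based on the kind of constraints in the prompt. The sketch below is a hypothetical policy: the constraint labels and the function name are invented for illustration, and real pipelines would need their own constraint taxonomy. With Qwen3's Hugging Face chat template, the resulting boolean would typically be passed as the `enable_thinking` argument to `apply_chat_template`.

```python
# Hypothetical constraint labels, grouped by the two categories in the study.
PLANNING = {"word_count", "paragraph_count", "section_structure"}
PRECISION = {"exact_phrase", "lowercase_only", "end_with_string"}


def should_enable_thinking(constraints):
    """Enable thinking only when planning-type constraints dominate."""
    planning = sum(1 for c in constraints if c in PLANNING)
    precision = sum(1 for c in constraints if c in PRECISION)
    return planning > precision


print(should_enable_thinking(["word_count", "section_structure"]))
print(should_enable_thinking(["exact_phrase", "lowercase_only", "word_count"]))
```

Ties and unknown labels default to thinking off in this sketch; that choice is arbitrary and should be checked against the pipeline's own evaluation data.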

Ecosystem position

Qwen models are distributed via Hugging Face, ModelScope, DashScope (Alibaba Cloud API), and GitHub. The open-weight releases have generated a substantial third-party fine-tuning ecosystem; the Qwen3 chat template design alone warranted a dedicated Hugging Face blog post analyzing its encoding of reasoning modes and tool-use conventions. Qwen3-Coder-480B has appeared in domain-specific benchmarks such as PowerCodeBench (power-system simulation code generation), where it leads among open-weight models alongside Llama-3.1-405B.

Qwen model family landscape

Selected Qwen model generations at a glance

| Model | Type | Scale | Key capability | Weights |
|---|---|---|---|---|
| Qwen2.5 (dense) | Dense LLM | 0.5B–72B | General language; 1M-token context (7B/14B variants) | Open (Apache 2.0) |
| Qwen2.5-Max | MoE LLM | Undisclosed | Frontier-scale MoE; concurrent with Qwen2 research | Closed API |
| QwQ-32B | Reasoning LLM | 32B | RL-scaled reasoning; comparable to DeepSeek R1 approach | Open |
| Qwen2.5-Omni | Omni multimodal | 7B | Text + image + audio + video; real-time streaming | Open |
| Qwen3.5 VLM | Vision-language MoE | 0.8B–397B | 397B tops 28/44 vision benchmarks vs. GPT-5.2 / Claude 4.5 Opus | Open (Apache 2.0) |
| Qwen3.7-Max | Agentic LLM | Undisclosed | 1M context; 5th–7th on AA Intelligence Index; closed weights | Closed API |
| Qwen3-Coder-480B | Code MoE | 480B total / 35B active | SOTA open-weight agentic coding; ~Claude Sonnet 4 performance | Open |

Synthesized from the events bundle; undisclosed architecture cells render as noted.

Timeline

  1. OFASys multimodal multitask framework released (2022): early Qwen infrastructure

  2. Qwen-VL-Plus/Max launched; Qwen series retrospective published as canonical reference (early 2024)

  3. Qwen1.5-110B: first open-weight Qwen model exceeding 100B parameters

  4. Qwen2.5 family released, described as potentially the largest open-source model release in history

  5. Qwen2.5-Turbo extends context to 1M tokens; QwQ-32B-Preview reasoning model released (November 2024)

  6. QwQ-32B (RL-scaled reasoning) and Qwen2.5-Omni (real-time omni model) released (March 2025)

  7. Qwen3-Coder-480B-A35B released (July 2025): SOTA open-weight agentic coding MoE

  8. Qwen3.5 VLM family (0.8B–397B) released; 397B tops 28/44 vision benchmarks

  9. Qwen3.7-Max launched as closed-weights agentic flagship; signals revenue-focused pivot

Related topics

Qwen · Qwen Team · Qwen2.5 · Qwen3 · Qwen2.5-Max · Mixture of Experts · Hugging Face · ModelScope · OpenAI · Anthropic · DeepSeek V4

FAQ

Is Qwen open-source?

Partially — smaller and mid-tier models are released under Apache 2.0 on Hugging Face and ModelScope, but the top-tier flagship (Qwen3.7-Max) is now closed-weights and available only via API, reflecting a recent strategic pivot toward revenue.

How does Qwen compare to leading U.S. models?

Qwen3.7-Max ranks fifth to seventh on the Artificial Analysis Intelligence Index, trailing leading models from OpenAI, Anthropic, and Google; the Qwen3.5-397B VLM outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, and Qwen3-Coder-480B claims performance comparable to Claude Sonnet 4 on agentic coding.

What is the MoE architecture used in Qwen models?

Mixture-of-Experts (MoE) models activate only a subset of parameters per token — for example, Qwen3-Coder-480B activates 35B of its 480B parameters — enabling frontier-scale capacity at lower inference cost than equivalent dense models.
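The routing step behind "only a subset of parameters per token" can be illustrated with a toy top-k router. This is a generic sketch of top-k gating, not Qwen's actual router: the expert functions, logits, and the choice of k=2 are all made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, k))


# Four toy "experts"; each is a scalar function standing in for an FFN block.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
logits = [0.1, 2.0, 1.5, -1.0]

chosen = route_token(logits, k=2)
print([i for i, _ in chosen])  # indices of the experts used for this token
y = moe_layer(3.0, experts, logits, k=2)
```

Only two of the four experts are evaluated for this token, which mirrors, at toy scale, how a model with 480B total parameters can compute with 35B per token.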

What geopolitical bias risk has been identified in Qwen models?

A 2026 study found that Qwen 2.5's post-training introduced the most extreme geopolitical bias shift among seven labs tested, with an 18x increase in China-favourability log-odds, attributed to alignment and RLHF processes rather than pre-training data.

What is Qwen's approach to long-context handling?

Qwen has pursued multiple strategies: architectural training (Qwen2.5-Turbo and open-weight 7B/14B models at 1M tokens), context extrapolation (Qwen3-Coder supports 1M via extrapolation from 256K native), and agent-based decomposition (Qwen-Agent processed 1M-token documents using an 8K-native model by decomposing retrieval and reasoning tasks).

More on Alibaba (6)

7 · The Batch · 18d ago

Alibaba releases Qwen3.5 open-weights vision-language model family with MoE architecture across eight sizes

Alibaba released the Qwen3.5 family of eight open-weights vision-language models ranging from 0.8B to 397B parameters, built on a mixture-of-experts architecture with mixed attention and Gated DeltaNet layers. The flagship Qwen3.5-397B-A17B outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, while the 9B model surpasses OpenAI's gpt-oss-120B on most language tasks. Open weights are available under Apache 2.0, with hosted agentic variants (Qwen3.5-Plus, Qwen3.5-Flash) available via Alibaba Cloud. The release is notable for strong small-model efficiency and comes amid reported team departures following the Qwen3 rollout.

6 · The Batch · 15d ago

Alibaba's Qwen3.7-Max positions as top Chinese LLM with closed weights and agentic focus

Alibaba released Qwen3.7-Max, a closed-weights proprietary model targeting long-running agentic tasks like coding and scientific discovery, with a 1M-token context window and 208 tokens/second output speed. The model ranks fifth to seventh on the Artificial Analysis Intelligence Index, trailing leading U.S. models from OpenAI, Anthropic, and Google but claiming the lowest hallucination rate among frontier models tested—partly by declining to answer over half of prompts. Alibaba's training approach separates task, agentic harness, and verifier components to prevent overfitting to specific setups. The release continues Alibaba's strategic shift from open to closed weights for top-tier models, with leadership changes in the Qwen team suggesting a revenue-focused pivot.

7 · Qwen Research · 1mo ago

Qwen2.5-Turbo Extends Context Length to 1M Tokens

Alibaba's Qwen team has released Qwen2.5-Turbo, extending the model's context window from 128K to 1 million tokens (approximately 1 million English words). The update includes optimizations for both model capabilities and inference performance at extreme context lengths. The model is available via API and through HuggingFace and ModelScope demos.

8 · Qwen Research · 1mo ago

Qwen2.5-LLM: Alibaba releases open-weight language models from 0.5B to 72B

Alibaba's Qwen team releases the Qwen2.5 series of decoder-only dense language models, open-sourcing seven variants spanning 0.5B to 72B parameters. The release targets production use cases in the 10-30B range and mobile deployments at 3B scale. This represents a significant expansion of the open-weights frontier from a Tier 1 Chinese AI lab.

7 · Qwen Research · 1mo ago

Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding

Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The model claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos exceeding 20 minutes in length for question answering, dialog, and content creation tasks.

8 · Qwen Research · 1mo ago

Qwen2.5: Large-Scale Open-Source Foundation Model Family Release

Alibaba's Qwen team has released Qwen2.5, described as potentially the largest open-source model release in history, following three months of development after Qwen2. The release encompasses a family of foundation models with improvements in knowledge and reasoning capabilities. The announcement targets developers who have been building on Qwen2 and incorporates feedback from that community.