Entity · company

Alibaba

company · active · alibaba-06df8145 · 60 events · first seen May 18, 2026

Aliases: Alibaba

Co-occurring entities

Qwen · Anthropic · OpenAI · Qwen3 · ModelScope · Qwen2.5 · Google · Hugging Face · Artificial Analysis Intelligence Index · NVIDIA · Meta · Qwen2.5-Max · HuggingFace · Kimi K2 · Moonshot AI · GPT-Realtime-2 · ByteDance · Qwen3.5 Omni · Qwen3-4B · Tencent

More like this (12)

Alibaba Qwen · Alibaba DAMO Academy · Amazon · Alibaba Cloud · Alibaba Qwen Team · Baidu · AllenAI · Huawei · PayPal · China · Amazon Web Services · IBM

Guides (1)

Alibaba

Alibaba: The Tech Giant Quietly Reshaping Open AI

Recent events (50)

4 · arXiv · cs.CL · 3d ago

Controlled factorial study disentangles architecture, model variant, and scale effects in LLM-based entity matching

A new arXiv preprint presents a controlled factorial study of language model-based entity matching across three matcher architectures (bi-encoder, cross-encoder, generative), three model variants, and three model sizes from the Qwen3 family, totaling 1,215 fine-tuning runs on nine datasets. Key findings include: model variant (pretraining objective) is more important than scale for bi-encoders; cross-encoders consistently outperform bi-encoders but larger models narrow the gap; generative matchers only outperform cross-encoders under distribution shift; and larger models are more prone to shortcut learning. The study also evaluates cross-dataset transferability and computational cost, releasing all code and results.

Evaluation and Benchmarking · Alibaba · Beyond Scale and Generation: Understanding Language Model-based Entity Matching · Qwen3

6 · arXiv · cs.CL · 4d ago

Skill Self-Play: Co-evolving LLM capabilities via structured self-play with dynamic skill routing

Researchers introduce Skill Self-Play (Skill-SP), a reinforcement learning framework that addresses the diversity-vs-verifiability dilemma in LLM self-evolution by using agent skills as a middle ground. The system comprises a proposer, solver, and dynamic skill controller that co-evolve in a continuous loop: the proposer generates tasks conditioned on sampled skills, the solver explores solutions, and the skill controller updates an expanding skill library based on execution feedback. Evaluations on tool-use and reasoning benchmarks show consistent performance gains on capable backbones and recovery for initially misaligned models. Code is released under the Qwen-Applications GitHub organization, suggesting Alibaba/Qwen team involvement.

Frontier Model Releases · Agent and Tool Ecosystem · Skill Self-Play: Pushing the Frontier of LLM Capability with Co-Evolving Skills · Alibaba · Qwen · +2 more

8 · The Batch · Jul 24, 2026

Moonshot AI's Kimi K3 (2.8T-parameter MoE) ranks third on Intelligence Index, first among open-weights models

Moonshot AI released Kimi K3, a 2.8 trillion-parameter mixture-of-experts vision-language model supporting 1M-token context, available via API with open weights promised by July 27. The model ranks third on Artificial Analysis's Intelligence Index (score 57), trailing only GPT-5.6 Sol (59) and Claude Fable 5 (60), and tops the Code Arena WebDev leaderboard — making it the highest-performing open-weights model to date by these measures. Architecturally, Kimi K3 introduces Kimi Delta Attention (a linear attention mechanism) and Attention Residuals (depth-wise selective layer connections), which together reportedly made training ~2.5x more compute-efficient than its predecessor. The article also notes that Alibaba launched Qwen3.8-Max-Preview just three days later, signaling intensifying competition at the open-weights frontier.

Frontier Model Releases · Open Weights Progress · GPT-5.6 Sol · Kimi K2 · Artificial Analysis Intelligence Index · +13 more

6 · Hacker News · Jul 21, 2026

Alibaba releases Qwen-Image-3.0 multimodal model

Alibaba's Qwen team released Qwen-Image-3.0, a new image-generation or image-understanding model emphasizing rich content, authentic details, and deep knowledge. The announcement is drawing significant community attention on Hacker News with 416 points and 174 comments. This represents a notable update in the competitive multimodal AI space from a major Chinese lab.

Open Weights Progress · Multimodal Progress · Alibaba · Qwen-Image-3.0

5 · Hacker News · Jul 20, 2026

Commentary: Kimi K3, Qwen 3.8, and Anthropic's competitive position

A piece from Emerging Trajectories analyzes the competitive dynamics between Moonshot AI's Kimi K3, Alibaba's Qwen 3.8, and Anthropic's strategic position, framing the latter as potentially under pressure. The article surfaced on Hacker News with 248 points and 248 comments, indicating significant community engagement. The framing suggests concern about Anthropic's ability to maintain frontier status as Chinese labs release competitive models.

Frontier Model Releases · Open Weights Progress · Alibaba · Kimi K3 · Qwen3 · +2 more

5 · Hacker News · Jul 19, 2026

Qwen 3.8 Max Preview appears on QwenCloud pricing page

A QwenCloud pricing/token-plan page referencing 'Qwen 3.8 Max Preview' surfaced on Hacker News, suggesting Alibaba's Qwen team is preparing or has quietly launched a new flagship model version. The item is a community signal rather than an official announcement, with minimal detail beyond the model name appearing in a pricing context. If confirmed, this would represent a new major release in the Qwen model series.

Frontier Model Releases · Open Weights Progress · QwenCloud · Alibaba · Qwen 3.7 Max

6 · Hacker News · Jul 19, 2026

Alibaba Qwen releases Qwen 3.8 model

Alibaba's Qwen team announced Qwen 3.8, a new model in the Qwen 3 series. The announcement generated significant community engagement on Hacker News with 416 points and 314 comments. Details on capabilities and benchmarks are not available from this source snippet alone, but the community response suggests notable interest in the release.

Frontier Model Releases · Open Weights Progress · Alibaba · Qwen3

7 · The Batch · Jul 17, 2026

OpenAI GPT-Live Pairs Full-Duplex Voice Models with GPT-5.5 Reasoning Backend

OpenAI released GPT-Live-1 and GPT-Live-1 mini on July 8, 2026, replacing Advanced Voice Mode with a full-duplex voice system that processes audio continuously and delegates harder queries to GPT-5.5 in the background. The architecture separates a real-time conversational voice model from a reasoning model, with user-selectable reasoning effort levels (Instant, Medium, High) routing to GPT-5.5 Instant or GPT-5.5 Thinking accordingly. Performance gains are substantial: GPQA scores jumped from 45.3% (AVM) to 84.2% (GPT-Live-1 at high reasoning), and BrowseComp improved from 0.7% to 75.2%. The system is live globally on iOS, Android, and ChatGPT.com for paid plans, though no developer API has shipped yet.

Frontier Model Releases · Agent and Tool Ecosystem · Thinking Machines · GPT-Live · ChatGPT · +18 more

6 · The Batch · Jul 16, 2026

Data Points: PrismML fits 27B model on iPhone; Cognition SWE-1.7, Nvidia Audex, Anthropic language-value study

A newsletter digest covers four notable AI developments: PrismML (a Caltech/Khosla spinout) compressed Alibaba's Qwen 27B model to under 4 GB via ternary/binary quantization for on-device iPhone inference; Cognition released SWE-1.7 (trained on Kimi K2.7), jumping from 9.4% to 42.3% on FrontierCode 1.1 Main with novel RL and infrastructure techniques; Nvidia introduced Audex, a 30B unified audio-text transformer trained on 157B audio tokens; and Anthropic published research showing Claude's expressed values shift measurably by language across 309,815 conversations. Each item represents a distinct technical development across on-device inference, coding agents, multimodal models, and model behavior analysis.

Inference Economics · Agent and Tool Ecosystem · Kimi K2 · Claude Sonnet · Claude Opus 4.6 · +18 more

6 · arXiv · cs.CL · Jul 10, 2026

Auditing LLM-as-Judge reliability: judge upgrades are not interchangeable across model families

A new arXiv paper investigates measurement validity problems in LLM-as-judge evaluation, finding that swapping evaluator models changes scores even when candidate responses are fixed. Across four judgment datasets, the authors compare Qwen3 dense judges (1.7B–32B) and MiniMax M2/M2.7 API releases, finding that only the Qwen3 1.7B→4B upgrade yields robust adjacent gains while MiniMax adjacent releases do not. Stronger judges reduce but do not eliminate position and verbosity bias, and repeated-sample juries add little when errors are correlated. The paper argues for standardized reporting requirements including dataset slices, bias probes, error-dependence estimates, and protocol audit trails.

Evaluation and Benchmarking · AI Safety Research · When the Judge Changes, So Does the Measurement: Auditing LLM-as-Judge Reliability · MiniMax · Alibaba · +1 more

7 · The Batch · Jul 8, 2026

GPT-5.6 wider API release imminent after government delay; roundup covers Microsoft MAI shift, Claude Cowork mobile, Nvidia Audex, OpenAI mini voice

OpenAI's GPT-5.6 models are set for broader API release following a Department of Commerce-approved safety review that delayed launch for weeks; GPT-5.6 Sol Ultra scores 91.9% on TerminalBench 2.1 versus Claude Mythos 5 at 88%, with pricing roughly half of Anthropic's comparable tier. Microsoft is actively replacing OpenAI and Anthropic models in Excel, Outlook, and Teams with its internally built MAI models to reduce third-party dependency as its OpenAI discount partnership nears expiration. Anthropic expanded Claude Cowork to web and mobile for Max plan subscribers, with usage data from 1.2 million sessions showing over 90% of use is non-developer work. Nvidia released Audex, a 30B MoE audio-text model that avoids the typical 'text tax' of multimodal models, shipping under a noncommercial license.

Frontier Model Releases · Inference Economics · Claude Mythos · Center for AI Standards and Innovation · Microsoft · +19 more

6 · The Batch · Jul 8, 2026

The Batch digest: China bans anthropomorphic bots, DiffusionGemma, Anthropic Claude Code study, Seedance 2.5, Code Arena

A multi-story digest covers five distinct AI developments: ByteDance and Alibaba are shutting down customizable humanlike AI agents ahead of China's July 15 Interim Measures for AI-Based Anthropomorphic Interactive Services; Google released DiffusionGemma, an experimental 26B MoE diffusion-based text model generating 256-token blocks at 1,000+ tokens/sec on H100; Anthropic published findings from 400,000 Claude Code sessions showing domain expertise—not coding skill—drives agentic output volume; Seedance released version 2.5 of its video generator with higher resolution and longer clips; and Arena.ai expanded Code Arena to fullstack web development evaluation. The China regulatory action is the most significant item, representing a concrete enforcement moment for AI persona/companion regulation.

Frontier Model Releases · Evaluation and Benchmarking · Seedance 2.0 · Doubao · DiffusionGemma · +13 more

6 · Hacker News · Jul 3, 2026

Alibaba reportedly banning Claude Code over alleged backdoor risks

Alibaba is reportedly planning to ban the use of Claude Code in its workplace, citing alleged backdoor risks, according to a Reuters source. The move reflects growing geopolitical and security tensions around the use of US-developed AI coding tools inside Chinese corporations. If confirmed, this signals a significant enterprise-level rejection of Anthropic's developer tooling in a major market.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Reuters · Alibaba · Claude Code · +1 more

4 · Hacker News · Jun 29, 2026

Community discussion: Qwen 3.6 27B praised as sweet spot for local development

A blog post from Quesma, amplified on Hacker News with 466 points and 412 comments, argues that Qwen 3.6 27B is an optimal model for local development workflows. The high engagement suggests significant community interest in this open-weights model as a practical local inference choice. The discussion likely covers performance-per-resource tradeoffs relevant to practitioners running models on consumer hardware.

Open Weights Progress · Inference Economics · Qwen 3.5 27B · Alibaba · Quesma

7 · arXiv · cs.CL · Jun 25, 2026

Study finds real-time voice AI systems ignore vocal delivery cues despite perceiving them

A new arXiv paper evaluates four production real-time voice AI systems — OpenAI GPT Realtime 2, Google Gemini 3.1 Flash Live, Qwen3.5 Omni Plus, and Qwen3.5 Omni Flash — on tasks where vocal delivery (distress, fear, sarcasm) carries meaningful information distinct from word content. All four systems consistently act on words alone, ending calls with crying users who deny distress, approving frightened-voice wire transfers, and accepting sarcastic consent. Critically, three of four systems can correctly identify the emotional state when asked directly, revealing a gap between perception and decision-making the authors term the 'emotional intelligence gap.' Prompting systems to attend to vocal delivery improves performance only partially and inconsistently.

Evaluation and Benchmarking · AI Safety Research · Qwen3.5 Omni Flash · GPT-Realtime-2 · Google · +6 more

7 · arXiv · cs.CL · Jun 24, 2026

Qwen-AgentWorld: Language world models for general agent simulation and planning

Alibaba's Qwen team introduces Qwen-AgentWorld, a pair of language world models (35B-A3B and 397B-A17B) trained to simulate agentic environments across 7 domains using over 10M interaction trajectories. The models are trained via a three-stage pipeline (CPT, SFT, RL) and evaluated on AgentWorldBench, a new benchmark constructed from 5 frontier models across 9 established benchmarks. Beyond simulation, the work demonstrates two downstream use cases: using the world model as a decoupled RL training environment and as a warm-up for agent foundation models, both yielding gains over baselines.

Frontier Model Releases · Evaluation and Benchmarking · AgentWorldBench · Qwen-AgentWorld-35B-A3B · Alibaba · +3 more

5 · GitHub Trending · Jun 23, 2026

Alibaba releases page-agent: JavaScript in-page GUI agent for natural language web control

Alibaba has published page-agent, an open-source TypeScript library that enables natural language control of web interfaces directly in the browser. The project has accumulated 19,213 GitHub stars with 425 added today, indicating strong community interest. It represents a browser-native approach to GUI agents, distinct from server-side or desktop automation frameworks.

Agent and Tool Ecosystem · page-agent · Alibaba

6 · arXiv · cs.CL · Jun 17, 2026

Location metadata causes systematic geographic bias leakage in LLMs, even with 'Unknown' placeholders

Researchers evaluate 'location leakage' — the phenomenon where LLMs generate geographically biased outputs when exposed to location metadata in user profiles, even when prompts are geographically neutral. Across creative writing and Q&A tasks, leakage spikes up to 793x above baseline for models including Llama 3.1-8B, Qwen3-8B, and Claude Sonnet 4.6. A novel structural finding shows that replacing location with 'Unknown' still elevates leakage by up to 72x, indicating the user profile frame itself acts as a conditioning signal independent of geographic content. This has direct implications for AI systems that use user metadata for localization.

Evaluation and Benchmarking · AI Safety Research · Claude Sonnet 4 · Alibaba · Qwen3-4B · +4 more

5 · arXiv · cs.AI · Jun 10, 2026

CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup

Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14x–1.29x speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.

Inference Economics · Qwen2.5 · Alibaba · CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference · +2 more

5 · arXiv · cs.CL · Jun 9, 2026

Study finds thinking mode in LRMs shifts instruction-following errors by constraint type rather than uniformly degrading performance

A new arXiv paper investigates how enabling built-in chain-of-thought reasoning ('Thinking ON/OFF') in Qwen3 and Hunyuan models affects instruction following on IFEval. Aggregate pass-rate changes are small but 10-20% of prompts switch outcomes, with 'Planning' constraints (global counting, structure) improving under thinking while 'Precision' constraints (exact local form) consistently worsen. Activation patching and trace-relevance analyses reveal an execution gap: thinking traces engage with Planning constraints but fail to translate that engagement into compliance, while Precision failures are more mechanistically recoverable. The findings have practical implications for when to enable reasoning modes in instruction-following applications.

Frontier Model Releases · Evaluation and Benchmarking · When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following · Hunyuan · Alibaba · +3 more

6 · The Batch · Jun 5, 2026

The Batch Issue 356: Qwen3.7-Max release, White House AI executive order, fine-tuning breaks copyright alignment

The Batch issue 356 covers several distinct AI developments: Alibaba's release of Qwen3.7-Max, a closed-weights flagship LLM targeting agentic coding and scientific tasks with a novel RL training approach that decouples task, harness, and verifier; a new White House executive order on frontier AI models focused on cybersecurity, including voluntary model-sharing with government; and a finding that fine-tuning breaks copyright alignment in LLMs. Andrew Ng's editorial commentary frames the executive order as a reasonable compromise, noting Anthropic's Mythos vulnerability-detection model as a key driver of the cybersecurity concerns behind the regulation.

Frontier Model Releases · AI Safety Research · Qwen3.7-Plus-Preview · DeepLearning.AI · Artificial Analysis Intelligence Index · +9 more

6 · The Batch · Jun 5, 2026

Alibaba's Qwen3.7-Max positions as top Chinese LLM with closed weights and agentic focus

Alibaba released Qwen3.7-Max, a closed-weights proprietary model targeting long-running agentic tasks like coding and scientific discovery, with a 1M-token context window and 208 tokens/second output speed. The model ranks fifth to seventh on the Artificial Analysis Intelligence Index, trailing leading U.S. models from OpenAI, Anthropic, and Google but claiming the lowest hallucination rate among frontier models tested—partly by declining to answer over half of prompts. Alibaba's training approach separates task, agentic harness, and verifier components to prevent overfitting to specific setups. The release continues Alibaba's strategic shift from open to closed weights for top-tier models, with leadership changes in the Qwen team suggesting a revenue-focused pivot.

Frontier Model Releases · Open Weights Progress · Qwen3.7-Plus-Preview · Alibaba Cloud Model Studio · Artificial Analysis Intelligence Index · +8 more

7 · The Batch · Jun 4, 2026

Microsoft Build: Seven in-house AI models, GitHub Copilot desktop agent manager, and Web IQ search API for agents

Microsoft announced seven new AI models trained from scratch (not distilled from OpenAI), including the flagship MAI-Thinking-1 reasoning model and MAI-Transcribe-1.5, plus a 'Frontier Tuning' reinforcement learning approach for enterprise workflow training. GitHub released a desktop Copilot app designed to manage multiple parallel AI agents with isolated git worktrees and bidirectional canvases. Microsoft also launched Web IQ, an agent-native Bing-powered grounding API already powering search in Copilot and ChatGPT, running 2.5x faster than alternatives with lower token costs. The roundup also covers Nous Research's Hermes Desktop cross-platform agent app, Alibaba's Qwen3.7-Plus multimodal model, and OpenAI's role-specific Codex plugins.

Frontier Model Releases · Inference Economics · MAI-Thinking-1 · FLEURS · Frontier Tuning · +15 more

6 · The Batch · Jun 3, 2026

Qwen3.5 Small tops mobile-sized open models; GPT-5.3 Instant, Gemini 3.1 Flash-Lite, Claude memory import, and LLM deanonymization research

Alibaba released the Qwen3.5 Small model series (0.8B–9B parameters) with a hybrid Gated Delta Networks + sparse MoE architecture, with the 9B model outperforming OpenAI's gpt-oss-120B on GPQA Diamond despite being 13.5x smaller; all weights are Apache 2.0 licensed. Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model at $0.25/M input tokens with 2.5x faster TTFT than Gemini 2.5 Flash. OpenAI released GPT-5.3 Instant targeting conversational quality improvements and hallucination reduction, while Anthropic added memory import/export functionality across all Claude tiers. Separately, researchers from MATS, Anthropic, and ETH Zurich demonstrated that LLM-based pipelines can deanonymize pseudonymous online users at 68% recall/90% precision for $1–4 per profile.

Frontier Model Releases · Open Weights Progress · Claude · Google · Alibaba · +14 more

6 · arXiv · cs.LG · Jun 3, 2026

Skill-RM: A unified reward model framework treating evaluation as an agentic skill

Researchers from the Qwen team propose Skill-RM, a framework that reformulates reward modeling as the execution of a reusable 'Reward-Evaluation Skill,' enabling a single model to orchestrate heterogeneous evaluation criteria including rule-based verifiers, ground-truth references, and rubrics. By treating reward computation as a structured agentic task, Skill-RM dynamically selects and aggregates evidence per input rather than relying on static evaluation. Experiments on reward benchmarks and downstream tasks (best-of-N selection, RL) show consistent improvements over traditional judge baselines. The code is publicly released under the Qwen-Applications GitHub organization.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Skill-RM · Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill · Alibaba · +2 more

7 · The Batch · Jun 2, 2026

Alibaba releases Qwen3.5 open-weights vision-language model family with MoE architecture across eight sizes

Alibaba released the Qwen3.5 family of eight open-weights vision-language models ranging from 0.8B to 397B parameters, built on a mixture-of-experts architecture with mixed attention and Gated DeltaNet layers. The flagship Qwen3.5-397B-A17B outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, while the 9B model surpasses OpenAI's gpt-oss-120B on most language tasks. Open weights are available under Apache 2.0, with hosted agentic variants (Qwen3.5-Plus, Qwen3.5-Flash) available via Alibaba Cloud. The release is notable for strong small-model efficiency and comes amid reported team departures following the Qwen3 rollout.

Frontier Model Releases · Open Weights Progress · GPT-5.2 · Alibaba Cloud Model Studio · Claude Opus 4.6 · +10 more

7 · The Batch · Jun 2, 2026

Data Points: OpenAI shuts down Sora, Anthropic multi-agent harness, EVA voice benchmark, Arm AGI CPU, White House AI preemption proposal

OpenAI is shutting down its Sora text-to-video platform without explanation, ending a major Disney licensing deal worth up to $1 billion and eliminating video capabilities from ChatGPT amid Hollywood copyright tensions. Anthropic published details on a multi-agent harness enabling Claude to build full-stack applications over multi-hour sessions using a planner-generator-evaluator architecture. ServiceNow AI Research released EVA, an open-source two-dimensional benchmark for voice agents measuring both task accuracy and conversational experience quality. Additional items cover Arm's first self-designed data center CPU (AGI CPU) co-developed with Meta, and the Trump Administration's legislative proposal for a federal AI framework that would preempt state AI laws.

Training Infrastructure · Frontier Model Releases · ServiceNow AI Research · ClawBot · Playwright · +19 more

7 · The Batch · Jun 2, 2026

Nvidia releases Nemotron 3 Super 120B-A12B open-weights model with hybrid Mamba-2/MoE architecture

Nvidia released Nemotron 3 Super 120B-A12B, an open-weights LLM with a hybrid Mamba-2/transformer/MoE architecture that activates only 12B parameters per token and supports up to 1 million token context. The model claims the fastest inference speed in its size class at 442 tokens/second and leads open-weights models on PinchBench agentic task evaluation, outperforming larger models including Kimi K2.5 (1T parameters). Nvidia is releasing weights, training data, and recipes under a permissive commercial license, and plans a $26B five-year investment in open-weights models — framed partly as a strategic response to Chinese labs building capable open-weights models on non-Nvidia hardware.

Frontier Model Releases · Open Weights Progress · Nemotron 3 Super 120B-A12B · Nemotron 3 Ultra-500B-A50B · PivotRL · +18 more

7 · The Batch · Jun 1, 2026

ByteDance Deploys Seedance 2.0 Video Model to CapCut's 736M Users as OpenAI Shutters Sora

ByteDance has integrated Seedance 2.0, its multimodal video generation model, into CapCut for paying users across multiple global regions, reaching a platform with approximately 736 million monthly active users. The model supports text, image, audio, and video inputs, generates synchronized audio-video output in a single pass including multi-shot sequences, and ranks in the top two on Arena AI and Artificial Analysis video leaderboards, with Alibaba's HappyHorse-1.0 as its closest competitor. Simultaneously, OpenAI is discontinuing the Sora app and API after daily active users fell below 500,000 and operating costs reached an estimated $1 million per day. The contrast illustrates a broader market shift where Chinese developers are accelerating video model releases while U.S. consumer video products retreat.

Frontier Model Releases · Evaluation and Benchmarking · Seedance 2.0 · Artificial Analysis · CapCut · +15 more

5 · arXiv · cs.CL · Jun 1, 2026

PowerCodeBench: Knowledge Boundary Probing and Intervention for LLM-Based Power System Code Generation

This paper introduces PowerCodeBench, an execution-validated benchmark for evaluating LLMs on power-system simulation code generation using the pandapower library. The authors identify that failures are dominated by API-knowledge boundary errors (hallucinated function names, misused parameters) rather than reasoning failures, and propose a boundary-aware intervention combining API demand estimation with targeted documentation injection. Evaluated across ten open-weight models (1.5B–480B) and four commercial APIs on 2,000 tasks, the intervention yields 32–56 accuracy point improvements while using only 41% of baseline prompt-token cost. Open-weight models in the 70B–120B range match commercial mid-tier accuracy, with Llama-3.1-405B and Qwen3-Coder-480B leading.

Evaluation and Benchmarking · Open Weights Progress · pandapower · Meta Llama 3.1 405B · Alibaba · +7 more

7 · The Batch · Jun 1, 2026

Data Points: Qwen3.7-Max, OpenAI Math Proof, Gated DeltaNet-2, Trump AI Order, Microsoft Fara1.5

This edition of The Batch covers five significant AI developments: Alibaba's Qwen3.7-Max reasoning model with 1M token context and agentic capabilities ranking fifth on the Artificial Analysis Intelligence Index; an OpenAI reasoning model resolving the 80-year-old Erdős planar unit distance problem; Nvidia's Gated DeltaNet-2 outperforming Mamba-3 and other linear attention architectures; Trump pulling back a proposed AI regulation executive order; and Microsoft Research's Fara1.5 computer-use agent family beating OpenAI Operator and Google Gemini on the Online-Mind2Web benchmark.

Long Context Evolution · Frontier Model Releases · Paul Erdős · Fara1.5 · Mamba · +25 more

7 · arXiv · cs.AI · May 25, 2026

Geopolitical Bias in LLMs Originates in Post-Training, Not Pre-Training Data

A study testing seven open-weight LLM pairs (base vs. chat models) across seven labs finds that geopolitical bias is introduced during post-training rather than inherited from pre-training data. Six of seven labs showed post-training shifts favoring the developer's home country or region, with Alibaba's Qwen 2.5 showing the most extreme shift (18x increase in China-favourability log-odds). The effect is also language-dependent: Mistral becomes pro-France only under French prompting. The authors argue this implicates alignment and RLHF processes as active shapers of geopolitical perspective, calling for greater transparency and auditing of post-training pipelines.

Evaluation and Benchmarking · Open Weights Progress · Mistral AI · Alibaba · Mistral · +6 more

7 · The Batch · May 23, 2026

Thinking Machines Lab Reveals TML-Interaction-Small: Real-Time Multimodal Interaction Model

Thinking Machines Lab (founded by Mira Murati) has announced TML-Interaction-Small, a 276B-parameter mixture-of-experts multimodal model that processes audio, video, and text concurrently using 200ms 'micro-turns' rather than waiting for conversational turns to complete. The architecture uses encoder-free early fusion, pairing a fast foreground interaction model with an asynchronous background reasoning model that shares context. On interactivity benchmarks (FD-bench V1/V1.5), it outperforms GPT-Realtime-2 and Gemini-3.1-flash-live-preview, though it trails GPT-Realtime-2 on intelligence benchmarks. A closed research preview is expected in coming months with wider release later in 2026.

Frontier Model Releases · Inference Economics · encoder-free early fusion · Thinking Machines · GPT-Realtime-2 · +16 more

7 · Hacker News · May 20, 2026

Qwen3.7-Max: The Agent Frontier

Alibaba's Qwen team has announced Qwen3.7-Max, positioned as a frontier model for agentic tasks. The announcement appears on the official Qwen blog and generated significant community discussion on Hacker News with 559 points and 217 comments. The model name suggests it is part of the Qwen 3 generation, with a focus on agent capabilities.

Frontier Model Releases · Open Weights Progress · Alibaba · Qwen · Qwen2.5-Max · +1 more

4 · Hugging Face Blog · May 19, 2026

The 4 Things Qwen-3's Chat Template Teaches Us

A Hugging Face blog post performs a deep dive into the chat template design of Qwen-3, examining the technical choices made in its prompt formatting and conversation structure. The analysis surfaces lessons about how chat templates encode model behavior, reasoning modes, and tool-use conventions. As a tier-2 commentary piece, it provides practical implementation guidance for developers integrating Qwen-3 into applications.

Frontier Model Releases · Enterprise Deployment Patterns · Alibaba · Hugging Face · Qwen3 · +1 more

6 · Hacker News · May 18, 2026

Qwen 3.7 Preview Announced by Alibaba

Alibaba's Qwen team has announced a preview of Qwen 3.7, the next iteration in their Qwen 3 model series. The announcement appeared on Twitter/X and generated notable community discussion on Hacker News with 179 points and 67 comments. Specific capability details and model specifications are not available from this source alone.

Frontier Model Releases · Open Weights Progress · Qwen 3.7 · Alibaba · Qwen Team · +1 more

4 · Qwen Research · May 18, 2026

OFASys: Multitask Multimodal Learning Framework from Alibaba/Qwen

Alibaba's Qwen team released OFASys, an open-source framework designed to simplify multimodal multitask learning, building on their earlier OFA unified pretrained model. The system aims to reduce engineering friction in setting up multi-task, multi-modal training pipelines, including data batching and training stability. It is positioned as infrastructure for building generalist AI models with minimal code overhead.

Agent and Tool Ecosystem Multimodal Progress Alibaba OFA Qwen +1 more

4Qwen Research·May 18, 2026·source ↗

Introducing the Qwen Series: Overview of Alibaba's Open-Source LLM Journey

Alibaba's Qwen team published a retrospective introduction to the Qwen series of large language models, four months after the initial Qwen-7B open-source release. The post consolidates links to their paper, GitHub, Hugging Face, and ModelScope repositories, and outlines the team's objectives for the open-source LLM program. It serves as a canonical reference point for the Qwen model family's public positioning.

Frontier Model Releases Open Weights Progress Alibaba Qwen-7B Qwen +2 more

6Qwen Research·May 18, 2026·source ↗

Introducing Qwen-VL-Plus and Qwen-VL-Max: Upgraded Multimodal Models from Alibaba

Alibaba's Qwen team has launched two enhanced versions of their multimodal model, Qwen-VL-Plus and Qwen-VL-Max, building on the open-sourced Qwen-VL released in September 2023. Key improvements include substantially boosted image reasoning capabilities, enhanced detail recognition and text extraction from images, and support for high-definition images exceeding one million pixels across various aspect ratios. The upgrades represent a significant step forward in the Qwen-VL series' generalization and visual understanding capabilities.

Frontier Model Releases Open Weights Progress Qwen-VL Qwen-VL-Max Alibaba +2 more

6Qwen Research·May 18, 2026·source ↗

Qwen1.5-32B: Alibaba's 30B-Parameter Capstone for the Qwen1.5 Series

Alibaba's Qwen team released Qwen1.5-32B, a 32-billion-parameter open-weights language model positioned as the capstone of the Qwen1.5 series. The model targets the emerging consensus that roughly 30B parameters is an optimal balance between performance, memory footprint, and inference efficiency. It is released alongside code on GitHub, weights on HuggingFace and ModelScope, and an interactive demo.

Frontier Model Releases Open Weights Progress Qwen1.5-72B DBRX Qwen1.5-32B +4 more

5Qwen Research·May 18, 2026·source ↗

CodeQwen1.5: Alibaba's Open-Source Code LLM Release

Alibaba's Qwen team released CodeQwen1.5, an open-source large language model specialized for code generation and programming assistance. The release is positioned as a transparent, accessible alternative to proprietary coding assistants like GitHub Copilot, addressing concerns around cost, privacy, security, and copyright. The model is available on GitHub, HuggingFace, and ModelScope.

Open Weights Progress Agent and Tool Ecosystem CodeQwen1.5 Alibaba Qwen +3 more

7Qwen Research·May 18, 2026·source ↗

Qwen1.5-110B: Alibaba Releases First 100B+ Model in Qwen1.5 Series

Alibaba's Qwen team released Qwen1.5-110B, their first open-weights model exceeding 100 billion parameters. The team reports base-model benchmark performance comparable to Meta's Llama-3-70B, with strong results on the MT-Bench and AlpacaEval 2 chat evaluations. The release follows a wave of open-source models exceeding 100B parameters from other organizations.

Frontier Model Releases Evaluation and Benchmarking MT-Bench Meta-Llama-3-70B Alibaba +3 more

7Qwen Research·May 18, 2026·source ↗

Generalizing an LLM from 8k to 1M Context using Qwen-Agent

Alibaba's Qwen team describes an agent built on Qwen2 (8k native context) that processes documents up to 1M tokens by decomposing retrieval and reasoning tasks, reportedly outperforming both RAG pipelines and native long-context models. The agent framework was also used to generate synthetic training data for fine-tuning new long-context Qwen models, creating a self-improvement loop. This positions agent-based context extension as a practical alternative to architectural long-context training.

Long Context Evolution Open Weights Progress RAG Qwen2.5 Alibaba +2 more
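
The decomposition idea in the Qwen-Agent summary above (read a long document in pieces, keep only what is relevant, then reason over the reduced context) can be sketched as follows. This is a minimal illustration under stated assumptions, not Qwen-Agent's API: every function name is hypothetical, whitespace-separated words stand in for tokens, and a keyword-overlap score stands in for the LLM relevance check an actual agent would perform.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def keyword_relevance(chunk: List[str], query: str) -> int:
    """Toy stand-in for an LLM relevance check: count query words in the chunk."""
    chunk_words = {t.lower() for t in chunk}
    return sum(1 for w in query.lower().split() if w in chunk_words)


def select_context(document: str, query: str, chunk_size: int, budget: int,
                   score: Callable[[List[str], str], int] = keyword_relevance) -> str:
    """Pick the most relevant chunks that fit in a small context budget.

    The result is short enough to hand to a model with a limited native
    context, even when the document itself is far longer.
    """
    chunks = split_into_chunks(document.split(), chunk_size)
    scores = [score(c, query) for c in chunks]
    # Rank chunk indices by relevance, highest first (ties keep document order).
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)

    chosen: List[int] = []
    used = 0
    for i in ranked:
        if scores[i] == 0:
            break  # remaining chunks are irrelevant under this scorer
        if used + len(chunks[i]) > budget:
            continue  # chunk does not fit; try a smaller one
        chosen.append(i)
        used += len(chunks[i])

    chosen.sort()  # restore document order for readability
    return " ".join(" ".join(chunks[i]) for i in chosen)


if __name__ == "__main__":
    doc = " ".join(["filler"] * 1000 + "the launch code is 4721".split()
                   + ["filler"] * 1000)
    print(select_context(doc, "launch code", chunk_size=8, budget=16))
```

The selected text would then be passed, together with the question, to a short-context model for the final reasoning step; that call is left out because it depends on the serving setup.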

6Qwen Research·May 18, 2026·source ↗

Introducing Qwen2-Math: Math-Specialized LLMs from Alibaba's Qwen Team

Alibaba's Qwen team has released Qwen2-Math and Qwen2-Math-Instruct, a series of math-specialized large language models built on the Qwen2 architecture. The models are designed to enhance arithmetic and mathematical reasoning capabilities in LLMs. The initial release supports English only, with bilingual English/Chinese versions announced as forthcoming.

Frontier Model Releases Evaluation and Benchmarking Qwen2-Math-Instruct Qwen2.5 Alibaba +2 more

6Qwen Research·May 18, 2026·source ↗

Qwen2-Audio: Multimodal Audio-Language Model Release

Alibaba's Qwen team releases Qwen2-Audio, the successor to Qwen-Audio, capable of accepting both audio and text inputs and generating text outputs. The model is positioned as a step toward AGI by extending large language model capabilities to audio modalities. It is released with accompanying paper, GitHub repository, and model weights on Hugging Face and ModelScope.

Frontier Model Releases Open Weights Progress Alibaba Qwen Hugging Face +3 more

7Qwen Research·May 18, 2026·source ↗

Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding

Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The team claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos longer than 20 minutes for question answering, dialog, and content creation tasks.

Frontier Model Releases Evaluation and Benchmarking Qwen2.5-VL RealWorldQA DocVQA +6 more

8Qwen Research·May 18, 2026·source ↗

Qwen2.5-LLM: Alibaba releases open-weight language models from 0.5B to 72B

Alibaba's Qwen team releases the Qwen2.5 series of decoder-only dense language models, open-sourcing seven variants spanning 0.5B to 72B parameters. The release targets production use cases in the 10-30B range and mobile deployments at 3B scale. This represents a significant expansion of the open-weights frontier from a Tier 1 Chinese AI lab.

Frontier Model Releases Open Weights Progress Qwen2.5 Alibaba Qwen Team +4 more

8Qwen Research·May 18, 2026·source ↗

Qwen2.5: Large-Scale Open-Source Foundation Model Family Release

Alibaba's Qwen team has released Qwen2.5, described as potentially the largest open-source model release in history, following three months of development after Qwen2. The release encompasses a family of foundation models with improvements in knowledge and reasoning capabilities. The announcement targets developers who have been building on Qwen2 and incorporates feedback from that community.

Frontier Model Releases Open Weights Progress Qwen2.5 Alibaba Hugging Face +2 more

7Qwen Research·May 18, 2026·source ↗

Qwen2.5-Turbo Extends Context Length to 1M Tokens

Alibaba's Qwen team has released Qwen2.5-Turbo, extending the model's context window from 128K to 1 million tokens, which the announcement equates to roughly 1 million English words. The update includes optimizations for both model capabilities and inference performance at extreme context lengths. The model is available via API and through HuggingFace and ModelScope demos.

Long Context Evolution Frontier Model Releases Qwen2.5 Alibaba ModelScope +3 more

7Qwen Research·May 18, 2026·source ↗

QwQ-32B-Preview: Alibaba's Qwen Reasoning Model with Deep Reflection Capabilities

Alibaba's Qwen team has released QwQ-32B-Preview, a 32-billion parameter model designed for deep reasoning across mathematics, code, and general knowledge. The model is positioned as a reasoning-focused system that emphasizes uncertainty and iterative questioning as core design principles. It is available on GitHub, Hugging Face, ModelScope, and via a demo interface.

Frontier Model Releases Evaluation and Benchmarking Alibaba QwQ-32B-Preview Qwen +3 more