Researchers introduce 'fuzzy-function programming' and its instantiation Program-as-Weights (PAW), a paradigm where a 4B compiler model converts natural-language function specifications into parameter-efficient adapters for a frozen lightweight interpreter. A 0.6B Qwen3 interpreter running PAW programs matches the performance of direct prompting with Qwen3-32B while using ~1/50th the inference memory and running at 30 tokens/s on a MacBook M3. The approach reframes large foundation models as one-time 'tool builders' rather than per-input solvers, targeting tasks like log triage, JSON repair, and intent-based ranking that resist rule-based implementation. The authors also release FuzzyBench, a 10M-example training dataset.
Alibaba's Qwen team released Qwen1.5-32B, an open-weights language model with roughly 32 billion parameters, positioned as the capstone of the Qwen1.5 series. The model targets the emerging consensus around 30B parameters as an optimal balance between performance, memory footprint, and inference efficiency. It is released alongside code on GitHub, weights on HuggingFace and ModelScope, and an interactive demo.
Alibaba's Qwen team released the Qwen2.5 series of decoder-only dense language models, open-sourcing seven variants spanning 0.5B to 72B parameters. The release targets production use cases in the 10B-30B parameter range and mobile deployments at the 3B scale. It is a significant expansion of the open-weights frontier from a Tier 1 Chinese AI lab.
A new arXiv paper evaluates six LLM-based coding agents on four esoteric programming languages (including Brainfuck and Befunge-98), finding that the strongest agents—Claude Opus 4.6 and GPT-5.4 xhigh—often avoid writing the target language directly, instead generating it via Python metaprograms. Forbidding this strategy causes large performance drops, and text guidance alone does not transfer the capability to weaker models, though sharing Opus-derived Python helper code does sharply improve mid-tier agents. The study reveals capability stratification that mainstream benchmarks like SWE-Bench Verified compress into narrow bands, suggesting frontier agents succeed by constructing and debugging working models of unfamiliar environments rather than pattern-matching to training data.
DeepSWIP introduces a single-world counterfactual semantics for DeepProbLog, enabling causal inference over neurosymbolic programs that combine neural perception with probabilistic logic. The approach uses neural materialization to reduce neural predicates to standard ProbLog choices, then applies Single World Intervention Programs (SWIPs) and weighted model counting to compute exact counterfactuals from a single transformed program. Experiments on MPI3D validate the method against a DeepTwin construction across 12,000 queries and show a 2.14× inference speedup, while a SUMO HOV experiment demonstrates that neural calibration degradation biases plug-in causal estimates and that a correctly scoped AIPW estimator removes most first-order bias.
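The contrast between a plug-in estimate and an AIPW estimate in the item above can be illustrated with a minimal sketch. It uses the textbook AIPW form for a binary treatment with a known propensity of 0.5, not the paper's scoped estimator. The synthetic data, the true effect of 2.0, and the +1.0 miscalibration in `mu1_hat` are all assumptions invented for illustration.

```python
import random

TRUE_EFFECT = 2.0  # assumed ground-truth treatment effect for the toy data


def simulate(n, seed=0):
    """Generate (x, t, y) rows with a randomized binary treatment."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        t = 1 if rng.random() < 0.5 else 0
        y = x + TRUE_EFFECT * t + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def mu0_hat(x):
    """Outcome model for the untreated arm (correct in this toy setup)."""
    return x


def mu1_hat(x):
    """Outcome model for the treated arm, deliberately off by +1.0."""
    return x + TRUE_EFFECT + 1.0


def plug_in(rows):
    """Plug-in estimate: average difference of the two outcome models."""
    return sum(mu1_hat(x) - mu0_hat(x) for x, _, _ in rows) / len(rows)


def aipw(rows, e=0.5):
    """Textbook AIPW: plug-in term plus propensity-weighted residual corrections."""
    total = 0.0
    for x, t, y in rows:
        total += (
            mu1_hat(x) - mu0_hat(x)
            + t * (y - mu1_hat(x)) / e
            - (1 - t) * (y - mu0_hat(x)) / (1 - e)
        )
    return total / len(rows)


rows = simulate(20000)
print("plug-in:", plug_in(rows))
print("AIPW:   ", aipw(rows))
```

In this idealized setting the plug-in estimate inherits the full +1.0 model error, while the residual correction pulls the AIPW estimate back toward the true effect. Real settings with estimated propensities would not be this clean.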
Alibaba's Qwen team has open-sourced the Qwen2.5-Coder family of code-specialized language models, with the flagship 32B-Instruct variant claiming state-of-the-art performance among open-source code models and parity with GPT-4o on coding benchmarks. The release spans multiple model sizes, expanding on previously released smaller variants. The models are described as combining strong coding ability with general reasoning and mathematical skills.
Alibaba released the Qwen3.5 family of eight open-weights vision-language models ranging from 0.8B to 397B parameters, built on a mixture-of-experts architecture with mixed attention and Gated DeltaNet layers. The flagship Qwen3.5-397B-A17B outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, while the 9B model surpasses OpenAI's gpt-oss-120B on most language tasks. Open weights are available under Apache 2.0, with hosted agentic variants (Qwen3.5-Plus, Qwen3.5-Flash) available via Alibaba Cloud. The release is notable for strong small-model efficiency and comes amid reported team departures following the Qwen3 rollout.
Researchers propose a pipeline that approximates transformer attention heads with executable Python programs generated by a language model, then re-ranked by held-out predictive accuracy. Applied to GPT-2, TinyLlama-1.1B, and Llama-3B, fewer than 1,000 programs reproduce attention patterns with >75% average IoU similarity on TinyStories. Replacing 25% of attention heads with programmatic surrogates incurs only a 16% average perplexity increase while preserving downstream QA performance, demonstrating a path toward symbolic transparency in neural models.
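The scoring and re-ranking step described in the item above might look roughly like the sketch below. The summary does not say how attention is turned into a discrete pattern, so the 0.5 threshold in `binarize` is an assumption. The candidate programs `prev_token` and `first_token`, and the toy attention matrix, are hypothetical stand-ins for model-generated programs and real held-out data.

```python
from typing import Callable, Dict, List, Set, Tuple

Pattern = Set[Tuple[int, int]]  # (query_position, key_position) pairs


def binarize(attn: List[List[float]], threshold: float = 0.5) -> Pattern:
    """Keep the (query, key) pairs whose attention weight meets the threshold."""
    return {
        (q, k)
        for q, row in enumerate(attn)
        for k, w in enumerate(row)
        if w >= threshold
    }


def iou(a: Pattern, b: Pattern) -> float:
    """Intersection over union of two attention patterns."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prev_token(tokens: List[str]) -> Pattern:
    """Hypothetical candidate: each position attends to the previous one."""
    return {(q, q - 1) for q in range(1, len(tokens))}


def first_token(tokens: List[str]) -> Pattern:
    """Hypothetical candidate: each later position attends to position 0."""
    return {(q, 0) for q in range(1, len(tokens))}


def rerank(
    candidates: Dict[str, Callable[[List[str]], Pattern]],
    heldout: List[Tuple[List[str], List[List[float]]]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Sort candidate programs by mean IoU against held-out attention maps."""
    scores = {}
    for name, program in candidates.items():
        vals = [iou(program(toks), binarize(attn, threshold)) for toks, attn in heldout]
        scores[name] = sum(vals) / len(vals)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy held-out example: a causal attention map that mostly looks one token back.
heldout = [
    (
        ["the", "cat", "sat"],
        [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    )
]
ranked = rerank({"prev_token": prev_token, "first_token": first_token}, heldout)
print(ranked)
```

Scoring on held-out examples rather than the examples used to generate the programs is what keeps the ranking from rewarding programs that merely memorize a few attention maps.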
Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. The model claims top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro, and is available under the MIT license via HuggingFace. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, and research least), with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.