7 · arXiv cs.LG (Machine Learning) · 44h ago

Program-as-Weights: compiling natural-language specs into compact neural adapters for local execution

Researchers introduce 'fuzzy-function programming' and its instantiation Program-as-Weights (PAW), a paradigm where a 4B compiler model converts natural-language function specifications into parameter-efficient adapters for a frozen lightweight interpreter. A 0.6B Qwen3 interpreter running PAW programs matches the performance of direct prompting with Qwen3-32B while using ~1/50th the inference memory and running at 30 tokens/s on a MacBook M3. The approach reframes large foundation models as one-time 'tool builders' rather than per-input solvers, targeting tasks like log triage, JSON repair, and intent-based ranking that resist rule-based implementation. The authors also release FuzzyBench, a 10M-example training dataset.
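The ~1/50th memory figure can be sanity-checked with a back-of-the-envelope sketch. It assumes weight storage dominates inference memory and that both models are held at the same precision; the 2 bytes per parameter value and the function name are illustrative assumptions, not figures from the paper, and KV cache, activations, and the adapter itself are ignored.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    bytes_per_param=2.0 corresponds to 16-bit weights (an assumption).
    """
    return params_billions * bytes_per_param


# Parameter counts taken from the summary: 0.6B interpreter vs. 32B baseline.
interpreter_gb = weight_memory_gb(0.6)
baseline_gb = weight_memory_gb(32.0)

# At equal precision the ratio depends only on the parameter counts.
ratio = baseline_gb / interpreter_gb

print(f"interpreter: {interpreter_gb:.1f} GB, baseline: {baseline_gb:.1f} GB")
print(f"baseline / interpreter = {ratio:.1f}x")
```

Under these assumptions the ratio is about 53x, which is in line with the reported ~1/50th; the exact figure in practice would depend on quantization and runtime overheads.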


Related guides (3)

Open Weights Progress · Topic guide

Open Weights Progress: How Free AI Models Caught Up to the Frontier


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Inference Economics · Topic guide

Inference Economics: The Hidden Cost Battle Shaping AI


Related events (8)

6 · Qwen Research · May 18, 2026

Qwen1.5-32B: Alibaba's 30B-Parameter Capstone for the Qwen1.5 Series

Alibaba's Qwen team released Qwen1.5-32B, a ~30 billion parameter open-weights language model positioned as the capstone of the Qwen1.5 series. The model targets the emerging consensus around 30B parameters as an optimal balance between performance, memory footprint, and inference efficiency. It is released alongside code on GitHub, weights on HuggingFace and ModelScope, and an interactive demo.


8 · Qwen Research · May 18, 2026

Qwen2.5-LLM: Alibaba releases open-weight language models from 0.5B to 72B

Alibaba's Qwen team releases the Qwen2.5 series of decoder-only dense language models, open-sourcing seven variants spanning 0.5B to 72B parameters. The release targets production use cases in the 10-30B range and mobile deployments at 3B scale. This represents a significant expansion of the open-weights frontier from a Tier 1 Chinese AI lab.


6 · arXiv · cs.AI · Jun 10, 2026

Frontier coding agents use metaprogramming to handle esoteric programming languages

A new arXiv paper evaluates six LLM-based coding agents on four esoteric programming languages (including Brainfuck and Befunge-98), finding that the strongest agents—Claude Opus 4.6 and GPT-5.4 xhigh—often avoid writing the target language directly, instead generating it via Python metaprograms. Forbidding this strategy causes large performance drops, and text guidance alone does not transfer the capability to weaker models, though sharing Opus-derived Python helper code does sharply improve mid-tier agents. The study reveals capability stratification that mainstream benchmarks like SWE-Bench Verified compress into narrow bands, suggesting frontier agents succeed by constructing and debugging working models of unfamiliar environments rather than pattern-matching to training data.


5 · arXiv · cs.AI · Jun 19, 2026

DeepSWIP: Counterfactual reasoning for neural probabilistic logic programs via quotient-WMC

DeepSWIP introduces a single-world counterfactual semantics for DeepProbLog, enabling causal inference over neurosymbolic programs that combine neural perception with probabilistic logic. The approach uses neural materialization to reduce neural predicates to standard ProbLog choices, then applies Single World Intervention Programs (SWIPs) and weighted model counting to compute exact counterfactuals from a single transformed program. Experiments on MPI3D validate the method against a DeepTwin construction across 12,000 queries and show a 2.14× inference speedup, while a SUMO HOV experiment demonstrates that neural calibration degradation biases plug-in causal estimates and that a correctly scoped AIPW estimator removes most first-order bias.


8 · Qwen Research · May 18, 2026

Qwen2.5-Coder Series Open-Sourced: 32B Model Claims SOTA, Matches GPT-4o on Coding

Alibaba's Qwen team has open-sourced the Qwen2.5-Coder family of code-specialized language models, with the flagship 32B-Instruct variant claiming state-of-the-art performance among open-source code models and parity with GPT-4o on coding benchmarks. The release spans multiple model sizes, expanding on previously released smaller variants. The models are described as combining strong coding ability with general reasoning and mathematical skills.


7 · The Batch · Jun 2, 2026

Alibaba releases Qwen3.5 open-weights vision-language model family with MoE architecture across eight sizes

Alibaba released the Qwen3.5 family of eight open-weights vision-language models ranging from 0.8B to 397B parameters, built on a mixture-of-experts architecture with mixed attention and Gated DeltaNet layers. The flagship Qwen3.5-397B-A17B outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, while the 9B model surpasses OpenAI's gpt-oss-120B on most language tasks. Open weights are available under Apache 2.0, with hosted agentic variants (Qwen3.5-Plus, Qwen3.5-Flash) available via Alibaba Cloud. The release is notable for strong small-model efficiency and comes amid reported team departures following the Qwen3 rollout.


6 · arXiv · cs.LG · Jun 18, 2026

Program synthesis used to reverse-engineer transformer attention heads with executable Python surrogates

Researchers propose a pipeline that approximates transformer attention heads with executable Python programs generated by a language model, then re-ranked by held-out predictive accuracy. Applied to GPT-2, TinyLlama-1.1B, and Llama-3B, fewer than 1,000 programs reproduce attention patterns with >75% average IoU similarity on TinyStories. Replacing 25% of attention heads with programmatic surrogates incurs only a 16% average perplexity increase while preserving downstream QA performance, demonstrating a path toward symbolic transparency in neural models.
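For readers unfamiliar with IoU in this setting, the following is a minimal sketch of one plausible way to score a surrogate against a real attention pattern. The summary does not say how the paper computes IoU, so the thresholding step, the 0.1 threshold, and the function name are all assumptions made for illustration.

```python
def attention_iou(real, surrogate, threshold=0.1):
    """Intersection-over-union of the (query, key) positions whose
    attention weight exceeds `threshold` in each matrix.

    `real` and `surrogate` are lists of rows of attention weights.
    The thresholding scheme is an assumption, not the paper's method.
    """
    def active(matrix):
        return {
            (q, k)
            for q, row in enumerate(matrix)
            for k, weight in enumerate(row)
            if weight > threshold
        }

    real_set, surrogate_set = active(real), active(surrogate)
    union = real_set | surrogate_set
    if not union:
        return 1.0  # both patterns empty: treat as a perfect match
    return len(real_set & surrogate_set) / len(union)


# Toy causal attention over three tokens (made-up numbers).
real = [[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.2, 0.2, 0.6]]
# A hypothetical programmatic surrogate that attends to one position per query.
surrogate = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

score = attention_iou(real, surrogate)
print(score)  # 3 shared positions out of 6 in the union -> 0.5
```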


6 · The Batch · Jun 1, 2026

GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain

Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. The release claims top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro, and the model is available under the MIT license via HuggingFace. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, with research least), with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.
