6 · arXiv cs.LG (Machine Learning) · 18d ago

Iteris: Agentic Research Loops for Computational Mathematics

Iteris is an agentic AI research system for open problems in computational mathematics. It combines numerical experimentation, adversarial construction, and algorithm design in an automated loop. Applied to two open problems from a Simons Workshop collection, Iteris produced numerical evidence, constructions, and proof drafts that, after expert review, yielded two verified results: a phase diagram comparing conjugate gradient with randomized coordinate descent, and a counterexample involving QR factorization with column pivoting under low coherence. The paper argues that agentic AI can participate meaningfully in mathematical research workflows, while human validation remains essential.

Frontier Model Releases · Evaluation and Benchmarking · Agent and Tool Ecosystem · Iteris · randomized coordinate descent · QR factorization with column pivoting · Simons Workshop on Computational Mathematics · arXiv:2602.05394 · conjugate gradient

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

7 · The Batch · 17d ago

Google's Aletheia agent uses Gemini 3 Deep Think to generate novel solutions to unsolved Erdős problems

Google researchers introduced Aletheia, an agentic workflow using Gemini 3 Deep Think that generates, verifies, and revises solutions to previously unsolved mathematical problems. Applied to Erdős problems, Aletheia produced 13 correct solutions out of 200 evaluated, 4 of which were genuinely novel contributions not found in existing literature. The announcement also reports Gemini 3 Deep Think's benchmark performance: 48.4% on HLE, 84.6% on ARC-AGI-2, and 93.8% on GPQA Diamond. The system demonstrates both the promise and the current limitations of AI-assisted mathematical research, with only 6.5% of a hard problem set solved correctly under the intended interpretation.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · Gemini Deep Think · Tony Feng · +9 more

6 · Google DeepMind Blog · 1mo ago

Accelerating discovery with the AI for Math Initiative

Google DeepMind has announced the AI for Math Initiative, a collaborative effort bringing together leading research institutions to advance the use of AI in mathematical research. The initiative aims to pioneer AI-driven approaches to mathematical discovery. The announcement comes from a Tier 1 source, but its body text is sparse and gives little technical detail about the specific methods, models, or partner institutions involved.

Frontier Model Releases · Evaluation and Benchmarking · AI for Math Initiative · Google DeepMind · +1 more

6 · arXiv cs.AI · 25d ago

VeriTrace: Cognitive-Graph Framework with Explicit Regulatory Loops for Deep Research Agents

VeriTrace introduces a cognitive-graph framework for deep research agents that replaces implicit LLM reasoning over intermediate representations with three explicit regulatory loops: interpretive update, deviation feedback, and schema revision. The system addresses contamination and error propagation in evolving mental models during complex multi-step research tasks. Using Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench Insight and by 5.9 pp in Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DeepResearch Bench.

Frontier Model Releases · Evaluation and Benchmarking · DeepSeek V4 · cognitive-graph · DeepResearch Bench · +4 more

4 · AI Snake Oil · 1mo ago

Can AI automate computational reproducibility?

This commentary introduces a new benchmark aimed at measuring AI's ability to automate computational reproducibility in scientific research. The piece examines whether AI systems can reliably re-execute and validate scientific computations, a key bottleneck in research integrity. It frames reproducibility automation as a concrete, measurable capability for evaluating AI's impact on science.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Normal Tech / AI Snake Oil · AI Reproducibility Benchmark

5 · Latent Space · 17d ago

Latent Space profiles Axiom Math on verified generation and compounding intelligence

Latent Space interviews Carina Hong of Axiom Math, a company focused on formal verification applied to AI-generated mathematics. The discussion centers on 'verified generation' and 'compounding intelligence' as frameworks for scaling AI reasoning beyond informal, unverified outputs. The piece is relevant to the growing intersection of formal methods, mathematical reasoning, and AI capability development.

Frontier Model Releases · Evaluation and Benchmarking · Carina Hong · Axiom Math · Latent Space

6 · The Batch · 1mo ago

Data Points: Thinking Machines Interaction Model, ERNIE 5.1, Co-Mathematician, RL Conductor, and More

This edition of The Batch covers five notable AI developments: Thinking Machines' research preview of an 'interaction model' with a 200ms micro-turn multimodal architecture; Baidu's ERNIE 5.1, a compressed derivative of ERNIE 5.0 using only 6% of typical pre-training compute; Google DeepMind's Co-Mathematician collaborative workbench reaching 48% on FrontierMath Tier 4; a 7B RL Conductor model that orchestrates multi-agent workflows via reinforcement learning; and Google's Magic Pointer cursor system powered by Gemini. Secondary items include GitHub Copilot pricing restructuring ahead of usage-based billing.

Training Infrastructure · Frontier Model Releases · Thinking Machines · SGLang · GitHub · +21 more

4 · arXiv cs.CL · 4d ago

Informath: Symbolic informalization for converting formal proofs to fluent natural language

The paper introduces Informath, a project for symbolic informalization: converting formally verified mathematics into readable natural language without loss of precision. The architecture uses Dedukti as an interlingua hub connecting proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) for multilingual natural language generation. The work is relevant to AI-assisted formal verification pipelines, where autoformalization produces machine-checked proofs that need to be made human-interpretable.

Informath · Grammatical Framework · Dedukti · +3 more

5 · Import AI · 1mo ago

Import AI 455: AI systems are about to start building themselves

Import AI issue 455 covers the emerging trend of AI systems automating AI research, framing it as a first step toward recursive self-improvement. The commentary synthesizes recent developments suggesting AI is beginning to participate meaningfully in its own development pipeline. As a tier-2 newsletter, this represents curated analysis of frontier AI research directions rather than primary reporting.

Frontier Model Releases · AI Safety Research · Recursive Self-Improvement · automated AI research · Jack Clark · +2 more