6 · arXiv cs.CL (Computation and Language) · 22d ago

COMPOSE: Dual-Graph Framework for Generating Future Mathematical Theorems from Citations and Formal Structure

COMPOSE is a framework that generates plausible future theorem-like mathematical claims by conditioning a language model jointly on a scientific citation graph and a formal theorem dependency graph. The authors construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, plus a benchmark of 47K future papers from 2024–2025. In experiments, COMPOSE outperforms baselines both on retrieval against real future papers and under LLM-judge evaluation, producing more grounded and mathematically richer outputs. The work advances AI-assisted mathematical reasoning by combining informal scientific context with formal proof structure.

Frontier Model Releases Evaluation and Benchmarking Agent and Tool Ecosystem COMPOSE Mathlib grounded future mathematical generation ArXiv dual-graph framework

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · OpenAI Blog · 1mo ago

Generative Language Modeling for Automated Theorem Proving

OpenAI published research on applying generative language models to automated theorem proving, an early exploration of using neural language models to assist formal mathematical reasoning. The work investigates how language models can generate proof steps or complete proofs in formal systems. It introduced GPT-f, an automated prover and proof assistant for the Metamath formalization language, and is an early milestone in AI-assisted mathematical reasoning that later theorem-proving systems built on.

Frontier Model Releases Evaluation and Benchmarking automated theorem proving generative language modeling GPT-f

8 · arXiv · cs.AI · 15d ago

Goedel-Architect achieves state-of-the-art formal theorem proving with blueprint-based agentic framework

Goedel-Architect is an agentic framework for formal theorem proving in Lean 4 that uses blueprint generation — a dependency graph of definitions and lemmas — rather than recursive decomposition, enabling parallel lemma closure and global refinement. Built on DeepSeek-V4-Flash (284B-A13B), it achieves 99.2% pass@1 on MiniF2F-test and 75.6% on PutnamBench, scaling to 100% on MiniF2F, 88.8% on PutnamBench, and 4/6 on IMO 2025 when seeded with natural-language proofs. The authors claim state-of-the-art performance for an open-source pipeline at up to 500x lower cost than comparable systems.

Frontier Model Releases Evaluation and Benchmarking MiniF2F DeepSeek-V4-Flash Goedel-Architect
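
The Goedel-Architect summary above describes a blueprint as a dependency graph of definitions and lemmas that enables parallel lemma closure. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes a hypothetical blueprint (all node names are invented for illustration) and uses Python's standard `graphlib` to group nodes into batches whose dependencies are already satisfied, so every member of a batch could in principle be attempted in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical blueprint: each node maps to the set of nodes it depends on.
# The names below are illustrative and do not come from the paper.
blueprint = {
    "def_group": set(),
    "def_subgroup": {"def_group"},
    "lemma_identity_unique": {"def_group"},
    "lemma_subgroup_closed": {"def_subgroup"},
    "theorem_main": {"lemma_identity_unique", "lemma_subgroup_closed"},
}


def parallel_batches(graph):
    """Group nodes into batches; all dependencies of a batch lie in earlier batches."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the blueprint has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch as closed, unlocking its dependents
    return batches


batches = parallel_batches(blueprint)
for index, batch in enumerate(batches):
    print(index, batch)
```

In this toy graph the two nodes that depend only on `def_group` land in the same batch, which is the sense in which a dependency graph exposes work that a purely sequential decomposition would serialize.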

6 · arXiv · cs.AI · 5d ago

Formal theory shows infinite trivial output is provably necessary for AI systems generating valuable mathematics

A new arXiv paper models AI-assisted formal mathematics generation as a nested language-generation-in-the-limit problem, using a proof checker as a membership oracle and an adversarial enumeration of the mathematical literature as the signal for 'valuable' content. The authors prove a sharp dichotomy: generators emitting only finitely many trivial (correct but worthless) statements achieve at most α/2 coverage of unseen valuable mathematics, while allowing an infinite (but asymptotically vanishing) stream of trivia raises the optimum to 1−α/2. The central result is that a perfect verifier cannot substitute for mathematical taste, and the flood of certified-but-trivial output from AI proof systems is a provable mathematical necessity, not an engineering failure. The work formalizes the gap between formal verifiability and mathematical value, which is increasingly the binding constraint as AI-proof-assistant systems scale.

Evaluation and Benchmarking AI Safety Research Angluin's condition Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit Language Generation in the Limit

4 · arXiv · cs.CL · 4d ago

Informath: Symbolic informalization for converting formal proofs to fluent natural language

The paper introduces Informath, a project for symbolic informalization — converting formally verified mathematics into readable natural language without loss of precision. The architecture uses Dedukti as an interlingua hub connecting proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) for multilingual natural language generation. The work is relevant to AI-assisted formal verification pipelines where autoformalization produces machine-checked proofs that need to be made human-interpretable.

Informath Grammatical Framework Dedukti

5 · arXiv · cs.CL · 11d ago

IS-CoT framework addresses long-form generation collapse in LLMs via interleaved structural thinking

Researchers introduce IS-CoT (Interleaved Structural Chain-of-Thought), a framework that embeds a dynamic Plan-Write-Reflect cycle into LLM generation to overcome severe length collapse observed in reasoning-enhanced models for open-ended writing tasks beyond 2,000 words. The authors construct a multi-teacher training dataset of interleaved reasoning traces and train IS-Writer-8B, which achieves state-of-the-art results on LongBench-Write, outperforming DeepSeek-V3.2 by 3.08 points. The work identifies static hierarchical planning as a root cause of long-form degradation and proposes an in-model alternative to external agentic workflows.

Long Context Evolution Evaluation and Benchmarking DeepSeek V4 LongBench-Write IS-Writer-8B

8 · arXiv · cs.AI · 29d ago

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

Frontier Model Releases Evaluation and Benchmarking large language models Erdős Problems OEIS Conjectures

5 · Latent Space · 17d ago

Latent Space profiles Axiom Math on verified generation and compounding intelligence

Latent Space interviews Carina Hong of Axiom Math, a company focused on formal verification applied to AI-generated mathematics. The discussion centers on 'verified generation' and 'compounding intelligence' as frameworks for scaling AI reasoning beyond informal, unverified outputs. The piece is relevant to the growing intersection of formal methods, mathematical reasoning, and AI capability development.

Frontier Model Releases Evaluation and Benchmarking Carina Hong Axiom Math Latent Space

7 · arXiv · cs.AI · 22d ago

Bounding Compositional Incoherence in Multi-Component LLM Agents

This paper formalizes a failure mode in multi-component LLM agent systems where individual components are locally probabilistically coherent but their composition violates basic probability axioms. The authors introduce the 'compositional residual' (eps*) as a runtime-computable measure of this incoherence, finding it positive in 33–94% of ensemble cliques across 1,876 tested configurations on a four-LLM panel. A hierarchical Boyle-Dykstra projection is proposed as a deterministic repair, and an anytime-valid e-process enables sequential monitoring. Notably, three intuitive LLM-side mitigations—retrieval, partition-aware prompting, and aggregator-LLM—each fail or regress.

Evaluation and Benchmarking AI Safety Research Compositional Residual (eps*) Proportional Allocation Rule Multi-Component LLM Agent
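
The compositional incoherence item above turns on components that are each probabilistically coherent on their own while their combination is not. The toy sketch below illustrates that failure mode under stated assumptions: the two components, their event names, and their numbers are invented, and `monotonicity_gap` is a hypothetical stand-in that checks only one axiom (a joint event cannot be more probable than one of its parts). It is not the paper's compositional residual (eps*), whose definition the summary does not give.

```python
# Toy data (invented for illustration): each component reports a distribution
# over its own two-way partition, and each sums to 1 on its own.
component_a = {"rain": 0.30, "no_rain": 0.70}
component_b = {"rain_and_wind": 0.45, "not_rain_and_wind": 0.55}


def locally_coherent(dist, tol=1e-9):
    """True if every probability lies in [0, 1] and the distribution sums to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in dist.values())
    return in_range and abs(sum(dist.values()) - 1.0) <= tol


def monotonicity_gap(p_joint, p_marginal):
    """Amount by which P(A and B) exceeds P(A); zero when the pair is consistent."""
    return max(0.0, p_joint - p_marginal)


# Both components pass the local check...
print(locally_coherent(component_a), locally_coherent(component_b))

# ...but combined they imply P(rain and wind) > P(rain), which no single
# probability measure can satisfy.
gap = monotonicity_gap(component_b["rain_and_wind"], component_a["rain"])
print(round(gap, 2))
```

A positive gap signals that no single joint distribution is consistent with both reports, even though neither component is wrong by its own lights.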