6 · arXiv cs.AI (Artificial Intelligence) · 5d ago

LLM-guided search framework discovers new quantum LDPC code families via structured concept evolution

Researchers introduce Structured Concept Evolution (SCE), a framework that pairs an LLM with an algebraic mutation grammar to discover lifted-product quantum LDPC code families. Rather than asking the LLM to design codes from scratch, the system evolves structured algebraic specifications, which enables the discovery of both abelian and non-abelian code families competitive with standard designs such as bivariate-bicycle codes. The results were obtained with lightweight models (GPT-5.4-mini and GPT-5.4-nano), suggesting that LLM-guided combinatorial search can be effective for hard discrete design problems in quantum error correction.
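For readers unfamiliar with the bivariate-bicycle baseline mentioned in the summary, the following is a minimal sketch of the commutation check that any CSS-type candidate has to pass. It is not the SCE pipeline, whose grammar and scoring are not described here. The helper names (`poly_matrix`, `gf2_matmul`, `css_commutes`, `hx_row_weights`) and the polynomial exponents are illustrative assumptions. The sketch builds two polynomials in commuting cyclic shifts x and y on Z_l × Z_m, forms H_X = [A | B] and H_Z = [Bᵀ | Aᵀ] implicitly, and verifies that H_X H_Zᵀ = AB + BA vanishes mod 2.

```python
# Hypothetical sketch: bivariate-bicycle-style check matrices over GF(2).
# Each matrix row is a Python int used as a bitmask (bit c set = entry 1 in column c).

def poly_matrix(l, m, terms):
    """Matrix of the polynomial sum of x^a y^b (for (a, b) in terms) acting on Z_l x Z_m."""
    n = l * m
    rows = [0] * n
    for i in range(l):
        for j in range(m):
            r = i * m + j
            for a, b in terms:
                c = ((i + a) % l) * m + ((j + b) % m)
                rows[r] ^= 1 << c  # XOR so that repeated terms cancel mod 2
    return rows

def gf2_matmul(A, B):
    """Product of two square bitmask matrices over GF(2)."""
    out = []
    for row in A:
        acc, k = 0, 0
        while row:
            if row & 1:
                acc ^= B[k]  # add row k of B whenever A[r][k] == 1
            row >>= 1
            k += 1
        out.append(acc)
    return out

def css_commutes(A, B):
    """With H_X = [A | B] and H_Z = [B^T | A^T], H_X H_Z^T equals AB + BA mod 2."""
    AB = gf2_matmul(A, B)
    BA = gf2_matmul(B, A)
    return all(p ^ q == 0 for p, q in zip(AB, BA))

def hx_row_weights(A, B):
    """Set of row weights of H_X = [A | B]; a small constant weight is the LDPC property."""
    return {bin(a).count("1") + bin(b).count("1") for a, b in zip(A, B)}

# Illustrative parameters (assumed for this sketch, not taken from the paper).
l, m = 12, 6
A = poly_matrix(l, m, [(3, 0), (0, 1), (0, 2)])  # x^3 + y + y^2
B = poly_matrix(l, m, [(0, 3), (1, 0), (2, 0)])  # y^3 + x + x^2

print(css_commutes(A, B))
print(hx_row_weights(A, B))
```

Because x and y generate an abelian group, A and B commute automatically, so the check passes for any choice of exponents. A search over such specifications would therefore spend its effort on code parameters such as rate and distance rather than on validity.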

Agent and Tool Ecosystem OpenAI Structured Concept Evolution GPT-5.4 mini GPT-5.4 nano

Related guides (2)

OpenAI

OpenAI: The Lab That Made AI a Household Word


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

8 · arXiv · cs.AI · 1mo ago

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

Frontier Model Releases Evaluation and Benchmarking large language models Erdős Problems OEIS Conjectures +3 more

5 · arXiv · cs.LG · 1mo ago

SchGen: LLM-Based PCB Schematic Generation via Semantic Code Representations

SchGen is presented as the first large language model system capable of generating editable PCB schematics from natural-language descriptions. The approach introduces a semantically grounded code representation that replaces verbose, geometry-heavy schematic formats with relative placement and pin-name-based wiring primitives, reframing the problem as a semantics-driven matching task. A large-scale dataset was constructed via a human-agent collaborative pipeline converting open-source hardware designs into the new representation. Experiments show SchGen outperforms alternative representations and larger general-purpose LLMs on wire connectivity accuracy and functional correctness.

Frontier Model Releases Agent and Tool Ecosystem semantic code representation human-agent collaborative pipeline wire connectivity accuracy +2 more

4 · arXiv · cs.AI · 19d ago

SECDA-DSE: LLM-guided design space exploration for FPGA accelerator generation

SECDA-DSE is a framework that integrates LLMs into the SECDA hardware-software co-design ecosystem to automate design space exploration (DSE) of FPGA-based AI accelerators. The system combines a structured architecture candidate generator with an LLM Stack using retrieval-augmented generation and chain-of-thought prompting, plus an iterative feedback loop. Evaluation demonstrates end-to-end synthesis and execution of three accelerator designs on real FPGA hardware, with results showing the approach captures kernel-specific compute/memory trade-offs while reducing manual design effort.

Training Infrastructure Agent and Tool Ecosystem chain-of-thought prompting SECDA-DSE Retrieval-Augmented Generation

6 · arXiv · cs.CL · 21d ago

LLM-guided MAP-Elites evolution improves medical decision pipelines at inference time

Researchers propose using LLM-guided MAP-Elites evolutionary search as an inference-time alternative to fine-tuning for adapting LLMs to clinical workflows, formulating triage, consultation, and image classification as evolutionary searches over executable artifacts. Across three medical settings, evolved programs substantially outperform manually designed baselines: triage accuracy improves from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, with gains also shown on MIMIC-ESI, iCRAFTMD, and PneumoniaMNIST. The approach works across Llama-3, Qwen-3.5, and Gemma-4 backbones and produces interpretable program-level mechanisms rather than superficial prompt changes.
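As a rough illustration of the MAP-Elites loop named in the summary, here is a minimal sketch under stated assumptions. The function names, the toy fitness, and the one-dimensional behavior descriptor are all hypothetical. In the LLM-guided setting described above, the `mutate` step would be an LLM proposing an edited program and `evaluate` would run it on clinical data; here both are simple numeric stand-ins.

```python
import random

def map_elites(evaluate, mutate, init, iterations=300, seed=0):
    """Keep the best candidate found so far in each behavior cell (the 'archive')."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, candidate)

    def consider(candidate):
        fitness, cell = evaluate(candidate)
        best = archive.get(cell)
        if best is None or fitness > best[0]:
            archive[cell] = (fitness, candidate)

    consider(init)
    for _ in range(iterations):
        # Pick a random elite as parent; sorting keeps runs reproducible.
        cell = rng.choice(sorted(archive))
        _, parent = archive[cell]
        consider(mutate(parent, rng))
    return archive

# Toy stand-ins (assumptions for illustration only).
def toy_evaluate(x):
    fitness = -(x - 0.62) ** 2   # higher is better, peak at x = 0.62
    cell = min(int(x * 5), 4)    # behavior descriptor: which fifth of [0, 1]
    return fitness, cell

def toy_mutate(x, rng):
    # Gaussian perturbation, clipped to [0, 1]
    return min(1.0, max(0.0, x + rng.gauss(0, 0.2)))

archive = map_elites(toy_evaluate, toy_mutate, init=0.5)
for cell in sorted(archive):
    fitness, candidate = archive[cell]
    print(cell, round(candidate, 3), round(fitness, 4))
```

The archive retains one elite per cell rather than a single global best, which is what lets the search keep diverse candidates alive while still improving each niche.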

Evaluation and Benchmarking Agent and Tool Ecosystem Gemma-4 E4B-it MIMIC-ESI iCRAFTMD +6 more

6 · arXiv · cs.CL · 24d ago

MLEvolve: Self-evolving multi-agent framework for automated ML algorithm discovery

MLEvolve is a new LLM-based multi-agent framework for end-to-end machine learning algorithm discovery that addresses limitations of existing MLE agents, including information isolation and memoryless search. The system introduces Progressive MCGS (a graph-extended tree search) and Retrospective Memory for experience accumulation, and it decouples strategic planning from code generation. Evaluated on MLE-Bench, it achieves state-of-the-art medal and valid-submission rates within a 12-hour budget, and it also outperforms AlphaEvolve on mathematical algorithm optimization tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem MLEvolve MLE-bench Progressive MCGS +3 more

5 · arXiv · cs.CL · 28d ago

PowerCodeBench: Knowledge Boundary Probing and Intervention for LLM-Based Power System Code Generation

This paper introduces PowerCodeBench, an execution-validated benchmark for evaluating LLMs on power-system simulation code generation using the pandapower library. The authors find that failures are dominated by API-knowledge boundary errors (hallucinated function names, misused parameters) rather than reasoning failures, and they propose a boundary-aware intervention that combines API demand estimation with targeted documentation injection. Evaluated across ten open-weight models (1.5B–480B) and four commercial APIs on 2,000 tasks, the intervention improves accuracy by 32–56 points while using only 41% of the baseline prompt-token cost. Open-weight models in the 70B–120B range match commercial mid-tier accuracy, with Llama-3.1-405B and Qwen3-Coder-480B leading.

Evaluation and Benchmarking Open Weights Progress pandapower Meta Llama 3.1 405B Alibaba +7 more

4 · arXiv · cs.CL · 6d ago

P4IR framework uses SFT + GRPO to improve LLM-based automated building code compliance

Researchers introduce P4IR, a two-stage framework combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to improve LLM accuracy in automated code compliance (ACC) for building regulations. The approach reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6% respectively versus SFT baselines, and outperforms Claude Opus/Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 in zero-shot settings. The work targets a narrow but practically important domain where LLM hallucinations carry real regulatory consequences.
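To make the token-level Levenshtein metric in the summary concrete, here is a minimal sketch of the standard dynamic-programming edit distance applied to token sequences. The whitespace tokenizer and the example rule strings are assumptions for illustration. The paper's own tokenization and output format are not given here.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning sequence a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical rule strings, tokenized on whitespace (an assumption).
reference = "ceiling_height >= 2.4 m".split()
predicted = "ceiling_height > 2.4 m".split()
print(levenshtein(predicted, reference))
```

Working on tokens rather than characters means a wrong operator or a wrong threshold each count as a single error, regardless of how many characters differ.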

Enterprise Deployment Patterns Alignment and RLHF GPT-5.2 Claude Opus 4.6 Claude Sonnet 4.5 +4 more

4 · arXiv · cs.CL · 1mo ago

LLM-Based Grammar Adaptation for Metamodel-Grammar Co-Evolution in Model-Driven Engineering

This paper proposes using LLMs to automate grammar adaptation when metamodels evolve in model-driven engineering, replacing tedious manual work and outperforming rule-based methods. Evaluated on six real-world Xtext DSLs using Claude Sonnet 4.5, ChatGPT 5.1, and Gemini 3, all three LLMs achieved 100% adaptation consistency on the test DSLs, versus 62–84% for rule-based approaches. A longitudinal study on QVTo showed that the LLMs successfully reused learned adaptations across all evolution steps without manual editing. However, on large-scale grammars (EAST-ADL, 297 rules), LLM adaptation consistency dropped well below 90%, revealing a scalability limitation.

Agent and Tool Ecosystem Xtext Claude Sonnet 4.5 QVTo +3 more