7 · arXiv cs.CL (Computation and Language) · 23d ago

Bidirectional Evolutionary Search (BES) for Self-Improving Language Models

BES is a search framework that combines forward evolutionary candidate generation with backward goal decomposition to address limitations of best-of-N and tree search methods. Forward search uses recombination operators to escape the narrow entropy shell of autoregressive expansion, while backward search recursively decomposes tasks into checkable subgoals for dense intermediate feedback. Theoretical analysis shows evolutionary operators can escape entropy-shell confinement and backward search can exponentially reduce required samples. Experiments demonstrate consistent gains on post-training tasks where mainstream algorithms fail, and superior performance on three open problem-solving benchmarks at inference time.

Evaluation and Benchmarking · Inference Economics · Agent and Tool Ecosystem · Alignment and RLHF · Embodied Minds Lab · Best-of-N Sampling · tree search · Bidirectional Evolutionary Search · BES (trained models)
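To give a feel for why checkable subgoals can cut the sample count so sharply, here is a back-of-the-envelope sketch. It is not the paper's analysis: it assumes a task splits into `k` independent subgoals, each solved by a single sample with probability `p`, and that each subgoal can be verified on its own. Both function names are hypothetical.

```python
def expected_samples_joint(p: float, k: int) -> float:
    """Expected samples until one attempt gets all k subgoals right at once.

    Without intermediate checks, a sample succeeds with probability p**k,
    so the expected number of draws is 1 / p**k (geometric distribution).
    """
    return 1.0 / (p ** k)


def expected_samples_decomposed(p: float, k: int) -> float:
    """Expected samples when each subgoal is sampled and checked separately.

    Each subgoal needs 1 / p draws on average, and there are k of them.
    """
    return k / p


joint = expected_samples_joint(0.5, 10)       # 1024.0
split = expected_samples_decomposed(0.5, 10)  # 20.0
```

Under these assumptions the joint cost grows exponentially in `k` while the decomposed cost grows linearly, which is the intuition behind dense intermediate feedback.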

Related guides (4)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating


Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Inference Economics · Topic guide

Inference Economics: The Cost Structure of Running AI Models in Production


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

7 · arXiv · cs.CL · 19d ago

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

SCOPE is a data-free self-play framework for training language models on open-ended tasks without external supervision or frontier-model judges. It co-evolves two policies—a Challenger that generates document-grounded tasks and a Solver that answers via multi-turn retrieval—using a frozen copy of the initial model as a self-judge that writes task-specific rubrics. Across three 7-8B models (Qwen2.5, Qwen3, OLMo-3), SCOPE achieves up to +10.4 points on eight open-ended benchmarks and +13.8 points on seven held-out short-form QA benchmarks, matching or exceeding GRPO trained on ~9K curated prompts. Ablations identify rubric generation quality as the primary bottleneck for self-judging.

Evaluation and Benchmarking · Open Weights Progress · SCOPE · Qwen2.5 · self-play

6 · arXiv · cs.CL · 15d ago

MLEvolve: Self-evolving multi-agent framework for automated ML algorithm discovery

MLEvolve is a new LLM-based multi-agent framework for end-to-end machine learning algorithm discovery, addressing limitations of existing MLE agents including information isolation and memoryless search. The system introduces Progressive MCGS (a graph-extended tree search), Retrospective Memory for experience accumulation, and decoupled strategic planning from code generation. Evaluated on MLE-Bench, it achieves state-of-the-art medal and valid submission rates within a 12-hour budget, and also outperforms AlphaEvolve on mathematical algorithm optimization tasks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MLEvolve · MLE-bench · Progressive MCGS

5 · OpenAI Blog · 1mo ago

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

OpenAI published research showing that evolution strategies (ES), a decades-old optimization technique, can match standard reinforcement learning performance on benchmarks like Atari and MuJoCo. The approach offers practical advantages over RL, including easier parallelization and lower sensitivity to hyperparameters. This positions ES as a viable alternative training paradigm for policy optimization tasks.

Evaluation and Benchmarking · Alignment and RLHF · Evolution Strategies · MuJoCo · Reinforcement Learning
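For readers unfamiliar with evolution strategies, a minimal sketch of the basic update follows. It is a simplified variant, not OpenAI's implementation: it perturbs the parameters with Gaussian noise, scores each perturbation, and moves along the reward-weighted average of the noise. The mean-reward baseline, the toy quadratic objective, and all hyperparameter values are assumptions chosen for illustration.

```python
import random


def evolution_strategy(f, theta, sigma=0.1, alpha=0.05,
                       population=50, iterations=200, seed=0):
    """Maximize f with a basic Gaussian-perturbation evolution strategy."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample one Gaussian noise vector per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        # Evaluate the objective at each perturbed parameter vector.
        rewards = [f([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        # Subtract the mean reward as a baseline (an assumption made here
        # to reduce the variance of the estimate).
        baseline = sum(rewards) / population
        for j in range(dim):
            grad_j = sum((r - baseline) * eps[j]
                         for r, eps in zip(rewards, noises))
            grad_j /= population * sigma
            theta[j] += alpha * grad_j
    return theta


# Toy objective with its maximum at (3, -2).
def toy_reward(x):
    return -((x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2)


solution = evolution_strategy(toy_reward, [0.0, 0.0])
```

Each population member only needs its noise seed and a scalar reward, which is why this style of update parallelizes easily across workers.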

5 · arXiv · cs.CL · 11d ago

BODHI: Contrastive embedding training for causal discovery in Large Behavioural Models

Researchers identify a critical failure mode in biomedical language model embeddings: off-the-shelf encoders (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarity (0.76–0.92) to causally unrelated cross-domain pairs, achieving 0% accuracy on cross-domain discrimination. The paper introduces BODHI, a contrastive training approach using hard negatives mined from a biomedical knowledge graph, which improves within-vs-across-domain separation from 1.05x to 2.30x and raises the discrimination gap by 0.392. The work targets Large Behavioural Models (LBMs), foundation models that reason over personal life graphs, where false embedding proximity directly produces false causal edges. Additional contributions include an OpenVINO inference optimization achieving a 133x latency reduction (1367ms to 10ms) on Intel AMX hardware, plus a counterintuitive finding that FP16 outperforms INT8 on this silicon.

Evaluation and Benchmarking · Inference Economics · BIOSSES · BioBERT · PubMedBERT

6 · arXiv · cs.CL · 12d ago

LLM-guided MAP-Elites evolution improves medical decision pipelines at inference time

Researchers propose using LLM-guided MAP-Elites evolutionary search as an inference-time alternative to fine-tuning for adapting LLMs to clinical workflows, formulating triage, consultation, and image classification as evolutionary searches over executable artifacts. Across three medical settings, evolved programs substantially outperform manually designed baselines: triage accuracy improves from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, with gains also shown on MIMIC-ESI, iCRAFTMD, and PneumoniaMNIST. The approach works across Llama-3, Qwen-3.5, and Gemma-4 backbones and produces interpretable program-level mechanisms rather than superficial prompt changes.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemma-4 E4B-it · MIMIC-ESI · iCRAFTMD
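The core MAP-Elites loop is short enough to sketch. MAP-Elites keeps an archive with one best solution ("elite") per behavior cell and repeatedly mutates a randomly chosen elite. In the paper the mutation step is LLM-guided and the solutions are executable artifacts; the sketch below substitutes random numeric mutation on a single float, and the toy `fitness`, `descriptor`, and `mutate` functions are hypothetical stand-ins.

```python
import random


def map_elites(fitness, descriptor, mutate, init, iterations=1000, seed=0):
    """Return an archive mapping each behavior cell to (fitness, solution)."""
    rng = random.Random(seed)
    archive = {}

    def try_insert(solution):
        cell = descriptor(solution)
        fit = fitness(solution)
        # Keep the candidate only if its cell is empty or it beats the elite.
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, solution)

    for solution in init:
        try_insert(solution)
    for _ in range(iterations):
        # Pick a random elite as the parent and mutate it.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


# Toy problem: solutions are floats in [0, 1], binned into five cells.
def fitness(x):
    return x * (1.0 - x)


def descriptor(x):
    return min(int(x * 5), 4)


def mutate(x, rng):
    return min(1.0, max(0.0, x + rng.gauss(0.0, 0.2)))


archive = map_elites(fitness, descriptor, mutate, init=[0.5])
```

Because every cell retains its own elite, the search returns a spread of distinct solutions rather than a single optimum.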

5 · arXiv · cs.AI · 10d ago

EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents

EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ACE · DeepSeek V4 · Qwen3-4B-Instruct

4 · arXiv · cs.AI · 3d ago

EvolveNav: Self-evolving memory and preflection for zero-shot object-goal navigation

EvolveNav is a new framework for Zero-Shot Object-Goal Navigation (ZS-OGN) that enables test-time improvement through a self-evolving agentic rule memory built from past trajectories. A retrieval strategy based on an upper confidence bound (UCB) balances semantic relevance and historical success when selecting rules, while a memory-guided preflection module forecasts action outcomes before execution to reduce inefficient exploration. The method achieves a 10.1% improvement in success rate over existing zero-shot baselines with fewer unnecessary steps.

Evaluation and Benchmarking · Agent and Tool Ecosystem · EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation · EvolveNav
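One way a UCB-style rule retrieval could look is sketched below. The summary does not say how EvolveNav combines its terms, so the additive score, the smoothed success rate, the `Rule` fields, and the constant `c` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    relevance: float  # hypothetical semantic similarity to the query, in [0, 1]
    successes: int = 0
    uses: int = 0


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    """Relevance plus smoothed success rate plus an exploration bonus."""
    # Laplace smoothing gives unused rules a neutral 0.5 success estimate.
    success_rate = (rule.successes + 1) / (rule.uses + 2)
    # The bonus shrinks as a rule is used more often.
    bonus = c * math.sqrt(math.log(total_uses + 1) / (rule.uses + 1))
    return rule.relevance + success_rate + bonus


def select_rule(rules, c: float = 1.0) -> Rule:
    total_uses = sum(rule.uses for rule in rules)
    return max(rules, key=lambda rule: ucb_score(rule, total_uses, c))


proven = Rule("check the kitchen for mugs", relevance=0.9, successes=8, uses=10)
weak = Rule("check the garage for mugs", relevance=0.2, successes=1, uses=10)
untried = Rule("check the dining table for mugs", relevance=0.85)
```

With only `proven` and `weak` in memory, the relevant and successful rule wins. Adding `untried` shifts the choice to the unused rule, because its exploration bonus is still large.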

4 · Hugging Face Blog · 1mo ago

Generating Human-level Text with Contrastive Search in Transformers

Hugging Face introduces contrastive search, a decoding strategy for autoregressive language models that aims to produce more coherent and human-like text compared to standard methods like beam search or nucleus sampling. The technique works by balancing a model's confidence in its next-token prediction against a contrastive penalty that discourages repetitive or degenerate outputs. The blog post describes integration of contrastive search into the Hugging Face Transformers library, making it accessible to practitioners.

Frontier Model Releases · Agent and Tool Ecosystem · Contrastive Search · Hugging Face Transformers · Hugging Face
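The trade-off between confidence and the contrastive penalty can be shown with one decoding step. The sketch below scores each candidate token as `(1 - alpha) * probability - alpha * max cosine similarity to earlier hidden states` and picks the highest score. It is a toy in pure Python, not the Transformers implementation: the two-dimensional hidden states, the candidate list, and the function names are made up for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_step(candidates, context_states, alpha=0.6):
    """Pick the next token from (token, probability, hidden_state) candidates.

    candidates stands in for the model's top-k tokens; context_states holds
    the hidden states of the tokens generated so far.
    """
    def score(candidate):
        _token, prob, hidden = candidate
        # Penalize candidates whose representation is close to any earlier token.
        penalty = max(cosine(hidden, h) for h in context_states)
        return (1.0 - alpha) * prob - alpha * penalty

    return max(candidates, key=score)[0]


context = [[1.0, 0.0]]                # hidden state of the previous token
candidates = [
    ("the", 0.6, [1.0, 0.0]),         # likely, but repeats the context
    ("cat", 0.3, [0.0, 1.0]),         # less likely, but novel
]
choice = contrastive_step(candidates, context)
greedy = contrastive_step(candidates, context, alpha=0.0)
```

With `alpha=0.6` the step prefers the novel token, while `alpha=0.0` reduces to picking the most probable candidate.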