5arXiv cs.AI (Artificial Intelligence)·41h ago

AIR: Adaptive Interleaved Reasoning with Code in Multimodal LLMs via Reinforcement Learning

Researchers propose AIR, a system that trains multimodal large language models to adaptively interleave reasoning with code execution for numerical computation tasks, going beyond prior work that focused only on visual operations. The approach combines a two-stage cold-start data pipeline, RL dataset filtering, and a group-constrained reward function for tool-invocation decisions. Experiments show a 6.1 percentage point average improvement on evaluation benchmarks, with interleaved reasoning samples gaining 9.9 pp and tool-use success exceeding 95%.

Agent and Tool Ecosystem Alignment and RLHF Multimodal Progress AIR: Adaptive Interleaved Reasoning with Code in MLLMs OpenAI

Related guides (3)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·1mo ago·source ↗

LANG: Reinforcement Learning Framework for Multilingual Reasoning with Language-Adaptive Hint Guidance

LANG is a new RL-based framework for improving multilingual reasoning in LLMs that addresses the trade-off between input-language consistency and reasoning quality. It uses language-conditioned hints with a progressive decay schedule and a language-adaptive switch to tailor learning to per-language difficulty. Empirical results on multilingual mathematical benchmarks show improved reasoning without language drift toward English, and the approach generalizes beyond mathematics.

Evaluation and Benchmarking Alignment and RLHF large language models LANG multilingual mathematical benchmarks +3 more

9Openai Blog·1mo ago·source ↗

Learning to Reason with LLMs

OpenAI announced a new model or capability focused on reasoning in large language models, published on September 12, 2024. The post, hosted on the OpenAI blog, describes advances in training LLMs to perform complex multi-step reasoning. This likely corresponds to the release of the o1 (formerly 'Strawberry') model series, which uses chain-of-thought reasoning trained via reinforcement learning to achieve significantly improved performance on math, science, and coding benchmarks.

Frontier Model Releases Evaluation and Benchmarking Chain-of-Thought Reasoning Reinforcement Learning OpenAI +3 more

6arXiv · cs.LG·8d ago·source ↗

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking Alignment and RLHF ExpRL: Exploratory RL for LLM Mid-Training GRPO ExpRL

6arXiv · cs.AI·29d ago·source ↗

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

Agent and Tool Ecosystem Alignment and RLHF Reasoning Enhancement Qwen3-4B ETCHR +5 more

6arXiv · cs.AI·26d ago·source ↗

Reasoning in Memory (RiM): Latent Reasoning via Working Memory Blocks in LLMs

RiM introduces a latent reasoning method that replaces autoregressive chain-of-thought token generation with fixed sequences of special 'memory block' tokens, allowing LLMs to perform internal computation without externalizing intermediate steps. These memory blocks are processed in a single forward pass rather than generated autoregressively, improving compute efficiency at test time. Training uses a two-stage curriculum: first grounding memory blocks by predicting explicit reasoning steps, then discarding step-level supervision and refining answers iteratively. Experiments across multiple model families and sizes show RiM matches or exceeds existing latent reasoning methods.

Evaluation and Benchmarking Inference Economics latent reasoning Chain-of-Thought Reasoning Reasoning in Memory (RiM)+3 more

5arXiv · cs.CL·8d ago·source ↗

ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs

Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.

Agent and Tool Ecosystem Alignment and RLHF GRPO ContextRL +1 more

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

A BAIR blog post surveys recent progress in parallel reasoning for LLMs, covering methods from simple self-consistency and Best-of-N sampling through structured search (Tree of Thoughts, MCTS) to newer adaptive approaches including ParaThinker, GroupThink, and Hogwild! Inference. The core motivation is that sequential reasoning scales linearly with exploration depth, causing latency, context-rot, and compute inefficiency. Adaptive parallel reasoning aims to let models themselves decide when and how to decompose tasks into concurrent threads, rather than imposing fixed parallel structure externally. The post frames this as an emerging inference-time scaling paradigm with implications for agentic and complex reasoning workloads.

Long Context Evolution Frontier Model Releases ParaThinker Berkeley AI Research (BAIR)DeepSeek V4 +11 more

4arXiv · cs.CL·14d ago·source ↗

GenAIR: LLM-grounded archetype representations improve sequential recommendation

GenAIR is a framework that uses LLMs to infer 'archetype' profiles of items' ideal target audiences, generating richer item embeddings for sequential recommendation systems. A behavioral calibration objective aligns these semantic embeddings with actual user interaction patterns, closing the gap between language-space representations and real-world behavior. Experiments on three datasets show consistent improvements over state-of-the-art baselines across multiple sequential recommendation models.

Enterprise Deployment Patterns GenAIR