GRASP: Gradient-based Planning for World Models at Longer Horizons
Researchers from Berkeley, Meta, and collaborators introduce GRASP, a gradient-based planner designed to make long-horizon planning with learned world models more robust. The method addresses three core failure modes: ill-conditioned computation graphs from backpropagation through time, non-greedy loss landscapes with many local minima, and brittle gradients through high-dimensional vision models. GRASP lifts trajectory optimization into virtual states for parallel optimization across time, injects stochasticity into state iterates for exploration, and reshapes gradients to avoid problematic state-input gradient paths. The work is positioned in the context of scaling world models toward general-purpose simulators usable for control and planning.
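The three ideas in the summary above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical 1D dynamics model `f(s, a) = s + a`, hand-derived gradients, and made-up hyperparameters (`lr`, `sigma0`, `decay`). Virtual states `v[1..T]` are free variables, so each step's residual is computed independently of the others. Gaussian noise with a decaying scale is added to the state iterates. The gradient that would flow into `v[t]` through the dynamics' state input is deliberately dropped.

```python
import random


def plan_lifted(s0, goal, horizon, steps=500, lr=0.1,
                sigma0=0.1, decay=0.97, seed=0):
    """Toy lifted planner for the assumed dynamics f(s, a) = s + a."""
    rng = random.Random(seed)
    v = [s0] * (horizon + 1)   # v[0] stays fixed at s0; v[1..T] are free
    a = [0.0] * horizon        # one action per time step
    sigma = sigma0
    for _ in range(steps):
        # One-step residuals; each depends only on local variables, so
        # they could be evaluated in parallel across time.
        r = [v[t + 1] - (v[t] + a[t]) for t in range(horizon)]

        # Gradient of sum(r_t^2) with respect to actions (df/da = 1).
        grad_a = [-2.0 * r[t] for t in range(horizon)]

        # Gradient with respect to virtual states. Only the "target" role
        # of v[t+1] contributes; the path through the state input of f
        # (which would add -2 * r[t] to v[t]) is omitted on purpose.
        grad_v = [0.0] * (horizon + 1)
        for t in range(horizon):
            grad_v[t + 1] += 2.0 * r[t]
        grad_v[horizon] += 2.0 * (v[horizon] - goal)  # goal term on v[T]

        for t in range(horizon):
            a[t] -= lr * grad_a[t]
        for t in range(1, horizon + 1):
            # Gradient step plus exploration noise on the state iterate.
            v[t] += -lr * grad_v[t] + sigma * rng.gauss(0.0, 1.0)
        sigma *= decay  # anneal the noise so the iterates can settle
    return a, v


def rollout(s0, actions):
    """Apply the actions sequentially under the same toy dynamics."""
    s = s0
    for act in actions:
        s = s + act
    return s


actions, states = plan_lifted(s0=0.0, goal=5.0, horizon=10)
final_state = rollout(0.0, actions)
```

On this linear toy problem neither the noise nor the dropped gradient path is needed for convergence. They are included only to show where each mechanism would sit in the update loop.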
Related events (8)
GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases
GRASP is a three-stage retrieval framework for semi-structured knowledge bases (SKBs) that combines plan-based graph retrieval, plan-conditioned dense retrieval fusion, and a fine-tuned reranker. It targets applications such as product search, academic search, and precision medicine over typed entity-relation graphs. On the STaRK benchmarks, GRASP raises average Hit@1 from 62.0 to 73.9, an 11.9-point gain over prior hybrid retrieval systems. Ablation studies confirm the contribution of each component.
GSPO: Group Sequence Policy Optimization for Scalable RL Training of Language Models
Qwen researchers introduce Group Sequence Policy Optimization (GSPO), a new RL algorithm designed to address severe training instability and model collapse observed in existing methods like GRPO during extended training runs. The core motivation is enabling stable RL scaling for language models to improve reasoning and problem-solving capabilities with increased compute. The paper targets a known bottleneck in post-training pipelines where instability prevents further performance gains.
N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning
A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.
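The neighbor-mixing idea described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's method: the toy embedding table, the cosine-similarity neighbor search, the uniform average over neighbors, and the mixing weight `alpha` are all hypothetical choices, and `k` must be at least 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mix_with_neighbors(embeddings, anchor, k=2, alpha=0.8):
    """Blend an anchor token's embedding with its k nearest neighbors.

    embeddings: dict mapping token -> list of floats (assumed toy table).
    Returns (mixed_vector, neighbor_tokens).
    """
    anchor_vec = embeddings[anchor]
    others = [tok for tok in embeddings if tok != anchor]
    # Rank the remaining tokens by similarity to the anchor.
    neighbors = sorted(
        others,
        key=lambda tok: cosine(anchor_vec, embeddings[tok]),
        reverse=True,
    )[:k]
    dim = len(anchor_vec)
    # Uniform average of the neighbor embeddings (an assumption).
    mean_nb = [
        sum(embeddings[n][i] for n in neighbors) / len(neighbors)
        for i in range(dim)
    ]
    # Convex combination: mostly the anchor, nudged toward its neighbors.
    mixed = [alpha * anchor_vec[i] + (1 - alpha) * mean_nb[i]
             for i in range(dim)]
    return mixed, neighbors


table = {
    "add": [1.0, 0.0],
    "sum": [0.9, 0.1],
    "plus": [0.8, 0.2],
    "cat": [0.0, 1.0],
}
mixed, nearest = mix_with_neighbors(table, "add", k=2, alpha=0.8)
```

Unlike adding random noise, the perturbation here stays within the span of semantically close tokens, which is the contrast the summary draws.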
General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks
GPRL proposes a new alignment framework that replaces scalar reward models with a General Preference Model (GPM) embedding responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes per-dimension group-relative advantages, normalizes across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work directly addresses the gap between verifiable-reward online RL (strong on math/code) and preference optimization (strong on open-ended tasks).
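As a rough illustration of per-dimension group-relative advantages, the sketch below standardizes each preference dimension within a group of responses and then averages across dimensions. The z-score normalization, the plain average across axes, and the small epsilon are assumptions made for illustration. The summary does not specify these details, and the drift monitor is not modeled here.

```python
import statistics


def per_dimension_advantages(scores, eps=1e-8):
    """Standardize each preference dimension within a response group.

    scores[i][d] is the (assumed) score of response i on dimension d.
    Returns a matrix of the same shape holding group-relative advantages.
    """
    n, k = len(scores), len(scores[0])
    adv = [[0.0] * k for _ in range(n)]
    for d in range(k):
        column = [row[d] for row in scores]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for i in range(n):
            # Subtract the group mean and rescale so that dimensions
            # with different units become comparable.
            adv[i][d] = (column[i] - mean) / (std + eps)
    return adv


def combine_axes(adv):
    """Average the per-dimension advantages into one value per response."""
    return [statistics.fmean(row) for row in adv]


group_scores = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
advantages = per_dimension_advantages(group_scores)
combined = combine_axes(advantages)
```

Because each axis is standardized separately, a dimension with a larger numeric range does not dominate the combined signal.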
Episodic Context and Persistent 3D World Models Enable Curiosity-Driven Exploration in Photorealistic Environments
This paper addresses the failure modes of curiosity-driven RL in complex 3D environments, where agents revisit forgotten states and get trapped in local loops due to lacking spatial persistence and episodic memory. The authors combine an online 3D reconstruction as a persistent world model with a sequence-model policy over RGB observations to maintain episodic trajectory context. Trained purely via intrinsic curiosity on HM3D, the agent outperforms RL-based active mapping baselines and zero-shot generalizes to Gibson and AI-generated environments. The approach also enables efficient downstream task adaptation for apple picking and image-goal navigation.
GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment
Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly, claiming improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.
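The entropy monitoring mentioned above can be sketched as follows. This is a minimal example that assumes per-step logits are available as plain lists and that a fixed threshold (a hypothetical choice) marks a step as high-uncertainty. How GGRO actually selects steps and constructs nudging tokens is not covered.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(step_logits, threshold):
    """Return indices of decoding steps whose entropy exceeds threshold."""
    return [
        i for i, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]


steps = [
    [10.0, 0.0, 0.0],   # confident: one token dominates
    [0.0, 0.0, 0.0],    # uniform over three tokens
    [5.0, 5.0, -5.0],   # split between two tokens
]
flagged = high_uncertainty_steps(steps, threshold=0.5)
```

Only the flagged steps would be candidates for intervention, which is what keeps the overhead low relative to re-ranking many full samples.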
Geometric Action Model (GAM) repurposes geometric foundation models for 3D-aware robot manipulation
Researchers propose the Geometric Action Model (GAM), a language-conditioned robot manipulation policy that splits a pretrained geometric foundation model (GFM) to serve simultaneously as an observation encoder, causal future predictor, and action decoder. Unlike existing vision-language-action models that operate on 2D image frames, GAM explicitly incorporates 3D geometric priors for contact-rich manipulation. The approach claims improvements in accuracy, robustness, speed, and model size over foundation-model-scale baselines across simulation and real-robot benchmarks.
WorldString: Actionable World Representation via Neural Architecture for Object State Modeling
This paper proposes WorldString, a neural architecture designed to model the state manifold of real-world objects by learning from point clouds or RGB-D video streams. Unlike prior approaches that rely on video generation or dynamic scene reconstruction, WorldString explicitly models object action states in a unified, principled framework. It is positioned as a foundational building block for physical world models, functioning as a versatile digital twin. Its fully differentiable structure is intended to enable integration with policy learning and neural dynamics.


