Entity · benchmark

RoboTwin

benchmark · active · robotwin-1b2fab3b · 3 events · first seen May 29, 2026

Aliases: RoboTwin

Co-occurring entities

Linear Diffusion Transformer · Libero-long · Cortex · Observation-Guided Video-Context Routing · AHA-WAM · Fast-WAM · Qwen-VLA · DOMINO · R2R · LIBERO · Simpler-WidowX · ALOHA · RxR · embodiment-aware prompt conditioning · Alibaba Qwen Team

More like this (12)

RoboTTT · RoboTTT · RoboGenesis · RoboTHOR · RoboWits · RoboMME · RoboReward · LeRobot · Robot Operating System · nanobot · Gemini Robotics ER 2 · Robostral Navigate

Recent events (3)

5 · arXiv · cs.AI · Jul 7, 2026

Cortex: Bidirectionally aligned VLM-VLA framework for long-horizon robot manipulation

Cortex is a new embodied-agent framework that bridges the semantic gap between high-level Vision-Language Model (VLM) planning and low-level Vision-Language-Action (VLA) execution for long-horizon robotic manipulation. The system standardizes manipulation into 32 canonical skill primitives and uses an event-balanced sampling strategy to handle ambiguity at subtask transitions, enabling automatic annotation of over 4,000 hours of video data. Cortex outperforms monolithic baselines by 3.1% on the Libero-long benchmark and by 4.1% on RoboTwin, and demonstrates zero-shot generalization to unseen real-world tasks such as multi-stage chemistry experiments.

Agent and Tool Ecosystem · Multimodal Progress · Libero-long · RoboTwin · Cortex

6 · arXiv · cs.AI · Jun 9, 2026

AHA-WAM: Asynchronous world-action modeling with temporal decoupling for robot manipulation

AHA-WAM introduces a dual Diffusion Transformer architecture that decouples world prediction (low-frequency) from action execution (high-frequency) in robot manipulation policies, addressing the inefficiency of existing world-action models that force both branches to operate at the same temporal resolution. The system uses a rolling key-value memory video DiT as a long-horizon scene planner and a fast action DiT that queries layerwise latent context via joint attention, with Observation-Guided Video-Context Routing enabling asynchronous execution. On RoboTwin benchmarks, AHA-WAM achieves 92.80% average success and 78.3% on real-world tasks at 24.17 Hz, a 4.59x speedup over Fast-WAM, without robot-data pretraining.

Inference Economics · RoboTwin · Linear Diffusion Transformer · Observation-Guided Video-Context Routing · +2 more

7 · arXiv · cs.CL · May 29, 2026

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Reported results include 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% oracle success rate (OSR) on R2R navigation, and 76.9% average out-of-distribution (OOD) success in real-world ALOHA experiments.

Frontier Model Releases · Evaluation and Benchmarking · Qwen-VLA · DOMINO · R2R · +10 more