Entity · model

Vision-Language-Action models

model · active · vision-language-action-models-d6ef443b · 5 events · first seen May 29, 2026

Aliases: Vision-Language-Action models, Vision-Language-Action

Co-occurring entities

JoyNexus, NVIDIA, test-time training, RoboTTT, Mistral AI, SAP, SAP AI Foundation, Helsing, RoboWits, multi-agent cooperative framework, UMass Embodied AGI, bi-manual robotic manipulation, simplex-volume minimization, DynaFLIP, 3D optical flow, contrastive learning

More like this (12)

Vision-Language-Action model
Vision-Language Models
visual language model
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
From Fixed to Free Cameras: Calibration-Free View-Robust Vision-Language-Action Model
Geometric Action Model
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models
vision-language grounding

Recent events (5)

4 · arXiv · cs.AI · Jul 20, 2026

JoyNexus: Multi-tenant post-training service for Vision-Language-Action models

JoyNexus is a proposed unified service architecture for multi-tenant supervised fine-tuning, reinforcement learning, and evaluation of Vision-Language-Action (VLA) models. The system decouples training, inference, and environment services behind APIs, with shared base models and tenant-isolated policy slots, scheduled via global training and inference queues. A group-batching technique enables a single shared backbone forward pass over heterogeneous VLA data from multiple tenants, reducing aggregate GPU time compared to isolated single-tenant execution. The work targets the inefficiency of exclusive GPU allocation for short or bursty robotics workloads.

Training Infrastructure · Agent and Tool Ecosystem · Vision-Language-Action models · JoyNexus
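
The group-batching idea in the summary above can be illustrated with a toy sketch. The summary does not describe JoyNexus's actual implementation, so everything here is an assumption made for illustration: the `TenantRequest` type, the `shared_backbone` stand-in, and the per-tenant `heads` are hypothetical names, and plain Python lists stand in for tensors. The sketch only shows the general pattern of merging rows from several tenants, running the shared model once, and routing each slice back to a tenant-specific head.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class TenantRequest:
    tenant_id: str          # hypothetical tenant identifier
    inputs: List[Vector]    # already-encoded observations (assumed format)


def shared_backbone(batch: List[Vector]) -> List[Vector]:
    """Stand-in for the shared base model: a fixed elementwise transform."""
    return [[2.0 * x + 1.0 for x in row] for row in batch]


def group_batched_forward(
    requests: List[TenantRequest],
    heads: Dict[str, Callable[[Vector], float]],
    backbone: Callable[[List[Vector]], List[Vector]] = shared_backbone,
) -> Dict[str, List[float]]:
    # 1. Merge every tenant's rows into one batch, remembering each span.
    merged: List[Vector] = []
    spans = []
    for req in requests:
        start = len(merged)
        merged.extend(req.inputs)
        spans.append((req.tenant_id, start, len(merged)))

    # 2. Run the shared backbone once over the merged batch.
    features = backbone(merged)

    # 3. Route each tenant's slice to its own (isolated) head.
    outputs: Dict[str, List[float]] = {}
    for tenant_id, start, end in spans:
        head = heads[tenant_id]
        outputs.setdefault(tenant_id, []).extend(
            head(row) for row in features[start:end]
        )
    return outputs
```

In this toy, two tenants with one and two rows share a single backbone call over three rows, instead of two separate calls; only the heads differ per tenant.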

7 · arXiv · cs.LG · Jul 17, 2026

RoboTTT scales robot policy context to 8K timesteps via Test-Time Training, enabling one-shot imitation and long-horizon tasks

NVIDIA researchers introduce RoboTTT, a robot foundation model training recipe that extends visuomotor context to 8,000 timesteps, three orders of magnitude beyond the current state of the art, without increasing inference latency. The approach integrates Test-Time Training into Vision-Language-Action policies, using fast weights (parameters updated by gradient descent during inference) to compress long histories into weight space. On real-robot manipulation tasks, RoboTTT achieves an 87% performance improvement over single-step baselines and is the first system to fully complete a five-minute, ten-stage assembly task. The work identifies context length as a new scaling axis for robot foundation models, with 8K-context pretraining outperforming 1K-context pretraining by 62%.

Long Context Evolution · Frontier Model Releases · Vision-Language-Action models · NVIDIA · test-time training
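
The fast-weight mechanism mentioned in the summary above (parameters updated by gradient descent during inference) can be sketched generically. This is not RoboTTT's recipe, whose loss, architecture, and update rule are not given here. It is a minimal linear fast-weight memory under assumed choices: a squared-error objective, a plain gradient step, and the hypothetical names `FastWeightMemory`, `update`, and `read`. The point it illustrates is that the memory stays the same size however many timesteps are written into it.

```python
from typing import List

Vector = List[float]


def matvec(W: List[Vector], x: Vector) -> Vector:
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class FastWeightMemory:
    """Fixed-size memory whose weights are trained during inference."""

    def __init__(self, dim: int, lr: float = 0.5) -> None:
        self.W = [[0.0] * dim for _ in range(dim)]  # fast weights
        self.lr = lr                                # assumed step size

    def update(self, key: Vector, value: Vector) -> float:
        """One gradient step on 0.5 * ||W key - value||^2; returns the loss
        measured before the step."""
        pred = matvec(self.W, key)
        err = [p - v for p, v in zip(pred, value)]
        # The gradient w.r.t. W is the outer product err * key^T.
        for i, e in enumerate(err):
            for j, k in enumerate(key):
                self.W[i][j] -= self.lr * e * k
        return 0.5 * sum(e * e for e in err)

    def read(self, query: Vector) -> Vector:
        """Retrieve from the compressed history with the current weights."""
        return matvec(self.W, query)
```

Calling `update` once per timestep writes the history into `W`; `read` then costs the same regardless of how many steps were written, which is the property the summary attributes to compressing history into weight space.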

7 · Mistral AI News · Jun 1, 2026

Mistral AI Announces Strategic Partnerships with SAP and Helsing for German/European AI Sovereignty

Mistral AI has announced a multiyear partnership with SAP to deliver a sovereign AI stack for Germany and Europe, integrating Mistral models into SAP's AI Foundation and co-developing industry-specific solutions. Separately, Mistral is partnering with defense-AI firm Helsing to develop vision-language-action models for defense and security applications. The company is also expanding its physical presence in Germany with a new office and increased local headcount, framing these moves as part of a broader commitment to European AI autonomy.

Frontier Model Releases · Enterprise Deployment Patterns · Mistral AI · SAP · Vision-Language-Action models

5 · arXiv · cs.AI · May 29, 2026

RoboWits: Benchmark for Robotic Creative Problem Solving Under Unexpected Conditions

RoboWits is a new bi-manual robotic benchmark designed to evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions in robotics. The authors introduce an automated multi-agent task generation pipeline that produces 30 seed tasks and 208 mutated tasks spanning geometry, material, and assembly-based reasoning. Benchmarking results show that pre-trained Vision-Language-Action models (VLAs) achieve limited success on seed tasks after fine-tuning but fail on mutated variants, exposing brittleness in reasoning and strategy adaptation. The benchmark highlights a significant gap between skill-level execution and genuine cognitive reasoning in current robotic systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Vision-Language-Action models · RoboWits · multi-agent cooperative framework

6 · arXiv · cs.LG · May 29, 2026

DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception

DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies, including VLAs, with gains of up to 22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.

Frontier Model Releases · Agent and Tool Ecosystem · Vision-Language-Action models · simplex-volume minimization · DynaFLIP
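
One way to read "simplex-volume minimization in a shared hyperspherical space" is through the Gram determinant: for three unit vectors, the simplex with vertices at the origin and the three embeddings has volume sqrt(det G) / 3!, where G holds the pairwise inner products. The summary does not give DynaFLIP's exact formulation, so this is an assumed stand-in, and the function name `simplex_volume` and its arguments are hypothetical. The sketch shows only that the quantity shrinks toward zero as the image, language, and flow embeddings align.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def simplex_volume(image: Vector, text: Vector, flow: Vector) -> float:
    """Volume of the simplex spanned by the origin and three unit embeddings.

    Assumed formulation: sqrt(det(G)) / 3!, with G the 3x3 Gram matrix.
    """
    vs = [normalize(v) for v in (image, text, flow)]
    g = [[dot(a, b) for b in vs] for a in vs]
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(det, 0.0)) / 6.0
```

Mutually orthogonal embeddings give the largest value (1/6), and identical directions give a value near zero, so using this quantity as a loss term would pull the three modalities toward a common direction. In the summary it is paired with cosine regularization and contrastive objectives.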