Ω-QVLA: Training-Free W4A4 Quantization for Full Vision-Language-Action Models Including Diffusion Action Heads
Ω-QVLA is a post-training quantization framework that compresses both the LLM backbone and the diffusion-based action head of VLA models to uniform W4A4 precision (4-bit weights and 4-bit activations) without mixed-precision schemes or fine-tuning. It combines composite SVD-Hadamard rotation for weight energy equalization with per-step DiT activation scaling to handle dynamic-range drift across denoising steps. On the LIBERO benchmark, it achieves 98.0% and 87.8% task success on Pi 0.5 and GR00T N1.5 respectively, matching or exceeding FP16 baselines, while reducing static memory footprint by 71.3%. Real-world manipulation experiments confirm the approach generalizes beyond simulation.
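To make the idea of rotation-based energy equalization concrete, here is a minimal NumPy sketch. It is not the paper's composite SVD-Hadamard method: it applies only a plain orthonormal Hadamard rotation to a synthetic weight matrix with two outlier input channels, then compares the 4-bit round-trip error with and without the rotation. The function names (`hadamard`, `fake_quant_int4`, `demo`), the matrix size, and the outlier pattern are all assumptions chosen for illustration.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction.

    n must be a power of two.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def fake_quant_int4(w: np.ndarray) -> np.ndarray:
    """Symmetric per-row 4-bit quantize-then-dequantize (levels -7..7)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -7, 7)
    return q * scale


def demo(n: int = 64, seed: int = 0):
    """Return (error without rotation, error with rotation)."""
    rng = np.random.default_rng(seed)
    # Synthetic weights: small Gaussian entries plus two outlier input channels.
    w = 0.02 * rng.standard_normal((n, n))
    w[:, :2] *= 50.0

    h = hadamard(n)

    # Baseline: quantize the weights directly.
    err_plain = np.linalg.norm(w - fake_quant_int4(w))

    # Rotated: quantize W @ H, then undo the rotation with H.T.
    # Because H is orthogonal, (W @ H) @ (H.T @ x) == W @ x in full precision.
    w_rot = w @ h
    err_rot = np.linalg.norm(w - fake_quant_int4(w_rot) @ h.T)
    return err_plain, err_rot


if __name__ == "__main__":
    err_plain, err_rot = demo()
    print(f"4-bit error without rotation: {err_plain:.4f}")
    print(f"4-bit error with Hadamard rotation: {err_rot:.4f}")
```

In this toy setup the rotation spreads the outlier channels' energy across all columns, so the per-row quantization scale is no longer set by a couple of extreme values and the small entries are no longer rounded to zero.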
Related events (8)
Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments
Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.
Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
Hugging Face published a blog post detailing the integration of 4-bit quantization via bitsandbytes into the Transformers library, enabling large language models to run on consumer-grade hardware. The post covers NF4 (NormalFloat4) data type and double quantization techniques from the QLoRA paper, which together reduce memory footprint significantly while preserving model quality. It demonstrates how users can load models like LLaMA in 4-bit precision and fine-tune them using QLoRA with minimal code changes.
Introducing AutoRound: Intel's Advanced Quantization for LLMs and VLMs
Intel has released AutoRound, an advanced quantization technique for large language models and vision-language models, announced via the Hugging Face blog. AutoRound targets efficient low-bit quantization to reduce model size and inference costs while preserving accuracy. The tool is positioned as a production-ready quantization solution integrated with the Hugging Face ecosystem.
LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics
Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe that combines FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching post-training via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.
SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data
Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. The release continues Hugging Face's strategy of democratizing robotics AI through open community data pipelines.
TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation
Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and real-world tasks show flexible bidirectional speed control. Coupling the policy with a large multimodal model adds dynamic adaptation: accelerating in low-risk transit phases and decelerating for high-risk contact stages.
LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?
This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.
VQ-Diffusion: Vector Quantized Diffusion Models on Hugging Face
This Hugging Face blog post introduces VQ-Diffusion, a text-to-image generation approach that combines vector quantization with diffusion models. The method operates in a discrete latent space defined by a VQ-VAE codebook, applying the diffusion process to token sequences rather than continuous pixel or latent representations. The post likely covers integration into the Hugging Face diffusers ecosystem and demonstrates generation capabilities.
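The discrete latent space mentioned in the VQ-Diffusion summary rests on a codebook lookup: each continuous latent vector is replaced by the index of its nearest codebook entry. The NumPy sketch below shows only that lookup step, not the diffusion process over tokens and not the diffusers API. The names `vq_encode` and `vq_decode` and the tiny codebook are hypothetical, chosen for illustration.

```python
import numpy as np


def vq_encode(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector (N, D) to the index of its nearest code (K, D)."""
    # Squared Euclidean distance between every latent and every code: (N, K).
    diff = latents[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    return dist2.argmin(axis=1)


def vq_decode(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the code vectors for a sequence of token indices."""
    return codebook[indices]


if __name__ == "__main__":
    codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
    latents = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9]])
    tokens = vq_encode(latents, codebook)
    print(tokens)                       # discrete token indices
    print(vq_decode(tokens, codebook))  # quantized latents
```

A discrete diffusion model of this kind would then operate on the integer token sequence returned by `vq_encode` rather than on the continuous latents.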


