Entity · model

ViT (Vision Transformer)

model · active · vit-vision-transformer--f25a37b0 · 6 events · first seen May 19, 2026

Aliases: ViT (Vision Transformer), Vision Transformers (ViTs), Vision Transformer (ViT), Vision Transformer

Co-occurring entities

Patch Policy · OpenVLA-OFT · Towards Robustness against Typographic Attack with Training-free Concept Localization · RIO-Bench · CLIP · HiReLC · Formalizing the Binding Problem · Orthogonal Residual Projection · AWQ · LLaMA-7B · Power-of-Two (PoT) Quantization · OrpQuant / ORP · Hugging Face · ALIGN · Kakao Brain

More like this (12)

VisualMem · visual geometry transformer · τ-Voice · ViT-Base · ActiveVision · GPT-4 Turbo with Vision · IVGT · Vision-OPD · VPT Model · MobileViT-S · Veo · MoonViT

Recent events (6)

6 · arXiv · cs.LG · Jul 21, 2026

Patch Policy: Lightweight transformer policy using dense ViT patch tokens for robot control

Patch Policy is a new robot learning architecture that enables transformer-based policies to consume dense pre-trained Vision Transformer patch tokens directly, without the overhead of a full vision-language model backbone. A block-causal attention mask preserves temporal causality while allowing the model to attend over many patch tokens per observation. Across four simulated and three real-world environment suites, the method achieves a 40% relative improvement over global-pooled representation baselines and outperforms fine-tuned OpenVLA-OFT by 18% while using roughly 0.7% of its parameters. The work addresses a practical gap between lightweight robot policies and expensive VLA models.

Inference Economics · Multimodal Progress · ViT (Vision Transformer) · Patch Policy · OpenVLA-OFT
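The block-causal mask mentioned in the Patch Policy summary can be illustrated with a minimal sketch. It assumes each observation contributes a fixed number of patch tokens and that tokens may attend to every token from the same or an earlier timestep. The function name and parameters are hypothetical and are not taken from the paper.

```python
def block_causal_mask(num_steps: int, tokens_per_step: int) -> list[list[bool]]:
    """Build a block-causal attention mask.

    mask[i][j] is True when query token i may attend to key token j.
    Tokens are grouped into timesteps of `tokens_per_step` patch tokens
    (an assumed layout); a token sees its own timestep and all earlier ones.
    """
    total = num_steps * tokens_per_step
    mask = []
    for i in range(total):
        step_i = i // tokens_per_step  # timestep of the query token
        # Allow keys whose timestep is not later than the query's timestep.
        row = [(j // tokens_per_step) <= step_i for j in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two timesteps with two patch tokens each.
    for row in block_causal_mask(2, 2):
        print(["x" if allowed else "." for allowed in row])
```

Unlike a token-level causal mask, this layout lets all patch tokens of one observation attend to each other, while future observations stay hidden.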

5 · arXiv · cs.CL · Jul 3, 2026

Training-free mechanistic defense against typographic attacks on CLIP-based vision encoders

Researchers propose a training-free method to defend CLIP-based vision encoders against typographic attacks, where irrelevant text embedded in images biases visual representations toward lexical rather than semantic meaning. The approach uses sampling-based mechanistic interpretability to identify specific Vision Transformer attention heads responsible for encoding lexical information, then applies targeted circuit-level interventions to suppress this behavior. Without any retraining, the method outperforms both supervised and training-free baselines on object classification and improves Visual Question Answering accuracy under typographic attack conditions on RIO-Bench across several state-of-the-art LVLMs.

Evaluation and Benchmarking · AI Safety Research · ViT (Vision Transformer) · Towards Robustness against Typographic Attack with Training-free Concept Localization · RIO-Bench · +2 more
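One common form of head-level intervention is to scale down the output of selected attention heads before they are concatenated. The sketch below shows that generic idea only. The summary does not specify the paper's actual intervention, so the function, its arguments, and the choice of scaling to zero are assumptions for illustration.

```python
def suppress_heads(head_outputs: list[list[float]],
                   heads_to_suppress: set[int],
                   scale: float = 0.0) -> list[float]:
    """Concatenate per-head outputs for one token, scaling chosen heads.

    head_outputs[h] is the output vector of head h (hypothetical layout).
    Heads listed in `heads_to_suppress` are multiplied by `scale`
    (0.0 removes their contribution entirely); other heads pass through.
    """
    combined = []
    for h, vec in enumerate(head_outputs):
        factor = scale if h in heads_to_suppress else 1.0
        combined.extend(factor * v for v in vec)
    return combined


if __name__ == "__main__":
    outputs = [[1.0, 2.0], [3.0, 4.0]]
    # Suppress head 1, keep head 0 unchanged.
    print(suppress_heads(outputs, {1}))
```

Because no weights change, an intervention of this kind needs no retraining, which matches the training-free framing of the summary.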

4 · arXiv · cs.AI · Jun 25, 2026

HiReLC: Hierarchical Reinforcement Learning Framework for Joint Neural Network Pruning and Quantization

Researchers introduce HiReLC, a hierarchical ensemble-RL framework that automates joint quantization and structured pruning of deep neural networks. The system uses agents at two levels: low-level agents select per-kernel compression configurations, and high-level agents coordinate global budget allocation using Fisher Information-based sensitivity estimates. In experiments on Vision Transformers and CNNs, it achieves 5.99–6.72× parameter-storage compression with accuracy drops of 0.55–5.62% in most settings. The controller is architecture-agnostic and uses a surrogate MLP with an active learning loop to reduce policy evaluation cost.

Training Infrastructure · Inference Economics · HiReLC · ViT (Vision Transformer)
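To make the idea of a Fisher Information-based sensitivity estimate concrete, here is a minimal sketch that scores each layer by its mean squared gradient (a diagonal empirical Fisher estimate) and gives more bits to the most sensitive layers. HiReLC uses RL agents for the allocation itself; the fixed ranking rule, function names, and bit widths below are assumptions for illustration only.

```python
def fisher_sensitivity(grad_samples: list[list[float]]) -> float:
    """Mean squared gradient over samples and parameters for one layer.

    grad_samples[n] is the gradient vector of the layer's parameters
    for data sample n (hypothetical input format).
    """
    count = len(grad_samples) * len(grad_samples[0])
    return sum(g * g for sample in grad_samples for g in sample) / count


def allocate_bits(sensitivities: dict[str, float],
                  low_bits: int = 4,
                  high_bits: int = 8,
                  num_high: int = 1) -> dict[str, int]:
    """Give `high_bits` to the `num_high` most sensitive layers, `low_bits` to the rest."""
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    keep_high = set(ranked[:num_high])
    return {name: (high_bits if name in keep_high else low_bits)
            for name in sensitivities}


if __name__ == "__main__":
    scores = {
        "attn": fisher_sensitivity([[1.0, -1.0], [3.0, 1.0]]),
        "mlp": fisher_sensitivity([[0.5, 0.5], [0.5, -0.5]]),
    }
    print(scores)
    print(allocate_bits(scores))
```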

5 · arXiv · cs.LG · Jun 3, 2026

Information-theoretic formalization of the binding problem in Vision Transformers

Researchers introduce a formal information-theoretic framework for the binding problem — the challenge of associating features (color, shape) with the correct objects in multi-object scenes. They develop a probing method to measure binding information in model representations and apply it to several pre-trained Vision Transformers, examining components like the [CLS] token and spatial tokens across datasets with feature sharing, occlusion, and natural features. Results position binding information as a key factor in visual recognition and reasoning quality, and suggest current ViT architectures have limited binding capability, consistent with known failure modes.

Evaluation and Benchmarking · Multimodal Progress · ViT (Vision Transformer) · Formalizing the Binding Problem

6 · arXiv · cs.AI · May 26, 2026

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

This paper introduces Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework for ultra-low-bit quantization of LLMs and Vision Transformers targeting edge deployment. ORP addresses the structural limitations of Power-of-Two (PoT) quantization by formulating quantization as a dual-basis geometric projection that synthesizes higher-resolution residual lattices using only shift-and-add operations, eliminating multipliers. At 3-bit (W3/A16), ORP achieves 6.10 perplexity on LLaMA-2-7B, competitive with MAC-intensive baselines like AWQ, while reducing full-model calibration time to ~15 minutes. RTL synthesis at 28nm confirms hardware efficiency by mitigating timing bottlenecks from dense multiplier trees.

Training Infrastructure · Evaluation and Benchmarking · ViT (Vision Transformer) · Orthogonal Residual Projection · AWQ · +5 more
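The shift-and-add idea behind Power-of-Two quantization can be sketched generically: approximate a weight by one signed power of two, then optionally add a second power-of-two term fitted to the residual. This is not the ORP algorithm, whose dual-basis projection is not described in the summary. The exponent range, rounding rule, and function names are assumptions.

```python
import math


def nearest_pot(x: float, min_exp: int = -8, max_exp: int = 0) -> tuple[int, int]:
    """Return (sign, exponent) with sign * 2**exponent approximating x.

    Rounds in the log2 domain and clips the exponent to [min_exp, max_exp].
    A sign of 0 encodes an exact zero term.
    """
    if x == 0.0:
        return 0, min_exp
    sign = 1 if x > 0 else -1
    exp = max(min_exp, min(max_exp, round(math.log2(abs(x)))))
    return sign, exp


def two_term_pot(w: float) -> list[tuple[int, int]]:
    """Approximate w by one PoT term plus a PoT term for the residual."""
    s1, e1 = nearest_pot(w)
    residual = w - s1 * 2.0 ** e1
    s2, e2 = nearest_pot(residual)
    # Drop the second term when it would not reduce the error.
    if abs(residual - s2 * 2.0 ** e2) >= abs(residual):
        s2 = 0
    return [(s1, e1), (s2, e2)]


def shift_add_multiply(x: int, terms: list[tuple[int, int]]) -> int:
    """Multiply integer x by the PoT-coded weight using shifts and adds only."""
    total = 0
    for sign, exp in terms:
        shifted = x << exp if exp >= 0 else x >> -exp  # right shift floors
        total += shifted if sign > 0 else -shifted if sign < 0 else 0
    return total


if __name__ == "__main__":
    terms = two_term_pot(0.75)           # 0.75 = 2**0 - 2**-2
    print(terms, shift_add_multiply(64, terms))
```

A single power-of-two term would round 0.75 to 1.0; the residual term recovers the exact value here without a multiplier.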

4 · Hugging Face Blog · May 19, 2026

New ViT and ALIGN Models From Kakao Brain

Kakao Brain released new Vision Transformer (ViT) and ALIGN models, announced via the Hugging Face blog. The post covers multimodal vision-language models contributed to the open ecosystem. These models expand the available open-weights options for image-text tasks.

Open Weights Progress · Multimodal Progress · ViT (Vision Transformer) · Hugging Face · ALIGN · +1 more