Almanac
model

ViT (Vision Transformer)

model · active · vit-vision-transformer--f25a37b0 · 3 events · first seen 28d ago

Aliases: ViT (Vision Transformer), Vision Transformers (ViTs), Vision Transformer (ViT), Vision Transformer

Recent events (3)

5 · arXiv · cs.LG · 14d ago

Information-theoretic formalization of the binding problem in Vision Transformers

Researchers introduce a formal information-theoretic framework for the binding problem: the challenge of associating features (such as color and shape) with the correct objects in multi-object scenes. They develop a probing method to measure binding information in model representations and apply it to several pre-trained Vision Transformers, examining components such as the [CLS] token and spatial tokens across datasets with feature sharing, occlusion, and natural features. The results position binding information as a key factor in the quality of visual recognition and reasoning, and suggest that current ViT architectures have limited binding capability, consistent with known failure modes.

4 · Hugging Face Blog · 28d ago

New ViT and ALIGN Models From Kakao Brain

Kakao Brain released new Vision Transformer (ViT) and ALIGN models, announced via the Hugging Face blog. The post covers multimodal vision-language models contributed to the open ecosystem. These models expand the available open-weights options for image-text tasks.

6 · arXiv · cs.AI · 22d ago

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

This paper introduces Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework for ultra-low-bit quantization of LLMs and Vision Transformers targeting edge deployment. ORP addresses the structural limitations of Power-of-Two (PoT) quantization by formulating quantization as a dual-basis geometric projection that synthesizes higher-resolution residual lattices using only shift-and-add operations, eliminating multipliers. At 3-bit (W3/A16), ORP achieves 6.10 perplexity on LLaMA-2-7B, competitive with MAC-intensive baselines like AWQ, while reducing full-model calibration time to ~15 minutes. RTL synthesis at 28nm confirms hardware efficiency by mitigating timing bottlenecks from dense multiplier trees.