MS COCO
ms-coco-cd79b25f · 2 events · first seen 14d ago
Aliases: MS COCO
Recent events (2)
Apple researchers propose Feature Auto-Encoder to speed diffusion training via compressed DINOv2 embeddings
Researchers at Apple introduced Feature Auto-Encoder (FAE), a latent diffusion image generator that compresses DINOv2 vision encoder embeddings before learning to denoise them, then expands them back for decoding. The approach achieves image quality comparable to state-of-the-art diffusion models while training roughly 7x faster on ImageNet class-conditional generation. The key insight is that shrinking semantically rich vision embeddings reduces compute during diffusion training without sacrificing the representational benefits of large pretrained encoders.
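The compress, noise, and expand flow described above can be illustrated with a rough NumPy sketch. This is not Apple's implementation: the dimensions, the random linear projections `W_down` and `W_up`, and the helper names are all hypothetical stand-ins for learned components, and the noising step is the generic forward-diffusion formula rather than anything specific to FAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: wide encoder embeddings squeezed into a small latent.
EMBED_DIM, LATENT_DIM = 768, 32

# Stand-ins for learned projections (random here, trained in a real system).
W_down = rng.normal(size=(EMBED_DIM, LATENT_DIM)) / np.sqrt(EMBED_DIM)
W_up = rng.normal(size=(LATENT_DIM, EMBED_DIM)) / np.sqrt(LATENT_DIM)


def compress(embeddings):
    """Project encoder embeddings down to the compact latent space."""
    return embeddings @ W_down


def expand(latents):
    """Project latents back up to the encoder's embedding width."""
    return latents @ W_up


def add_noise(latents, alpha_bar, rng):
    """Generic forward diffusion: sqrt(a) * z + sqrt(1 - a) * eps."""
    eps = rng.normal(size=latents.shape)
    noisy = np.sqrt(alpha_bar) * latents + np.sqrt(1.0 - alpha_bar) * eps
    return noisy, eps


embeddings = rng.normal(size=(4, EMBED_DIM))  # fake batch of embeddings
latents = compress(embeddings)                # diffusion would train on these
noisy, eps = add_noise(latents, alpha_bar=0.5, rng=rng)
restored = expand(latents)                    # back to embedding width
```

The point of the sketch is only the shapes: the denoiser would operate on the 32-dimensional latents instead of the 768-dimensional embeddings, which is where the compute saving would come from under these assumed sizes.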
TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment
TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.
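As a loose illustration of the idea of masking a sparse code before reconstruction, the following NumPy sketch encodes an embedding with a ReLU sparse-autoencoder layer, zeroes out some latent units, and decodes the rest. It is not the TEVI method: the weights are random, the sizes are made up, and the mask is hard-coded, whereas the summary describes a learned masking module conditioned on the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a small embedding and an overcomplete dictionary.
EMBED_DIM, DICT_SIZE = 16, 64

# Stand-ins for trained sparse-autoencoder weights.
W_enc = rng.normal(size=(EMBED_DIM, DICT_SIZE))
W_dec = rng.normal(size=(DICT_SIZE, EMBED_DIM))


def sae_encode(x):
    """ReLU encoder: non-negative activations over dictionary features."""
    return np.maximum(x @ W_enc, 0.0)


def masked_reconstruct(x, mask):
    """Keep only the latent features selected by `mask`, then decode."""
    return (sae_encode(x) * mask) @ W_dec


image_emb = rng.normal(size=(EMBED_DIM,))

# Hypothetical mask; in a real system it would be predicted from the caption.
mask = np.zeros(DICT_SIZE)
mask[:8] = 1.0

edited_emb = masked_reconstruct(image_emb, mask)
```

With an all-ones mask this reduces to plain autoencoder reconstruction, and with an all-zeros mask the output is the zero vector, so the mask directly controls how much of the image embedding survives.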