Entity · benchmark

MS COCO

benchmark · active · ms-coco-cd79b25f · 3 events · first seen Jun 3, 2026

Aliases: MS COCO, MS-COCO

Co-occurring entities

Complex Social Behavior dataset, Evolution of Accuracy and Visual-Cognitive Errors in a Decade of Vision-Language AI Models, DOCCI, IIW, TEVI, Flickr30k, RoCOCO, CLIP, DINOv2, Yuan Gao, SiT, SigLIP 2, Jiatao Gu, CC12M, Apple, ImageNet, Feature Auto-Encoder

More like this (12)

COCO, COCOAI, RoCOCO, LoCoMo, nanocoai, YOCO, CoInCo, CO-LMLM, MoCA, SA-Co, OpenCoF, USACO

Recent events (3)

5 · arXiv · cs.AI · Jul 13, 2026

CSB dataset benchmarks a decade of VLM progress on complex social scene understanding

Researchers introduce the Complex Social Behavior (CSB) dataset, 100 images depicting complex social interactions, and use it to evaluate nine vision-language models spanning 2017–2025 against human descriptions and a gold standard. MLLMs have largely closed the accuracy gap with top-ranked human descriptions and nearly eliminated most error types (object detection, recognition, hallucination, scene understanding); spatial dependence errors are the notable remaining failure mode. The study also finds that MLLMs are as accurate on complex social scenes as on simple MS-COCO scenes, whereas pre-MLLM models showed a significant accuracy gap between the two.

Evaluation and Benchmarking · Multimodal Progress · Complex Social Behavior dataset · MS COCO · Evolution of Accuracy and Visual-Cognitive Errors in a Decade of Vision-Language AI Models

5 · arXiv · cs.CL · Jun 8, 2026

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking · Multimodal Progress · DOCCI · MS COCO · IIW · +4 more

5 · The Batch · Jun 3, 2026

Apple researchers propose Feature Auto-Encoder to speed diffusion training via compressed DINOv2 embeddings

Researchers at Apple introduced Feature Auto-Encoder (FAE), a latent diffusion image generator that compresses DINOv2 vision encoder embeddings before learning to denoise them, then expands them back for decoding. The approach achieves comparable image quality to state-of-the-art diffusion models while training roughly 7x faster on ImageNet class-conditional generation. The key insight is that shrinking semantically rich vision embeddings reduces compute during diffusion training without sacrificing the representational benefits of large pretrained encoders.

Training Infrastructure · Multimodal Progress · DINOv2 · Yuan Gao · MS COCO · +7 more