Almanac
5 · arXiv cs.AI (Artificial Intelligence) · 4d ago

ActiveSAM: Training-free open-vocabulary segmentation via image-conditional class pruning on SAM 3

ActiveSAM is a training-free, zero-shot inference framework that wraps Segment Anything Model 3 (SAM 3) to perform open-vocabulary semantic segmentation more efficiently. It first estimates an image-conditioned subset of active classes at low resolution, then runs full-resolution decoding only on the retained classes, using bucketed prompt multiplexing and margin-aware background calibration. Across eight benchmarks, it outperforms the prior state-of-the-art SegEarth-OV3 by about 1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. It is also robust to image corruption, which matters for autonomous driving and embodied AI.
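The two-stage flow described above (prune classes cheaply, then decode only the survivors in groups) can be sketched roughly as follows. This is an illustration only, not ActiveSAM's code: the function names, the score threshold, the bucket size, and the example scores are all hypothetical, and the summary does not say how the low-resolution scores or the buckets are actually computed.

```python
from typing import Dict, List


def select_active_classes(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep classes whose (hypothetical) low-resolution presence score reaches the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


def bucket_prompts(classes: List[str], bucket_size: int) -> List[List[str]]:
    """Split the retained classes into fixed-size groups, one group per decoder pass."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [classes[i:i + bucket_size] for i in range(0, len(classes), bucket_size)]


# Made-up low-resolution presence scores for one street-scene image.
low_res_scores = {
    "road": 0.91, "car": 0.84, "sky": 0.77,
    "boat": 0.03, "sofa": 0.01, "person": 0.42,
}

# Stage 1: prune the vocabulary for this image (threshold is an assumption).
active = select_active_classes(low_res_scores, threshold=0.3)

# Stage 2: group the surviving prompts so each full-resolution pass handles one bucket.
buckets = bucket_prompts(active, bucket_size=2)

print(active)   # ['road', 'car', 'sky', 'person']
print(buckets)  # [['road', 'car'], ['sky', 'person']]
```

In this toy case a six-class vocabulary needs two full-resolution passes instead of three, which is the kind of saving that grows with vocabulary size.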

Related events (8)

7 · Meta AI Blog · 1mo ago

SAM 3.1: Meta Releases Faster Real-Time Video Segmentation Model with Object Multiplexing

Meta has released SAM 3.1, an incremental update to Segment Anything Model 3, introducing object multiplexing that allows tracking up to 16 objects in a single forward pass. This doubles video processing throughput from 16 to 32 FPS on a single H100 GPU, reducing GPU resource requirements and enabling real-time tracking on smaller hardware. SAM 3.1 is a drop-in replacement for SAM 3 and is available via updated model checkpoints and codebase. The broader SAM 3 release also includes text and exemplar prompting, a new Segment Anything Playground, the SA-Co evaluation dataset, and SAM 3D for 3D reconstruction.
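As a back-of-the-envelope illustration of the multiplexing claim, the snippet below counts forward passes per frame when up to 16 objects share one pass. It is plain arithmetic under that single stated limit; the helper name is hypothetical, and the snippet does not model SAM 3.1's internals or predict frame rates.

```python
import math


def passes_per_frame(num_objects: int, objects_per_pass: int = 16) -> int:
    """Number of forward passes needed to cover all tracked objects in one frame."""
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    if objects_per_pass < 1:
        raise ValueError("objects_per_pass must be at least 1")
    return math.ceil(num_objects / objects_per_pass)


# 40 tracked objects: one object per pass versus 16 objects per pass.
unbatched = passes_per_frame(40, objects_per_pass=1)
multiplexed = passes_per_frame(40)

print(unbatched, multiplexed)  # 40 3
```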

6 · GitHub Trending · 29d ago

Meta SAM 3 (Segment Anything Model 3) Released on GitHub

Meta / Facebook Research has released SAM 3, the third generation of their Segment Anything Model, with code for inference and finetuning, pretrained model checkpoints, and example notebooks. The repository has accumulated over 10,000 stars with strong daily momentum (+93). SAM 3 continues Meta's open-weights tradition in computer vision foundation models. No accompanying paper or technical blog post is referenced in this item.

7 · Meta AI Blog · 1mo ago

Meta Introduces SAM Audio: Unified Multimodal Model for Audio Separation with PE-AV, Benchmark, and Judge Model

Meta has released SAM Audio, a unified multimodal audio separation model that accepts text, visual, and temporal span prompts to isolate sounds from complex audio mixtures. The system is powered by Perception Encoder Audiovisual (PE-AV), an extension of Meta's open-source Perception Encoder released earlier in 2025, and uses a flow-matching diffusion transformer architecture. Alongside the model, Meta is releasing SAM Audio-Bench (the first in-the-wild audio separation benchmark) and SAM Audio Judge (an automatic evaluation model for audio separation). All components are available today via the Segment Anything Playground.

4 · Hugging Face Blog · 1mo ago

Zero-shot image segmentation with CLIPSeg

This Hugging Face blog post introduces CLIPSeg, a model that performs zero-shot image segmentation by leveraging CLIP-based text and image prompts. The approach allows segmentation of arbitrary image regions without task-specific training, using natural language or reference images as queries. The post likely covers integration into the Hugging Face ecosystem and practical usage examples.

4 · arXiv cs.AI · 16d ago

BabyCL: Continual multimodal learning from egocentric child video in a single chronological pass

Researchers introduce BabyCL, a continual learning framework that processes the SAYCam egocentric child video dataset in a single chronological pass rather than shuffled multi-epoch training, more closely mimicking how children actually encounter their environment. The system combines streaming visual representation learning with image-text contrastive objectives, a multi-stage temporal segmentation, and a dual replay buffer managing visual and multimodal histories. BabyCL outperforms streaming baselines on the SAYCam Labeled-S 4AFC benchmark under matched compute budgets, substantially closing the gap to offline training upper bounds. The work advances understanding of whether neural networks can acquire word-referent mappings under biologically plausible training conditions.
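One way to picture a dual replay buffer is as two bounded queues, one for visual frames and one for image-text pairs, filled in chronological order and sampled from during training. The sketch below assumes first-in-first-out eviction and uniform sampling; the class name, capacities, and policies are illustrative guesses, not details taken from the BabyCL summary.

```python
import random
from collections import deque
from typing import Deque, List, Tuple


class DualReplayBuffer:
    """Two bounded histories: visual-only frames and (frame, utterance) pairs."""

    def __init__(self, visual_capacity: int, multimodal_capacity: int, seed: int = 0) -> None:
        # deque(maxlen=...) silently drops the oldest item once full (FIFO is an assumption).
        self.visual: Deque[str] = deque(maxlen=visual_capacity)
        self.multimodal: Deque[Tuple[str, str]] = deque(maxlen=multimodal_capacity)
        self._rng = random.Random(seed)

    def add_frame(self, frame_id: str) -> None:
        self.visual.append(frame_id)

    def add_pair(self, frame_id: str, utterance: str) -> None:
        self.multimodal.append((frame_id, utterance))

    def sample(self, n_visual: int, n_multimodal: int) -> Tuple[List[str], List[Tuple[str, str]]]:
        """Draw uniformly without replacement, capped at what each buffer holds."""
        frames = self._rng.sample(list(self.visual), min(n_visual, len(self.visual)))
        pairs = self._rng.sample(list(self.multimodal), min(n_multimodal, len(self.multimodal)))
        return frames, pairs


# Stream five frames in order; only some frames come with an utterance.
buffer = DualReplayBuffer(visual_capacity=3, multimodal_capacity=2)
for i in range(5):
    buffer.add_frame(f"f{i}")
    if i % 2 == 0:
        buffer.add_pair(f"f{i}", f"word{i}")

replayed_frames, replayed_pairs = buffer.sample(n_visual=2, n_multimodal=5)
print(list(buffer.visual))      # ['f2', 'f3', 'f4']
print(list(buffer.multimodal))  # [('f2', 'word2'), ('f4', 'word4')]
```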

6 · arXiv cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
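A token-level divergence between teacher and student can be illustrated with a per-token KL term averaged over a sequence. The summary does not say which divergence or which direction Vision-OPD uses, so KL(teacher || student) here is an assumption, and the logits are made-up numbers over a three-word vocabulary.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def token_level_divergence(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL(teacher || student) along one sampled sequence."""
    if len(teacher_logits) != len(student_logits) or not teacher_logits:
        raise ValueError("need two non-empty sequences of equal length")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Two tokens, three-word vocabulary; teacher sees the crop, student sees the full image.
teacher = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]]
student = [[1.0, 0.9, 0.4], [0.6, 0.8, 0.5]]

same = token_level_divergence(teacher, teacher)
gap = token_level_divergence(teacher, student)
print(same, gap > 0)  # 0.0 True
```

Driving this quantity toward zero on the student's own rollouts is the sense in which the full-image policy is pulled toward the crop-conditioned one.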

4 · Meta AI Blog · 1mo ago

USRA Applies SAM 2 Fine-Tuning for Real-Time Flood and River Monitoring

The Universities Space Research Association (USRA) and Meta are collaborating with the U.S. Geological Survey (USGS) to apply a fine-tuned version of SAM 2 for automated water segmentation in drone and satellite imagery, targeting real-time flood detection and river extent mapping. The fine-tuned model replaces a labor-intensive manual digitization workflow that was a key bottleneck in rapid-response image analysis. The system integrates with PlanetScope satellite imagery and USGS 3D Hydrography data, with case studies in the Chesapeake Bay area showing promise for nationwide deployment. The collaboration also anticipates leveraging the recently released SAM 3 for unified detection, segmentation, and tracking.

5 · arXiv cs.CL · 12d ago

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.