4 · GitHub Trending (AI/LLM filtered) · 15d ago

vllm-omni: framework for efficient inference with omni-modality models

The vllm-project organization has published vllm-omni, a Python framework that extends vLLM's inference capabilities to omni-modality models. The repository has about 4,956 GitHub stars and expands the vLLM ecosystem into multimodal inference serving.

Inference Economics Multimodal Progress vllm-project vllm-omni vLLM

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

3 · GitHub Trending · 1mo ago

vLLM: High-Throughput LLM Inference and Serving Engine Trending on GitHub

vLLM is an open-source Python library for high-throughput, memory-efficient inference and serving of large language models. The project has over 80,500 GitHub stars, with 98 new stars on the day it trended, indicating continued strong community interest. It is a widely adopted inference backend in the AI/ML ecosystem and supports PagedAttention along with other optimization techniques for LLM deployment.

Inference Economics Agent and Tool Ecosystem vllm-project vLLM

6 · arXiv · cs.CL · 22d ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking Alignment and RLHF LoMo LLaVA-OneVision-1.5-8B Qwen3-4B +3 more
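
The span-substitution idea in the LoMo summary above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `RenderedSpan`, `substitute_spans`, and `recover_text`, the random span selection, and the `ratio` and `span_len` parameters are hypothetical and are not taken from the paper. A real pipeline would rasterize each selected span into an image; the placeholder class below only records which text would be rendered.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderedSpan:
    """Hypothetical stand-in for an image of rendered text."""
    source_text: str


def substitute_spans(prompt, ratio=0.3, span_len=2, seed=0):
    """Turn a text prompt into an interleaved list of str and RenderedSpan.

    Each word starts a substituted span with probability `ratio`;
    a span covers up to `span_len` consecutive words.
    """
    rng = random.Random(seed)
    words = prompt.split()
    segments, text_buf = [], []
    i = 0
    while i < len(words):
        if rng.random() < ratio:
            # Flush pending plain text before the "image" segment.
            if text_buf:
                segments.append(" ".join(text_buf))
                text_buf = []
            segments.append(RenderedSpan(" ".join(words[i:i + span_len])))
            i += span_len
        else:
            text_buf.append(words[i])
            i += 1
    if text_buf:
        segments.append(" ".join(text_buf))
    return segments


def recover_text(segments):
    """Rebuild the original prompt, showing that no content was lost."""
    return " ".join(
        s.source_text if isinstance(s, RenderedSpan) else s for s in segments
    )
```

The round trip through `recover_text` reflects the point of the summary: the interleaved sequence carries the same content as the original prompt, only split across two modalities.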

6 · arXiv · cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Large Language Models Thinking-with-Images on-policy self-distillation +4 more
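
The token-level divergence mentioned in the Vision-OPD summary above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's objective: the summary does not specify the divergence or its direction, so the choice of KL(student || teacher), the averaging over tokens, and the function names are all hypothetical. The inputs are per-token logits that a full-image student and a crop-conditioned teacher would assign along the same rollout.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q)
    )


def token_level_divergence(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) along one rollout.

    Assumption: the direction of the divergence is chosen for illustration.
    """
    if not student_logits or len(student_logits) != len(teacher_logits):
        raise ValueError("rollouts must be non-empty and of equal length")
    per_token = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)
```

In training, a quantity like this would be minimized with respect to the student only, with the teacher's outputs treated as fixed targets.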

7 · Meta Llama · 11d ago

Meta releases Llama 3.2 90B Vision-Instruct multimodal model

Meta released Llama 3.2 90B Vision-Instruct on Hugging Face, a large multimodal model supporting image-text-to-text tasks. The model is part of the Llama 3.2 family and supports English and German. It has 858 downloads and 358 likes, and it represents Meta's open-weights push into vision-language capabilities at the 90B parameter scale.

Frontier Model Releases Open Weights Progress Hugging Face Meta Llama 3.2 90B Vision-Instruct +1 more

7 · Meta Llama · 11d ago

Meta releases Llama 3.2 90B Vision multimodal model on Hugging Face

Meta released Llama 3.2 90B Vision, a large multimodal model supporting image-text-to-text tasks, published on Hugging Face under the meta-llama organization. The model is part of the Llama 3.2 family and supports English, German, and French. This is a significant open-weights multimodal release from Meta, extending the Llama 3 series with vision capabilities at the 90B parameter scale.

Frontier Model Releases Open Weights Progress Llama 3.2 90B Vision Hugging Face Meta +1 more

5 · Hugging Face Blog · 1mo ago

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress Inference Economics Vision-Language Models Hugging Face +1 more

4 · Hugging Face Blog · 1mo ago

Get your VLM running in 3 simple steps on Intel CPUs

A Hugging Face blog post describes a three-step workflow for deploying vision-language models (VLMs) on Intel CPUs using OpenVINO. The post targets practitioners who want to run multimodal inference on CPU hardware without GPUs. It is relevant to edge inference and CPU-based deployment patterns for multimodal models.

Inference Economics Enterprise Deployment Patterns Vision-Language Models Hugging Face Intel +2 more

7 · Meta Llama · 11d ago

Meta releases Llama 3.2 11B Vision Instruct multimodal model

Meta released Llama 3.2 11B Vision Instruct on Hugging Face, an open-weights multimodal model supporting image-text-to-text tasks. The model is part of the Llama 3.2 family and supports English and German. With over 157K downloads and 1,600 likes, it has seen substantial community adoption.

Open Weights Progress Multimodal Progress Hugging Face Meta Llama 3.2 11B Vision Instruct