5 · arXiv cs.LG (Machine Learning) · 11d ago

Topo-Omni: Topographic multimodal model discovers functionally selective brain regions consistent with human neuroimaging

Researchers introduce Topo-Omni, a topographic multimodal model that jointly represents visual, auditory, and language/cognitive processing on a single contiguous in-silico cortical sheet. It is built by fine-tuning a pretrained foundation model with a spatial smoothness objective. The model develops functionally selective clusters consistent with human neuroimaging data, and driving or suppressing those clusters selectively biases or impairs perception in ways that parallel human intervention studies. The authors use the model to screen for novel cortical networks in silico and validate the discoveries, including natural landscape and animal networks, in human neuroimaging data. The work bridges deep learning architectures and computational neuroscience, offering testable hypotheses about cortical organization.
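
To make the idea of a spatial smoothness objective concrete, here is a minimal sketch. It is not the authors' formulation, which the summary does not specify. It assumes units arranged on a 2D grid with one scalar activation each, and it penalizes squared differences between horizontally and vertically adjacent units. The function name `smoothness_penalty` and the toy grids are illustrative assumptions.

```python
from typing import Sequence


def smoothness_penalty(sheet: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between 4-connected neighbours on a 2D grid.

    `sheet[r][c]` is the (hypothetical) activation of the unit at row r,
    column c. Lower values mean neighbouring units respond more alike.
    """
    rows = len(sheet)
    cols = len(sheet[0]) if rows else 0
    total = 0.0
    pairs = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # neighbour to the right
                total += (sheet[r][c] - sheet[r][c + 1]) ** 2
                pairs += 1
            if r + 1 < rows:  # neighbour below
                total += (sheet[r][c] - sheet[r + 1][c]) ** 2
                pairs += 1
    return total / pairs if pairs else 0.0


# Two toy sheets with the same number of active units:
# one where active units form a contiguous cluster, one where they alternate.
clustered = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
scattered = [[float((r + c) % 2) for c in range(4)] for r in range(4)]

print(smoothness_penalty(clustered), smoothness_penalty(scattered))
```

In a training setup, a term of this kind would typically be added to the task loss with a weighting coefficient, so that optimization favors layouts in which nearby units respond similarly. The clustered sheet scores lower than the alternating one, which is the pressure that would encourage contiguous, functionally selective regions.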

Multimodal Progress · Topo-Omni

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

5 · arXiv · cs.AI · 1mo ago

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

This paper introduces a framework for evaluating alignment between artificial vision models and the human visual cortex that goes beyond scalar prediction accuracy. Using repeated fMRI data from the Natural Scenes Dataset, the authors decompose brain response spaces into reproducible dimensions and measure which of these dimensions are recovered by model predictions. A key finding is that pretrained and randomly initialized models can achieve similar prediction accuracy while showing distinct recovery profiles, revealing that accuracy alone can mask fundamental model-brain mismatches. The framework also enables brain-to-brain comparisons as a diagnostic human reference baseline.

Evaluation and Benchmarking · Multimodal Progress · Natural Scenes Dataset · human visual cortex · target-space recovery profiles

4 · arXiv · cs.LG · 11d ago

Topological Neural Operators: operator learning on cell complexes via Discrete Exterior Calculus

Researchers introduce Topological Neural Operators (TNOs), a framework that extends neural operators from point/edge functions to general topological domains (cell complexes) using Discrete Exterior Calculus. The design decouples fixed topological information flow from learned transformations, enabling models that respect geometric structure and conservation laws. A hierarchical variant (HTNOs) adds learned coarse complexes for long-range propagation. TNOs subsume existing neural operators as a special case and show accuracy improvements on PDE benchmarks including irregular-geometry flow problems.

Evaluation and Benchmarking · Discrete Exterior Calculus · Topological Neural Operators

5 · OpenAI Blog · 1mo ago

Multimodal neurons in artificial neural networks

OpenAI researchers discovered neurons in CLIP that respond to the same concept across literal, symbolic, and conceptual representations. This finding parallels multimodal neurons previously observed in biological brains and helps explain CLIP's ability to classify unusual visual renditions of concepts. The work is presented as a step toward understanding the associations and biases learned by CLIP and similar vision-language models.

AI Safety Research · Multimodal Progress · OpenAI · multimodal neurons · CLIP

9 · OpenAI Blog · 1mo ago

Hello GPT-4o

OpenAI announces GPT-4o (Omni), a new flagship multimodal model capable of reasoning across audio, vision, and text in real time. The model represents a significant step toward natively multimodal AI, processing and generating across modalities without separate pipeline stages. It is positioned as OpenAI's primary production model going forward.

Frontier Model Releases · Inference Economics · GPT-4o · OpenAI · GPT-4

6 · arXiv · cs.CL · 2d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

Inference Economics · Agent and Tool Ecosystem · OmniAgent · Qwen2.5-VL-72B · LVBench

6 · Meta AI Blog · 1mo ago

Meta Introduces TRIBE v2: Predictive Foundation Model for Human Brain Activity

Meta AI has released TRIBE v2, a foundation model that predicts high-resolution fMRI brain activity in response to visual, auditory, and language stimuli. Trained on data from over 700 healthy volunteers, it achieves a 70x resolution increase over comparable models and supports zero-shot generalization to new subjects, languages, and tasks. The release includes model weights, codebase, a research paper, and an interactive demo under a CC BY-NC license. Meta positions the work as a bridge between neuroscience and AI development, enabling hypothesis testing without requiring human subjects in every experiment.

Frontier Model Releases · Multimodal Progress · Algonauts 2025 · Meta AI · CC BY-NC

4 · Qwen Research · 1mo ago

OFA: Towards Building a One-For-All Unified Multimodal Pretrained Model

Alibaba's Qwen team introduces OFA (One-For-All), a unified multimodal pretrained model designed to handle both understanding and generation tasks across multiple modalities within a single framework. The model is trained with instruction-based multitask pretraining, which gives it a diverse set of capabilities. This work was published in late 2022 as part of the broader wave of generalist multimodal models. It represents an early effort toward a single model architecture capable of spanning vision, language, and cross-modal tasks.

Frontier Model Releases · Multimodal Progress · Alibaba DAMO Academy · Qwen · OFA (One-For-All)

7 · Qwen Research · 1mo ago

Qwen2.5-Omni: Alibaba Releases End-to-End Multimodal Model with Real-Time Streaming

Alibaba's Qwen team releases Qwen2.5-Omni, a 7B-parameter end-to-end multimodal model that accepts text, images, audio, and video as input. The model streams its responses in real time as both text and synthesized natural speech. It is openly available on Hugging Face, ModelScope, DashScope, and GitHub, accompanied by a technical paper.

Frontier Model Releases · Open Weights Progress · Alibaba · Qwen2.5-Omni · Qwen