5 arXiv cs.LG (Machine Learning)·17h ago

BrainJanus: Unified autoregressive model for brain encoding/decoding across vision and language

BrainJanus is a unified model that integrates brain neural activity, vision, and language within a single autoregressive framework using next-token prediction. The system introduces a Unified Brain Tokenizer to quantize neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared embedding space. It supports any-to-any generation including image-to-brain, text-to-brain, brain-to-image, and brain-to-text tasks, with reported zero-shot generalization and interpretable biological topography. The work positions itself as a general-purpose brain modeling paradigm at the intersection of neuroscience and multimodal AI.

Multimodal Progress Tianjin University Haitao Wu BrainJanus
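The summary above says the Unified Brain Tokenizer quantizes neural dynamics into discrete tokens but gives no internals. As a rough illustration of what "quantizing into discrete tokens" can mean, here is a minimal sketch of generic nearest-neighbour vector quantization. It is not BrainJanus's tokenizer: the codebook values, the two-dimensional feature frames, and the function names `quantize` and `dequantize` are all hypothetical, chosen only to keep the example small.

```python
from math import dist

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        # Nearest neighbour under Euclidean distance; ties go to the lowest index.
        nearest = min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        tokens.append(nearest)
    return tokens

def dequantize(tokens, codebook):
    """Look up the codebook vector for each token (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]

# Hypothetical 4-entry codebook and three made-up feature frames.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = quantize(frames, codebook)
print(tokens)                        # [0, 3, 2]
print(dequantize(tokens, codebook))  # [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

Once signals are expressed as integer token ids like these, they can in principle share a vocabulary with text and image tokens and be modeled by next-token prediction. How the actual system learns its codebook and aligns it with vision and language is not stated in the summary.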

Related guides (1)

Multimodal Progress·Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

7 Meta AI Blog·29h ago

Meta releases Brain2Qwerty v2: non-invasive brain-to-text decoding at 61% word accuracy

Meta AI released Brain2Qwerty v2, an end-to-end deep learning pipeline that decodes text from non-invasive magnetoencephalography (MEG) brain recordings in real time, achieving 61% word accuracy, up from 8% for prior non-invasive methods and approaching the performance of surgical implants. The system was trained on ~22,000 sentences from nine participants and uses large language models fine-tuned on neural data to bridge noisy brain signals and coherent language. Meta is releasing full training code for both v1 and v2, and partner institution BCBL is releasing the v1 dataset. The work is part of Meta's broader Digital Brain Project and open neuroscience initiative.

Open Weights Progress NeuralBench Meta AI Basque Center on Cognition, Brain, and Language +4 more

6 arXiv · cs.CL·1mo ago

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

Evaluation and Benchmarking Agent and Tool Ecosystem functional token GRPO Latent-Anchored GRPO +4 more

6 Meta AI Blog·1mo ago

Meta Introduces TRIBE v2: Predictive Foundation Model for Human Brain Activity

Meta AI has released TRIBE v2, a foundation model that predicts high-resolution fMRI brain activity in response to visual, auditory, and language stimuli. Trained on data from over 700 healthy volunteers, it achieves a 70x resolution increase over comparable models and supports zero-shot generalization to new subjects, languages, and tasks. The release includes model weights, codebase, a research paper, and an interactive demo under a CC BY-NC license. Meta positions the work as a bridge between neuroscience and AI development, enabling hypothesis testing without requiring human subjects in every experiment.

Frontier Model Releases Multimodal Progress Algonauts 2025 Meta AI CC BY-NC +2 more

6 arXiv · cs.CL·14d ago

LOGOS: A unified autoregressive foundation model for natural science tasks across domains

Researchers introduce LOGOS (Language Of Generative Objects in Science), a generative language model that encodes heterogeneous scientific objects and spatial interactions as discrete token sequences within a single autoregressive framework, avoiding explicit coordinates or geometric neural networks. Models are trained at 1B, 3B, and 8B parameter scales and consistently match or outperform domain-specific baselines across diverse scientific tasks. The work argues that AI for Science should converge on shared architectures and training paradigms with LLMs rather than maintaining a separate technical stack. Model weights are released publicly.

Frontier Model Releases Open Weights Progress Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences LOGOS

5 OpenAI Blog·1mo ago

Multimodal neurons in artificial neural networks

OpenAI researchers discovered neurons in CLIP that respond to the same concept across literal, symbolic, and conceptual representations. This finding parallels multimodal neurons previously observed in biological brains and helps explain CLIP's ability to classify unusual visual renditions of concepts. The work is presented as a step toward understanding the associations and biases learned by CLIP and similar vision-language models.

AI Safety Research Multimodal Progress OpenAI multimodal neurons CLIP

6 arXiv · cs.CL·1mo ago

STORM: Internalized Spatial-Temporal Reasoning for Video-Language Models via Latent Trajectories

STORMS is a two-stage training framework that teaches large vision-language models to perform spatial-temporal video reasoning through bounded continuous latent trajectories rather than explicit textual chain-of-thought, keyframe selection, or external tool use. In Stage I, latent tokens are aligned with thought-video representations derived from generated videos; in Stage II, answer-only supervision internalizes the reasoning process. At inference time, no video regeneration or frame reinsertion is required, reducing latency and engineering complexity. Evaluations on VideoMME, MVBench, TempCompass, and MMVU show improved accuracy with substantially lower inference overhead versus tool-based pipelines.

Inference Economics Agent and Tool Ecosystem MVBench STORMS TempCompass +5 more

4 Hugging Face Blog·1mo ago

New ViT and ALIGN Models From Kakao Brain

Kakao Brain released new Vision Transformer (ViT) and ALIGN models, announced via the Hugging Face blog. The post covers multimodal vision-language models contributed to the open ecosystem. These models expand the available open-weights options for image-text tasks.

Open Weights Progress Multimodal Progress ViT (Vision Transformer) Hugging Face ALIGN +1 more

5 arXiv · cs.AI·1mo ago

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

This paper introduces a framework for evaluating alignment between artificial vision models and the human visual cortex that goes beyond scalar prediction accuracy. Using repeated fMRI data from the Natural Scenes Dataset, the authors decompose brain response spaces into reproducible dimensions and measure which of these dimensions are recovered by model predictions. A key finding is that pretrained and randomly initialized models can achieve similar prediction accuracy while showing distinct recovery profiles, revealing that accuracy alone can mask fundamental model-brain mismatches. The framework also enables brain-to-brain comparisons as a diagnostic human reference baseline.

Evaluation and Benchmarking Multimodal Progress Natural Scenes Dataset human visual cortex target-space recovery profiles +1 more