Latent World Recovery: multimodal learning framework for missing modalities in bioscience
A new arXiv preprint introduces Latent World Recovery (LWR), a framework for multimodal learning when some modalities are unavailable at training or inference time. LWR aligns modality-specific embeddings in a shared latent space and fuses only available modalities, avoiding explicit reconstruction of missing ones. The approach is evaluated on incomplete multi-omics benchmarks for cancer phenotype classification and survival prediction, demonstrating robustness under partial observation.
Related guides (1)
Related events (8)
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.
Looped World Models introduce iterative latent depth as a new scaling axis for world simulation
A new arXiv preprint introduces Looped World Models (LoopWM), a parameter-shared transformer architecture that iteratively refines latent environment states to achieve up to 100x parameter efficiency over conventional world models. The approach uses adaptive computation to scale depth dynamically per prediction step, addressing the tension between long-horizon simulation fidelity and deployment cost. The authors position iterative latent depth as a new scaling axis orthogonal to model size and training data.
Adversarial Subspace Alignment for Robust Multimodal Knowledge Editing in MLLMs
This paper addresses the generalization gap in multimodal large language model (MLLM) knowledge editing, where edits fail to propagate across semantically equivalent visual and linguistic variations. The authors introduce Latent Adversarial Robustification (LAR), which generates adversarial but semantically coherent variants in joint latent space, and Rank-Constrained Subspace Learning (RCSL), which enforces low-rank alignment of adversarial representations at the edit layer. Together these form the ASAM framework, which formalizes robustness via knowledge units grouping semantically equivalent multimodal inputs. Empirical analysis demonstrates improved generality without sacrificing reliability or locality.
MedRLM: Recursive multimodal agent framework for long-context clinical decision support
MedRLM is a proposed framework for clinical decision support that uses recursive multi-agent reasoning over heterogeneous patient data including EHRs, medical images, physiological sensor streams, and clinical guidelines. Rather than single-step prompting, it decomposes patient cases into an inspectable external environment coordinated by specialized agents, with a Clinical Evidence Graph Memory and sensor-triggered deeper reasoning. The paper outlines an evaluation design using public and credentialed clinical datasets spanning radiology, ECG, ICU time series, and referral outcomes. The work targets a gap between static medical QA benchmarks and real-world longitudinal clinical workflows.
Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading
Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.
ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs
Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.
Sleep paradigm for LLMs enables continual learning and memory consolidation via distillation and RL
A new arXiv preprint proposes a 'Sleep' paradigm for language models that enables continual learning by consolidating short-term in-context memories into long-term parameters. The framework has two stages: Knowledge Seeding (distilling a smaller model's memories into a larger network via on-policy distillation combined with RL-based imitation learning) and Dreaming (self-improvement via RL-generated synthetic curricula without human supervision). Experiments cover long-horizon tasks, continual learning, knowledge incorporation, and few-shot generalization, addressing a known weakness of current LLMs in retaining temporal knowledge across contexts.
VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading
This paper compares matched LLM and VLM pairs in a text-only setting to isolate the effect of multimodal training history on human-like language processing. Using whole-cortex fMRI and eye-tracking data from natural reading, the authors find that multimodal pretraining does not confer a uniform global advantage in human alignment. However, VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from both neural and behavioral measures. The findings suggest language-internal representations remain the primary driver of human text processing alignment.
