5arXiv cs.CL (Computation and Language)·18d ago

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

CRAM is a new method for Multimodal Continual Instruction Tuning (MCIT) that addresses the tension between catastrophic forgetting and parameter efficiency in MLLMs. It combines adaptive-rank instantiation to dynamically allocate parameters based on capability gaps, centroid-guided routing to reuse existing expert knowledge, and an orthogonality penalty to confine new updates to task-specific directions. The approach uses a Mixture-of-Experts architecture where task-specific patterns are isolated into independent modules, avoiding both the interference of shared updates and the parameter bloat of fully isolated expansion. Experiments across diverse benchmarks show consistent improvements over existing MCIT methods.

Enterprise Deployment Patterns Agent and Tool Ecosystem Multimodal Progress Multimodal Large Language Models CRAM centroid-guided routing Mixture of Experts Multimodal Continual Instruction Tuning adaptive-rank instantiation

Related guides (4)

Mixture of ExpertsConcept

Mixture of Experts: How AI Models Do More by Using Less

Read asBeginner In-depth

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Read asIn-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Related events (8)

5arXiv · cs.LG·18d ago·source ↗

ProtoAda: Prototype-Guided Adaptive Adapter Expansion for Multimodal Continual Instruction Tuning

ProtoAda is a new framework for Multimodal Continual Instruction Tuning (MCIT) that addresses a key failure mode in sparse Mixture-of-LoRA-Experts architectures: image-text similarity routing is format-blind and incorrectly merges tasks with similar semantics but different output structures (e.g., coordinate prediction vs. VQA). The method introduces format-aware task prototypes to guide both routing and adapter expansion, then consolidates compatible updates geometrically to reuse and refine existing parameters. Experiments across multiple benchmarks show improved performance, particularly on tasks whose answer formats are vulnerable to corruption by sequential fine-tuning.

Agent and Tool Ecosystem Alignment and RLHF Multimodal Large Language Models ProtoAda LoRA +4 more

4arXiv · cs.CL·25d ago·source ↗

Prism: Plug-in Infrastructure for Multimodal Continual Instruction Tuning Research

Prism is an open-source codebase designed to address engineering bottlenecks in Multimodal Continual Instruction Tuning (MCIT) research. It introduces a plugin registration mechanism that separates algorithmic development from backbone MLLM implementation, allowing new continual learning strategies to be integrated without modifying the underlying model codebase. This design aims to eliminate structural fragmentation across method-specific implementations and enable fair, reproducible comparisons at scale.

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Large Language Models Multimodal Continual Instruction Tuning instruction tuning +3 more

6arXiv · cs.CL·1mo ago·source ↗

AMARIS: Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS introduces a persistent evaluation memory system to improve rubric-based reward shaping in LLM fine-tuning via reinforcement learning. Unlike prior adaptive rubric methods that discard evaluation diagnostics after each step, AMARIS accumulates step-level summaries and retrieves relevant historical context via both static (recent steps) and dynamic (semantic similarity) retrieval to inform rubric updates. The system runs asynchronously alongside the RL training loop with approximately 5% time overhead. Experiments across closed and open-ended domains show consistent improvements over baselines, with ablations confirming that combining both retrieval modes yields the strongest results.

Evaluation and Benchmarking Agent and Tool Ecosystem semantic retrieval Reinforcement Learning from Human Feedback AMARIS +2 more

6arXiv · cs.CL·25d ago·source ↗

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.

Training Infrastructure Inference Economics MAGIC LLaVA-1.5-7B LLaVA-665K +5 more

6arXiv · cs.CL·4d ago·source ↗

Expert Tying reduces MoE LLM memory footprint by ~2x with minimal quality loss

Researchers introduce Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers while keeping routing and attention layer-independent. Evaluated on OLMoE, Qwen3, and DeepSeek-style MoE architectures, the method achieves nearly 2x memory reduction with negligible perplexity or downstream quality degradation. The approach exploits parameter redundancy in MoE pathways to improve the compute-to-memory trade-off for training and inference.

Training Infrastructure Frontier Model Releases DeepSeek V4 Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models Expert Tying +3 more

6Qwen Research·1mo ago·source ↗

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.

Training Infrastructure Frontier Model Releases Global-batch Load Balancing Alibaba Qwen +1 more

6arXiv · cs.CL·1mo ago·source ↗

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution Evaluation and Benchmarking LongMINT Retrieval-Augmented Generation long-context LLMs +2 more

4arXiv · cs.LG·12d ago·source ↗

SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs

Researchers introduce SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning), a framework addressing catastrophic forgetting in LLMs via adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.

Evaluation and Benchmarking Open Weights Progress LLaMA-7B Qwen3-4B Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning +1 more