4 · arXiv cs.CL (Computation and Language) · 25d ago

Prism: Plug-in Infrastructure for Multimodal Continual Instruction Tuning Research

Prism is an open-source codebase designed to address engineering bottlenecks in Multimodal Continual Instruction Tuning (MCIT) research. It introduces a plugin registration mechanism that separates algorithmic development from backbone MLLM implementation, allowing new continual learning strategies to be integrated without modifying the underlying model codebase. This design aims to eliminate structural fragmentation across method-specific implementations and enable fair, reproducible comparisons at scale.
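
To make the plugin-registration idea concrete, here is a minimal sketch of how a method could register itself without touching model code. The names `METHOD_REGISTRY`, `register_method`, `ContinualMethod`, `L2Anchor`, and `build_method` are hypothetical illustrations, not Prism's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a method name to the class implementing it.
METHOD_REGISTRY: Dict[str, type] = {}


def register_method(name: str) -> Callable[[type], type]:
    """Class decorator that records a continual-learning method under `name`."""
    def decorator(cls: type) -> type:
        if name in METHOD_REGISTRY:
            raise ValueError(f"method {name!r} is already registered")
        METHOD_REGISTRY[name] = cls
        return cls
    return decorator


class ContinualMethod:
    """Assumed hook interface that a trainer would call; the backbone never changes."""

    def remember(self, params: List[float]) -> None:
        """Called at the end of a task; default is a no-op."""

    def penalty(self, params: List[float]) -> float:
        """Extra loss term added to the task loss; default is none."""
        return 0.0


@register_method("l2_anchor")
class L2Anchor(ContinualMethod):
    """Toy regularizer: penalize drift from the parameters saved after the last task."""

    def __init__(self, strength: float = 1.0) -> None:
        self.strength = strength
        self.anchor: List[float] = []

    def remember(self, params: List[float]) -> None:
        self.anchor = list(params)

    def penalty(self, params: List[float]) -> float:
        if not self.anchor:
            return 0.0
        return self.strength * sum((p - a) ** 2 for p, a in zip(params, self.anchor))


def build_method(name: str, **kwargs) -> ContinualMethod:
    """Look up a method by name, as a config-driven trainer might."""
    return METHOD_REGISTRY[name](**kwargs)
```

Under this sketch, adding a new strategy means writing one decorated class and selecting it by name in a config, which is the kind of separation the summary describes.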

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Progress · Multimodal Large Language Models · Multimodal Continual Instruction Tuning · instruction tuning · LAMDA-CL · Prism

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 18d ago

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

CRAM is a new method for Multimodal Continual Instruction Tuning (MCIT) that addresses the tension between catastrophic forgetting and parameter efficiency in MLLMs. It combines adaptive-rank instantiation to dynamically allocate parameters based on capability gaps, centroid-guided routing to reuse existing expert knowledge, and an orthogonality penalty to confine new updates to task-specific directions. The approach uses a Mixture-of-Experts architecture where task-specific patterns are isolated into independent modules, avoiding both the interference of shared updates and the parameter bloat of fully isolated expansion. Experiments across diverse benchmarks show consistent improvements over existing MCIT methods.
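
As a rough illustration of two of the ingredients named above, the sketch below routes a feature to the nearest expert centroid and computes a penalty that vanishes when new update directions are orthogonal to old ones. Cosine similarity, the `threshold` value, and the squared-dot-product form of the penalty are assumptions for illustration; the summary does not give CRAM's actual formulation.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_to_expert(feature: Vector, centroids: List[Vector],
                    threshold: float = 0.8) -> Optional[int]:
    """Return the index of the most similar centroid, or None if no centroid
    reaches `threshold` (a signal that a new expert may be needed)."""
    best_idx: Optional[int] = None
    best_sim = threshold
    for i, centroid in enumerate(centroids):
        sim = cosine(feature, centroid)
        if sim >= best_sim:
            best_idx, best_sim = i, sim
    return best_idx


def orthogonality_penalty(new_dirs: List[Vector], old_dirs: List[Vector]) -> float:
    """Sum of squared dot products between new and old update directions.
    It is zero exactly when every new direction is orthogonal to every old one."""
    return sum(
        sum(x * y for x, y in zip(new, old)) ** 2
        for new in new_dirs
        for old in old_dirs
    )
```

A `None` result from `route_to_expert` is where an expansion step would plug in; how much capacity to allocate at that point is the part the summary attributes to adaptive-rank instantiation.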

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Multimodal Large Language Models · CRAM · centroid-guided routing · +4 more

5 · arXiv · cs.LG · 18d ago

ProtoAda: Prototype-Guided Adaptive Adapter Expansion for Multimodal Continual Instruction Tuning

ProtoAda is a new framework for Multimodal Continual Instruction Tuning (MCIT) that addresses a key failure mode in sparse Mixture-of-LoRA-Experts architectures: image-text similarity routing is format-blind and incorrectly merges tasks with similar semantics but different output structures (e.g., coordinate prediction vs. VQA). The method introduces format-aware task prototypes to guide both routing and adapter expansion, then consolidates compatible updates geometrically to reuse and refine existing parameters. Experiments across multiple benchmarks show improved performance, particularly on tasks whose answer formats are vulnerable to corruption by sequential fine-tuning.

Agent and Tool Ecosystem · Alignment and RLHF · Multimodal Large Language Models · ProtoAda · LoRA · +4 more

6 · arXiv · cs.AI · 1mo ago

torchtune: PyTorch Native Post-Training Library for LLMs

Meta's PyTorch team introduces torchtune, a PyTorch-native library for post-training LLMs that emphasizes modularity, hackability, and direct access to underlying PyTorch components. The library supports fine-tuning, experimentation, and deployment-oriented workflows across distributed training settings. Benchmarked against popular frameworks Axolotl and Unsloth, torchtune demonstrates competitive performance and memory efficiency while maintaining flexibility for research iteration. The paper presents design principles, model builders, training recipes, and distributed training stack details.

Training Infrastructure · Open Weights Progress · Unsloth · Axolotl · torchtune · +4 more

7 · arXiv · cs.CL · 17d ago

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.
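
To show what a judge-free, multi-component reward can look like, here is a small sketch that scores a rollout by comparing tool calls and final environment state against expected values. The two components, their equal weights, and the function name `programmatic_reward` are assumptions for illustration; the summary does not list PROVE's actual reward components.

```python
from typing import Any, Dict, List, Tuple

# A tool call is modeled as (tool_name, arguments); this shape is an assumption.
ToolCall = Tuple[str, Dict[str, Any]]


def programmatic_reward(
    calls: List[ToolCall],
    expected_calls: List[ToolCall],
    final_state: Dict[str, Any],
    expected_state: Dict[str, Any],
    weights: Tuple[float, float] = (0.5, 0.5),
) -> float:
    """Weighted sum of two checks that need no judge model."""
    # Component 1: fraction of expected calls reproduced at the same position.
    matched = sum(1 for got, want in zip(calls, expected_calls) if got == want)
    call_score = matched / len(expected_calls) if expected_calls else 1.0

    # Component 2: fraction of expected state entries present with the right value.
    hits = sum(1 for key, value in expected_state.items()
               if final_state.get(key) == value)
    state_score = hits / len(expected_state) if expected_state else 1.0

    return weights[0] * call_score + weights[1] * state_score
```

Because both checks are deterministic functions of the trajectory and the server state, the same rollout always receives the same reward, which is the main appeal of programmatic rewards over model-based judging.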

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen2.5-7B · GRPO · Qwen3-4B · +7 more

5 · arXiv · cs.CL · 17d ago

Visual instruction tuning aligns modalities in intermediate LLM layers, not early ones

A new arXiv paper investigates how visual instruction tuning embeds image features into the layer-wise hierarchy of LLM backbones across diverse vision-language architectures. Using probing analyses and causal interventions, the authors find that instruction tuning routes visual features into intermediate semantic layers, bypassing early unimodal-processing layers. They further show that fine-tuning restricted to these intermediate layers alone preserves full fine-tuning performance on vision-centric benchmarks while reducing training time, suggesting multimodal integration is a localized phenomenon.
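
Restricting fine-tuning to a band of layers can be sketched as a mask over parameter names. The `layers.<index>.` naming pattern and the example layer range are assumptions; the summary does not say which layers the authors treat as intermediate.

```python
import re
from typing import Dict, Iterable

# Assumes block parameters are named like "...layers.<index>.<rest>".
LAYER_PATTERN = re.compile(r"layers\.(\d+)\.")


def trainable_mask(param_names: Iterable[str],
                   first_layer: int, last_layer: int) -> Dict[str, bool]:
    """Mark a parameter trainable only if it belongs to a transformer block
    whose index lies in [first_layer, last_layer]; everything else is frozen."""
    mask: Dict[str, bool] = {}
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        mask[name] = bool(match) and first_layer <= int(match.group(1)) <= last_layer
    return mask
```

With a PyTorch model, the mask could be applied by setting `param.requires_grad = mask[name]` for each `(name, param)` pair from `model.named_parameters()`, so that the optimizer only updates the selected blocks.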

Alignment and RLHF · Multimodal Progress · Visual Instruction Tuning Aligns Modalities through Abstraction

6 · arXiv · cs.CL · 19d ago

PithTrain: A Compact and Agent-Native MoE Training System

PithTrain is a new MoE training framework designed around 'agent-native' principles, enabling AI coding agents to more efficiently understand, operate, and extend the framework. The authors introduce a new evaluation dimension called agent-task efficiency (ATE) and an accompanying benchmark ATE-Bench to measure the cost of using coding agents on training-framework tasks. PithTrain matches the throughput of production frameworks while achieving up to 62% fewer Agent Turns and 64% less Active GPU Time on ATE-Bench compared to existing systems.

Training Infrastructure · Frontier Model Releases · ATE-Bench · Mixture of Experts · PithTrain · +3 more

6 · arXiv · cs.AI · 10d ago

Piper: Programmable distributed training system decoupling parallelism strategy from runtime

Researchers present Piper, a distributed training system that separates parallelism strategy specification from low-level runtime execution via an intermediate representation (IR): a unified global training DAG. Users declare strategies through model annotations and scheduling directives, which Piper compiles into per-device execution plans. The system matches performance on standard strategies like ZeRO while enabling additional gains through joint compute-communication scheduling in composed strategies such as DeepSeek-V3's DualPipe.
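
The step from a global DAG to per-device plans can be illustrated with a toy compiler: order the operations topologically, then split the ordered list by device placement. The function name `compile_plans` and the dict-based graph and placement formats are hypothetical; Piper's real IR and directives are not described in the summary.

```python
from collections import defaultdict
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def compile_plans(deps: Dict[str, Set[str]],
                  placement: Dict[str, int]) -> Dict[int, List[str]]:
    """Turn a global op DAG into per-device op lists.

    `deps` maps each op to the set of ops it depends on.
    `placement` maps each op to the device that runs it.
    """
    # A single global topological order guarantees every device sees its ops
    # in an order consistent with cross-device dependencies.
    order = TopologicalSorter(deps).static_order()
    plans: Dict[int, List[str]] = defaultdict(list)
    for op in order:
        plans[placement[op]].append(op)
    return dict(plans)


# Two-stage pipeline: forward on device 0 then 1, backward on device 1 then 0.
deps = {"fwd0": set(), "fwd1": {"fwd0"}, "bwd1": {"fwd1"}, "bwd0": {"bwd1"}}
placement = {"fwd0": 0, "fwd1": 1, "bwd1": 1, "bwd0": 0}
plans = compile_plans(deps, placement)
```

A real system would also insert communication ops at device boundaries and choose among valid orders to overlap compute with communication; this sketch only shows the separation between the declared graph and the derived per-device schedule.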

Training Infrastructure · Frontier Model Releases · DeepSeek V4 · Piper · DualPipe · +1 more

8 · Mistral AI News · 1mo ago

Mistral Small 4: Unified Multimodal, Reasoning, and Coding MoE Model Released Under Apache 2.0

Mistral AI has released Mistral Small 4, a 119B-parameter Mixture-of-Experts model (6B active per token) that unifies capabilities previously split across Magistral (reasoning), Pixtral (multimodal), and Devstral (coding agents) into a single open-weights model. The model features a 256k context window, configurable reasoning effort via a `reasoning_effort` parameter, native text and image input support, and is released under Apache 2.0. Mistral claims 40% latency reduction and 3x throughput improvement over Mistral Small 3, with benchmark results showing competitive performance against GPT-OSS 120B and Qwen models while producing significantly shorter outputs. The release includes day-0 availability as an NVIDIA NIM and support across vLLM, llama.cpp, SGLang, and Transformers.

Long Context Evolution · Frontier Model Releases · Mistral AI · Mistral Small 4 · Pixtral · +14 more