5 · arXiv cs.AI (Artificial Intelligence) · 26d ago

Adversarial Subspace Alignment for Robust Multimodal Knowledge Editing in MLLMs

This paper addresses the generalization gap in multimodal large language model (MLLM) knowledge editing, where edits fail to propagate across semantically equivalent visual and linguistic variations. The authors introduce Latent Adversarial Robustification (LAR), which generates adversarial but semantically coherent variants in the joint latent space, and Rank-Constrained Subspace Learning (RCSL), which enforces low-rank alignment of adversarial representations at the edit layer. Together these form the ASAM framework, which formalizes robustness in terms of knowledge units, each grouping semantically equivalent multimodal inputs. Empirical analysis demonstrates improved generality without sacrificing reliability or locality.
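The three criteria named in the summary can be made concrete with a small sketch. It assumes the usual knowledge-editing definitions: reliability means the edited input itself yields the new answer, generality means equivalent variants in the same knowledge unit do too, and locality means unrelated inputs keep their pre-edit answers. The `KnowledgeUnit` class, the `(image id, prompt)` input shape, and the lookup-table "models" are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Input = Tuple[str, str]          # (image identifier, text prompt): hypothetical shape
Model = Callable[[Input], str]   # a model is anything that maps an input to an answer


@dataclass
class KnowledgeUnit:
    """Hypothetical container: one edit plus its semantically equivalent variants."""
    edit_input: Input
    target: str
    variants: List[Input] = field(default_factory=list)


def reliability(model: Model, unit: KnowledgeUnit) -> float:
    """1.0 if the edited input itself produces the target answer, else 0.0."""
    return float(model(unit.edit_input) == unit.target)


def generality(model: Model, unit: KnowledgeUnit) -> float:
    """Fraction of equivalent variants that also produce the target answer."""
    if not unit.variants:
        return 0.0
    hits = sum(model(v) == unit.target for v in unit.variants)
    return hits / len(unit.variants)


def locality(pre: Model, post: Model, unrelated: List[Input]) -> float:
    """Fraction of unrelated inputs whose answer is unchanged by the edit."""
    if not unrelated:
        return 1.0
    same = sum(pre(x) == post(x) for x in unrelated)
    return same / len(unrelated)


def make_model(table: Dict[Input, str]) -> Model:
    """Toy stand-in for an MLLM: a lookup table with a default answer."""
    return lambda x: table.get(x, "unknown")


unit = KnowledgeUnit(
    edit_input=("img_a", "Who painted this?"),
    target="Vermeer",
    variants=[
        ("img_a_crop", "Who painted this?"),          # visual variation
        ("img_a", "Name the artist of this painting."),  # linguistic variation
    ],
)
unrelated = [("img_b", "What animal is shown?")]

pre_answers = {
    unit.edit_input: "Rembrandt",
    unit.variants[0]: "Rembrandt",
    unit.variants[1]: "Rembrandt",
    unrelated[0]: "cat",
}
# The toy edit reaches the edited input and the cropped image,
# but not the paraphrased prompt: a generalization gap.
post_answers = dict(pre_answers)
post_answers[unit.edit_input] = "Vermeer"
post_answers[unit.variants[0]] = "Vermeer"

pre_model = make_model(pre_answers)
post_model = make_model(post_answers)

scores = {
    "reliability": reliability(post_model, unit),
    "generality": generality(post_model, unit),
    "locality": locality(pre_model, post_model, unrelated),
}
```

In this toy setup the edit is fully reliable and local but only half general, which is the kind of gap the summary describes.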

Alignment and RLHF · Multimodal Progress · Multimodal Large Language Models · Latent Adversarial Robustification (LAR) · knowledge editing · ASAM · Rank-Constrained Subspace Learning (RCSL)

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

6 · arXiv · cs.CL · 17d ago

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking · AI Safety Research · Qwen3-4B · Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models · +1 more

5 · arXiv · cs.CL · 17d ago

Knowledge editing via locate-then-edit transferred to masked diffusion language models, revealing multi-token failure mode

A new arXiv paper investigates whether locate-then-edit knowledge editing methods, developed for autoregressive models, transfer to masked diffusion language models (MDMs) such as LLaDA and Dream. The authors find that causal tracing identifies the same early-to-mid-layer MLP location in both paradigms, but MDMs degrade systematically on multi-token edits due to partially unmasked intermediate states that the edit was never optimized for. A correction targeting these intermediate states substantially restores multi-token editing performance. The work is the first systematic comparison of knowledge editing across autoregressive and diffusion-based language model paradigms.

Evaluation and Benchmarking · Open Weights Progress · Knowledge Editing in Masked Diffusion Language Models · Qwen · Llama · +2 more

4 · arXiv · cs.LG · 12d ago

SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs

Researchers introduce SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning), a framework addressing catastrophic forgetting in LLMs via adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.

Evaluation and Benchmarking · Open Weights Progress · LLaMA-7B · Qwen3-4B · Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning · +1 more

4 · arXiv · cs.LG · 5d ago

Dual-adapter routing system improves knowledge editing precision in LLMs

A new arXiv paper introduces a route-specialized dual-adapter architecture for knowledge editing in LLMs, separating the concerns of writing edits (edit adapter) and suppressing them when irrelevant (locality adapter). A relevance router gates which adapter is applied, addressing the locality problem in memory-assisted editing. Evaluated on CounterFact, zsRE, and MQuAKE benchmarks using Llama-3.1-8B-Instruct and Qwen3-8B, the method achieves best-in-class probability-preference accuracy across all three datasets. Ablations show the gain comes from the architectural separation rather than increased parameter capacity.
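The routing idea in this summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's architecture: the router is assumed to be a cosine-similarity threshold against stored edit keys, and `edit_adapter`, `locality_adapter`, `route`, `forward`, and the `0.8` threshold are all hypothetical stand-ins for learned components.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query: Vector, edit_keys: List[Vector], threshold: float = 0.8) -> str:
    """Hypothetical relevance router: pick 'edit' only if the query is
    close enough to at least one stored edit key."""
    best = max((cosine(query, key) for key in edit_keys), default=0.0)
    return "edit" if best >= threshold else "locality"


def edit_adapter(hidden: Vector) -> Vector:
    """Stand-in for a learned adapter that writes the edit (here: a fixed shift)."""
    return [x + 1.0 for x in hidden]


def locality_adapter(hidden: Vector) -> Vector:
    """Stand-in for a learned adapter that suppresses the edit
    (here: it simply leaves the hidden state unchanged)."""
    return list(hidden)


def forward(hidden: Vector, query: Vector, edit_keys: List[Vector],
            threshold: float = 0.8) -> Vector:
    """Apply exactly one adapter, chosen by the router."""
    adapters = {"edit": edit_adapter, "locality": locality_adapter}
    return adapters[route(query, edit_keys, threshold)](hidden)


edit_keys = [[1.0, 0.0]]             # one stored edit
relevant_query = [0.9, 0.1]          # close to the stored edit key
irrelevant_query = [0.0, 1.0]        # orthogonal to it

edited = forward([0.5, 0.5], relevant_query, edit_keys)
untouched = forward([0.5, 0.5], irrelevant_query, edit_keys)
```

Keeping the two adapters separate means the path that writes an edit and the path that protects unrelated inputs never share parameters, which is the separation of concerns the summary credits for the gain.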

Evaluation and Benchmarking · Alignment and RLHF · BGE · Llama3-8B-Instruct · Qwen3-4B · +4 more

5 · arXiv · cs.CL · 4d ago

ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs

Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.

Agent and Tool Ecosystem · Alignment and RLHF · GRPO · ContextRL · +1 more

6 · arXiv · cs.AI · 26d ago

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps, one language-side (mapping abstract questions to visual edits) and one generation-side (edit quality degrading with reasoning depth), via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

Agent and Tool Ecosystem · Alignment and RLHF · Reasoning Enhancement · Qwen3-4B · ETCHR · +5 more

4 · arXiv · cs.CL · 25d ago

WhoSaidIt: Human-LLM Collaborative Annotation for Multilingual Speaker-Attribute Classification

This paper proposes a human-LLM collaborative re-annotation framework for stabilizing noisy multilingual speaker-attribute labels under resource constraints. LLMs surface recurring annotation rationales through iterative expert interaction, combined with disagreement-focused sampling for targeted re-annotation. The resulting WhoSaidIt dataset covers nine speaker-attribute labels across multiple languages. Benchmarking of recent LLMs reveals substantial cross-lingual annotation divergence and highlights both capabilities and limitations of LLMs in this classification task.

Evaluation and Benchmarking · Agent and Tool Ecosystem · human-LLM collaborative annotation · speaker-attribute classification · WhoSaidIt · +1 more

6 · arXiv · cs.AI · 18d ago

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

Evaluation and Benchmarking · Alignment and RLHF · Perceptually Perturbed Judgment Dataset · Multimodal Large Language Models · GRPO · +3 more