6 · arXiv cs.LG (Machine Learning) · 19h ago

LACUNA testbed introduces ground-truth parameter-level evaluation for LLM unlearning

Researchers introduce LACUNA, the first unlearning testbed with ground-truth parameter-level localization, designed to evaluate whether LLM unlearning methods truly erase knowledge from model weights or merely suppress it at the output level. The testbed injects the PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, which enables direct measurement of localization precision. Benchmarking current state-of-the-art unlearning methods reveals that they are highly imprecise and vulnerable to resurfacing attacks despite strong output-level performance, whereas successful localization lets even simple gradient-based methods achieve robust erasure. The work addresses a critical gap in unlearning evaluation methodology that is relevant to privacy compliance and AI safety.
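
The two ideas in the summary above, restricting training updates to predefined parameters and scoring how precisely an unlearning method hits them, can be illustrated with a toy sketch. This is not the LACUNA implementation: the function names `masked_sgd_step` and `localization_scores`, the six-parameter "model", and the precision/recall scoring are all assumptions made for illustration.

```python
def masked_sgd_step(params, grads, mask, lr=0.5):
    """Apply one SGD step only where mask is True; other parameters stay frozen."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


def localization_scores(changed, target):
    """Precision and recall of the changed parameter indices against the target set."""
    changed, target = set(changed), set(target)
    hit = len(changed & target)
    precision = hit / len(changed) if changed else 0.0
    recall = hit / len(target) if target else 0.0
    return precision, recall


# Toy "model" with six scalar parameters (hypothetical values).
params = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
grads = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Assumed injection sites: only indices 1 and 4 may be updated.
mask = [False, True, False, False, True, False]

# Masked update: knowledge is written only into the predefined parameters.
injected = masked_sgd_step(params, grads, mask)
target = [i for i, m in enumerate(mask) if m]

# A hypothetical unlearning method that edited indices 1, 2, and 3.
changed = [1, 2, 3]
precision, recall = localization_scores(changed, target)
```

Because the injection sites are known in advance, any set of parameters an unlearning method modifies can be compared against them directly, which is what a ground-truth testbed makes possible.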

Evaluation and Benchmarking · AI Safety Research · OLMo · LACUNA · Allen Institute for AI

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Principles to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

5 · arXiv · cs.AI · Jun 18, 2026

MAST: Mechanism-guided selective unlearning for RLVR-trained reasoning models

Researchers introduce MAST (Mechanism-Aligned Selective Targeting), a method for selectively unlearning capabilities induced by reinforcement learning from verifiable rewards (RLVR) in language models while minimizing collateral damage to retained knowledge. The approach ranks attention-projection tensors by off-principal energy and gradient coupling to identify a targeted subset for update, rather than applying full-parameter gradient ascent. Evaluated on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, MAST achieves statistically significant forgetting on target MATH problems while preserving GSM8K performance, whereas full-parameter unlearning collapses retained capabilities. The method generalizes across seeds and unlearning objectives (NPO/SimNPO).

AI Safety Research · Alignment and RLHF · Qwen3-1.7B-Base · MATH · MAST · +2 more

6 · arXiv · cs.CL · Jun 3, 2026

Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer

Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric called Cross Activation Shift Distance to quantify the relationship between different unlearning interventions. The finding opens a potential defensive strategy where defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.

AI Safety Research · Alignment and RLHF · Cross Activation Shift Distance · Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

5 · arXiv · cs.CL · Jun 5, 2026

ATWU: Token-level importance learning improves LLM unlearning via retain-conflict criterion

This paper introduces Alternating Token-Weighted Unlearning (ATWU), a framework that learns which tokens in a forget sample are most relevant to unlearning by characterizing their conflict with the retain objective. Rather than relying on auxiliary models or heuristics, ATWU jointly learns token forget-specificity and model parameters using a lightweight linear scorer over hidden states. Evaluated on TOFU and RWKU benchmarks, ATWU achieves state-of-the-art forget-retain trade-offs and produces token-level scores that align with ground-truth forget-specific spans.

Evaluation and Benchmarking · AI Safety Research · RWKU · Alternating Token-Weighted Unlearning · TOFU

6 · arXiv · cs.CL · Jun 23, 2026

Uncertainty-Based Decontamination (UBD) framework for removing benchmark contamination from LLMs

Researchers propose Uncertainty-Based Decontamination (UBD), a method that uses deep ensembles of a contaminated model to estimate per-sample memorization and correct for benchmark data contamination without requiring access to an uncontaminated reference model. The approach introduces a sample-level evaluation framework using distributional distance metrics alongside aggregate accuracy to better characterize decontamination quality. Experiments on MMLU-Pro and MATH-MCQA show UBD produces output distributions closer to uncontaminated baselines than paraphrasing or choice-permutation methods. The work addresses a significant validity concern in LLM evaluation, where contamination inflates reported benchmark performance.

Evaluation and Benchmarking · AI Safety Research · Uncertainty-based Debiasing and Unlearning for Decontamination · MATH-MCQA · MMLU-Pro

6 · arXiv · cs.CL · Jun 9, 2026

Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs

Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.

Evaluation and Benchmarking · AI Safety Research · Clinically Grounded Privacy Evaluation of Medical LMs · +1 more

4 · arXiv · cs.CL · Jun 1, 2026

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking · Open Weights Progress · Ollama · GPT-5.2 · DeepL · +8 more

6 · arXiv · cs.AI · 2d ago

Reinforcement Learning with Metacognitive Feedback (RLMF) improves LLM calibration and uncertainty expression

Researchers introduce Reinforcement Learning with Metacognitive Feedback (RLMF), a training paradigm that refines preference optimization using a model's self-judgments of its own performance quality. The method is applied to faithful calibration — aligning a model's expressed confidence with its intrinsic uncertainty — and achieves state-of-the-art results across diverse tasks while outperforming standard RL by up to 63%. A companion technique, metacognitive data selection, uses similar self-judgments to identify high-value training examples, outperforming naive active learning baselines. The work positions metacognitive performance as a novel and effective RL signal for improving LLM reliability and alignment.

Evaluation and Benchmarking · AI Safety Research · Reinforcement Learning with Metacognitive Feedback · metacognitive data selection · Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs · +1 more

4 · arXiv · cs.CL · Jun 15, 2026

LoSoNA benchmark evaluates LLM adaptation to implicit local social norms in group chats

Researchers introduce LoSoNA, a benchmark for testing whether LLM-based agents can infer and adapt to unstated local conversational norms in multi-party chat scenarios. Each scenario presents a group-chat transcript where non-subject participants implicitly demonstrate a hidden norm, followed by an elicitor turn. Eight frontier and open-weight models are evaluated under four prompting conditions; naive prompting performs poorly for most models, while explicit norm-aware prompting yields uneven gains—Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%. The work contributes to growing interest in evaluating LLM social and pragmatic capabilities beyond factual or reasoning tasks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.1 Pro · Claude Fable 5 · LoSoNA
