Researchers introduce LACUNA, the first unlearning testbed with ground-truth parameter-level localization, designed to evaluate whether LLM unlearning methods truly erase knowledge from model weights or merely suppress it at the output level. The testbed injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct measurement of localization precision. Benchmarking current SOTA unlearning methods reveals they are highly imprecise and vulnerable to resurfacing attacks despite strong output-level performance, while successful localization enables even simple gradient-based methods to achieve robust erasure. The work addresses a critical gap in unlearning evaluation methodology relevant to privacy compliance and AI safety.
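The two ideas in this summary, confining updates to predefined parameters and scoring how precisely a method localizes them, can be illustrated with a toy sketch. It assumes a flat list of scalar parameters and a binary mask; the function names `masked_sgd_step` and `localization_precision` are hypothetical and are not taken from the LACUNA testbed itself.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    """Apply one SGD step only where mask is 1; masked-out parameters stay frozen."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


def localization_precision(predicted, ground_truth):
    """Fraction of the parameter indices a method flags that truly hold injected knowledge."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted:
        return 0.0
    return len(predicted & ground_truth) / len(predicted)


# Toy setup (assumed values): four parameters, injection allowed only at indices 1 and 3.
params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]
mask = [0, 1, 0, 1]

updated = masked_sgd_step(params, grads, mask)

# Ground truth: the indices that actually changed during the masked update.
changed = [i for i, (old, new) in enumerate(zip(params, updated)) if old != new]

# A hypothetical unlearning method that flags indices 1 and 2 gets only one of them right.
precision = localization_precision([1, 2], changed)
```

Because the mask fixes in advance where the injected information can live, the set of changed indices serves as ground truth against which any method's chosen parameters can be scored.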
Researchers introduce MAST (Mechanism-Aligned Selective Targeting), a method for selectively unlearning capabilities induced by reinforcement learning from verifiable rewards (RLVR) in language models while minimizing collateral damage to retained knowledge. The approach ranks attention-projection tensors by off-principal energy and gradient coupling to identify a targeted subset for update, rather than applying full-parameter gradient ascent. Evaluated on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, MAST achieves statistically significant forgetting on target MATH problems while preserving GSM8K performance, whereas full-parameter unlearning collapses retained capabilities. The method generalizes across seeds and unlearning objectives (NPO/SimNPO).
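A rough sketch of the selection step might look like the following. It assumes the two per-tensor scores have already been computed and combines them by simple multiplication; that combination rule, the tensor names, and the function `select_tensors` are illustrative assumptions, not the ranking MAST actually uses.

```python
def select_tensors(scores, k):
    """Return the k tensor names with the highest combined score.

    scores maps a tensor name to (off_principal_energy, gradient_coupling).
    Multiplying the two scores is an assumption made for illustration only.
    """
    combined = {name: energy * coupling for name, (energy, coupling) in scores.items()}
    return sorted(combined, key=combined.get, reverse=True)[:k]


# Hypothetical scores for four attention-projection tensors.
scores = {
    "layer0.q_proj": (0.9, 0.8),
    "layer0.k_proj": (0.2, 0.9),
    "layer1.v_proj": (0.7, 0.1),
    "layer1.o_proj": (0.6, 0.7),
}

# Only the selected tensors would receive unlearning updates; the rest stay frozen.
selected = select_tensors(scores, k=2)
```

Restricting updates to a small ranked subset is what distinguishes this style of approach from full-parameter gradient ascent, which touches every tensor.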
Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric called Cross Activation Shift Distance to quantify the relationship between different unlearning interventions. The finding opens a potential defensive strategy where defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.
This paper introduces Alternating Token-Weighted Unlearning (ATWU), a framework that learns which tokens in a forget sample are most relevant to unlearning by characterizing their conflict with the retain objective. Rather than relying on auxiliary models or heuristics, ATWU jointly learns token forget-specificity and model parameters using a lightweight linear scorer over hidden states. Evaluated on TOFU and RWKU benchmarks, ATWU achieves state-of-the-art forget-retain trade-offs and produces token-level scores that align with ground-truth forget-specific spans.
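To make the token-weighting idea concrete, here is a minimal sketch of a linear scorer over hidden states feeding a weighted forget loss. The sigmoid squashing, the normalization by the weight sum, and the names `token_weights` and `weighted_forget_loss` are assumptions for illustration; the alternating optimization that ATWU uses to learn the scorer is not shown.

```python
import math


def token_weights(hidden_states, w, b=0.0):
    """Score each token's hidden state with a linear layer, then squash to (0, 1)."""
    weights = []
    for h in hidden_states:
        logit = sum(h_i * w_i for h_i, w_i in zip(h, w)) + b
        weights.append(1.0 / (1.0 + math.exp(-logit)))
    return weights


def weighted_forget_loss(token_losses, weights):
    """Weighted average of per-token losses, so high-weight tokens dominate the objective."""
    return sum(l * wt for l, wt in zip(token_losses, weights)) / sum(weights)


# Two tokens with 2-dimensional hidden states (toy values).
hidden_states = [[0.0, 0.0], [4.0, 0.0]]
scorer_w = [1.0, 0.0]          # hypothetical learned scorer weights
token_losses = [1.0, 3.0]      # hypothetical per-token forget losses

weights = token_weights(hidden_states, scorer_w)
loss = weighted_forget_loss(token_losses, weights)
```

In this toy case the second token receives a weight close to 1 and the first a weight of 0.5, so the aggregate loss is pulled toward the second token's loss.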
Researchers propose Uncertainty-Based Decontamination (UBD), a method that uses deep ensembles of a contaminated model to estimate per-sample memorization and correct for benchmark data contamination without requiring access to an uncontaminated reference model. The approach introduces a sample-level evaluation framework using distributional distance metrics alongside aggregate accuracy to better characterize decontamination quality. Experiments on MMLU-Pro and MATH-MCQA show UBD produces output distributions closer to uncontaminated baselines than paraphrasing or choice-permutation methods. The work addresses a significant validity concern in LLM evaluation, where contamination inflates reported benchmark performance.
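A sample-level comparison of output distributions could be sketched as below. The summary does not say which distance UBD uses, so Jensen-Shannon divergence is an assumed choice here, as is averaging ensemble members to obtain a single distribution; the function names and probability values are hypothetical.

```python
import math


def ensemble_mean(member_distributions):
    """Average the per-choice probabilities predicted by each ensemble member."""
    n = len(member_distributions)
    return [sum(col) / n for col in zip(*member_distributions)]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by log(2) in nats."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical answer-choice distributions for one multiple-choice item.
members = [[0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]]
corrected = ensemble_mean(members)
clean_reference = [0.4, 0.3, 0.2, 0.1]

distance = js_divergence(corrected, clean_reference)
```

A per-item distance of this kind complements aggregate accuracy: two models can reach the same accuracy while distributing probability over the answer choices very differently.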
Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.
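For readers unfamiliar with the AUROC figures quoted above, the metric can be computed directly from its pairwise definition: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. The sketch below uses made-up scores and labels, not data from the study, and the function name `auroc` is illustrative.

```python
def auroc(scores, labels):
    """AUROC from its pairwise definition; ties between a positive and a negative count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical attack scores (higher = attacker believes the diagnosis is present)
# and ground-truth labels (1 = patient actually has the diagnosis).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]

result = auroc(scores, labels)
```

A value of 0.5 corresponds to chance-level guessing and 1.0 to perfect separation, which gives a sense of scale for the 0.91 and 0.81 reported in the summary.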
This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.
Researchers introduce Reinforcement Learning with Metacognitive Feedback (RLMF), a training paradigm that refines preference optimization using a model's self-judgments of its own performance quality. The method is applied to faithful calibration (aligning a model's expressed confidence with its intrinsic uncertainty) and achieves state-of-the-art results across diverse tasks while outperforming standard RL by up to 63%. A companion technique, metacognitive data selection, uses similar self-judgments to identify high-value training examples, outperforming naive active learning baselines. The work positions metacognitive performance as a novel and effective RL signal for improving LLM reliability and alignment.
Researchers introduce LoSoNA, a benchmark for testing whether LLM-based agents can infer and adapt to unstated local conversational norms in multi-party chat scenarios. Each scenario presents a group-chat transcript where non-subject participants implicitly demonstrate a hidden norm, followed by an elicitor turn. Eight frontier and open-weight models are evaluated under four prompting conditions; naive prompting performs poorly for most models, while explicit norm-aware prompting yields uneven gains: Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%. The work contributes to growing interest in evaluating LLM social and pragmatic capabilities beyond factual or reasoning tasks.