5 · arXiv cs.CL (Computation and Language) · 4d ago

Contrastive-Difference CKA reveals concept-specific structural alignment across LLM architectures

Researchers introduce CKA_Delta (contrastive-difference CKA), a training-free diagnostic that isolates concept-specific representational convergence from generic similarity across LLM architectures. The method reveals a geometric-functional universality dissociation: moderate geometric alignment coexists with near-perfect functional transfer across six concept domains and multiple architectural families. CKA_Delta also functions as an architectural outlier detector, flagging Gemma as a notable outlier (d=1.08, AUC=0.79). The work provides a practical tool for cross-architecture concept monitoring without requiring model training.
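As a rough illustration of the idea, the sketch below computes standard linear CKA on contrastive differences, that is, activations for concept-present inputs minus activations for matched concept-absent inputs. The summary does not spell out the exact CKA_Delta definition, so the pairing scheme, the function names (`linear_cka`, `contrastive_difference_cka`), and the choice of the linear kernel are all assumptions, not the paper's implementation.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x has shape (n_examples, d_x), y has shape (n_examples, d_y);
    rows must refer to the same examples in both models.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def contrastive_difference_cka(pos_a, neg_a, pos_b, neg_b):
    """Assumed form of a contrastive-difference CKA (hypothetical helper).

    pos_* / neg_* hold activations of models A and B on matched
    concept-present / concept-absent inputs. Subtracting the pair is
    meant to remove similarity that is not specific to the concept.
    """
    return linear_cka(pos_a - neg_a, pos_b - neg_b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 64
    # Synthetic stand-ins for two models with different hidden sizes.
    pos_a, neg_a = rng.normal(size=(n, 32)), rng.normal(size=(n, 32))
    pos_b, neg_b = rng.normal(size=(n, 48)), rng.normal(size=(n, 48))
    print(contrastive_difference_cka(pos_a, neg_a, pos_b, neg_b))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is what makes it usable across models with different hidden sizes.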

Evaluation and Benchmarking AI Safety Research CKA_Delta Gemma Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 11d ago

KATE framework improves LLM tool calling via experiential knowledge integration and parallel reasoning

Researchers present KATE (Knowledge-Augmented Tool Execution), a framework addressing LLM failures in multi-step tool use by systematically studying knowledge acquisition, activation, and internalization. Key findings include that instance-level experiential knowledge outperforms abstract intent-level knowledge, that expanding reasoning width via parallel sampling with aggregation beats deeper chain-of-thought, and that reinforcement learning outperforms supervised fine-tuning for knowledge internalization. KATE is evaluated on BFCL-V3 and AppWorld benchmarks, showing consistent improvements over strong baselines across model scales.

Evaluation and Benchmarking Agent and Tool Ecosystem BFCL-V3 AppWorld Knowledge-Augmented Tool Execution +1 more

6 · arXiv · cs.AI · 29d ago

LCGuard: Adversarial Training Framework for Safe KV Cache Sharing in Multi-Agent LLM Systems

LCGuard introduces a framework for preventing sensitive information leakage when multi-agent LLM systems share KV caches as a latent communication channel. The approach formalizes leakage operationally via reconstruction: a shared cache artifact is deemed unsafe if an adversarial decoder can recover sensitive inputs from it. An adversarial training loop pits a reconstructor against LCGuard's representation-level transformations, which aim to preserve task-relevant semantics while suppressing recoverable sensitive content. Empirical results across multiple model families and multi-agent benchmarks show reduced reconstruction-based leakage and attack success rates with competitive task performance.

Inference Economics AI Safety Research KV Cache representation-level sensitive information leakage LCGuard +4 more

6 · arXiv · cs.CL · 46h ago

Activation-space directions for detecting and mitigating emergent misalignment across LLM families

Researchers fine-tuned four small instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3B) on insecure code to induce emergent misalignment, then investigated whether a shared activation-space direction could detect and correct it. A difference-in-means direction achieves 99.6% separation of aligned vs. misaligned activations within each model, and causal steering by subtracting this direction reduces misaligned behavior by 21–51 points. Cross-architecture transfer via ridge regression yields large behavioral suppression but fails specificity controls, revealing a two-tier structure: within-model directions are causally specific and actionable, while cross-model directions are real but non-specific. The findings bound the utility of linear cross-architecture correction and recommend within-model probing for safety auditing.
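The within-model procedure described here can be sketched on synthetic data as follows. This is a minimal illustration, not the authors' code: the midpoint-threshold definition of "separation", the steering strength `alpha`, and all function names are assumptions, and real use would operate on residual-stream activations extracted from a specific layer.

```python
import numpy as np


def difference_in_means_direction(aligned, misaligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    direction = misaligned.mean(axis=0) - aligned.mean(axis=0)
    return direction / np.linalg.norm(direction)


def separation_accuracy(aligned, misaligned, direction):
    """Fraction of activations on the correct side of a midpoint threshold.

    Assumed metric: project onto the direction and threshold halfway
    between the two class means.
    """
    midpoint = 0.5 * (aligned.mean(axis=0) + misaligned.mean(axis=0))
    threshold = midpoint @ direction
    correct = (aligned @ direction < threshold).sum()
    correct += (misaligned @ direction >= threshold).sum()
    return correct / (len(aligned) + len(misaligned))


def steer(activations, direction, alpha=1.0):
    """Subtract alpha times the unit direction from every activation."""
    return activations - alpha * direction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    offset = np.zeros(16)
    offset[0] = 6.0  # synthetic "misalignment" shift along one axis
    aligned = rng.normal(size=(200, 16))
    misaligned = rng.normal(size=(200, 16)) + offset
    d = difference_in_means_direction(aligned, misaligned)
    print("separation:", separation_accuracy(aligned, misaligned, d))
    steered = steer(misaligned, d, alpha=6.0)
    print("mean projection after steering:", float((steered @ d).mean()))
```

Because the direction is normalized, `alpha` is expressed in activation units along that direction, so subtracting it shifts every projection by exactly `alpha`.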

Evaluation and Benchmarking AI Safety Research Llama 3.2 Gemma 2 Qwen2.5-1.5B +4 more

4 · arXiv · cs.CL · 25d ago

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

This paper empirically validates a creative quality metric from a companion work (Calibrated Surprise, Zou & Xu 2026a) under strict low-resource conditions: ~100 expert chain-of-thought annotations and a small base model. The authors introduce Creative Quality Alignment (CQA) as a class of engineering methods and identify a systematic bias in public alignment datasets toward craft knowledge, with weak coverage of audience modeling and reality-logic. A theoretical argument based on 'architectural duality' in single conditional distribution LLMs is offered to explain why so few examples suffice, distinguishing the result from purely empirical findings like LIMA.

Evaluation and Benchmarking Alignment and RLHF BC Protocol Creative Quality Alignment (CQA) Zou +4 more

6 · arXiv · cs.CL · 24d ago

MATCHA: Contrastive Semantic Alignment Metric for LLM Evaluation

MATCHA is a new automatic evaluation metric for LLMs that addresses a fundamental flaw in existing metrics: both token-overlap (ROUGE) and embedding-based (BERTScore) metrics routinely assign near-identical scores to semantically contradictory texts. The metric uses a dual-view approach that rewards proximity to a gold reference while penalizing adversarially generated counterfactual contradictions. Evaluated across eight benchmarks spanning QA, summarization, NLI, and semantic similarity tasks, MATCHA outperforms 23 embedding models and achieves 18.38% and 20.82% improvements over ROUGE-L and BERTScore respectively on TruthfulQA. Code and metric are publicly released.

Evaluation and Benchmarking AI Safety Research TruthfulQA ROUGE-L Siran Li +3 more

6 · arXiv · cs.LG · 8d ago

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset, outperforming self-consistency, semantic entropy, and P(True) in cross-dataset robustness. At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts +0.086–0.096, AUROC lifts +0.092–0.164) at equal sampling cost, with results extending to frontier thinking models using chain-of-thought decompositions.

Evaluation and Benchmarking AI Safety Research operadic consistency Chain-of-Thought Self-Consistency MuSiQue +6 more

7 · arXiv · cs.AI · 29d ago

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

Evaluation and Benchmarking AI Safety Research Invariant Risk Minimization Matching Principle Qwen2.5-7B +5 more

7 · arXiv · cs.CL · 11d ago

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.

Long Context Evolution Inference Economics End-to-End Context Compression at Scale Latent Context Language Models +1 more