5 · arXiv cs.CL (Computation and Language) · 41h ago

LOCOS: Logit-Contribution Scoring identifies non-literal retrieval heads in long-context LLMs

A new arXiv preprint introduces Logit-Contribution Scoring (LOCOS), a method for identifying the attention heads responsible for non-literal retrieval in long-context LLMs, that is, cases where the model synthesizes an answer from meaning rather than copying tokens verbatim. Existing detectors fail at this task because they rely on a literal-copy criterion that misses retrieval carried out through the output-value (OV) circuit. Evaluated across Qwen3, Gemma-3, and OLMo-3.1, LOCOS outperforms prior attention-based detectors on the NoLiMa benchmark: ablating 50 heads on Qwen3-8B collapses ROUGE-L from 0.401 to 0.000, while the best baseline retains 0.292. The identified heads are retrieval-specific; ablating them leaves parametric recall and arithmetic reasoning unaffected.
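The summary does not state the LOCOS scoring formula. As a rough illustration of the general idea (scoring a head by what it writes to the answer token's logit through its OV path, rather than by whether its most-attended token is copied), here is a minimal Python sketch of generic direct logit attribution. Every function name, shape, and toy number below is hypothetical and is not taken from the preprint.

```python
# Illustrative sketch only: generic direct logit attribution per attention head.
# This is NOT the preprint's LOCOS rule; names and toy values are assumptions.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def head_logit_contribution(attn_weights, values, w_o, unembed_row):
    """Contribution of one head to the logit of one target token.

    attn_weights: attention from the final position to each source position
    values:       one value vector (length d_head) per source position
    w_o:          the head's output projection, d_model rows by d_head columns
    unembed_row:  unembedding vector of the target token (length d_model)
    """
    d_head = len(values[0])
    # Attention-weighted mix of value vectors: the head's raw output.
    mixed = [sum(a * v[i] for a, v in zip(attn_weights, values))
             for i in range(d_head)]
    # Project into the residual stream, then read off the target-token logit.
    residual_write = matvec(w_o, mixed)
    return sum(u * r for u, r in zip(unembed_row, residual_write))

def rank_heads(head_inputs, unembed_row):
    """Score every head and return (names sorted by descending score, scores)."""
    scores = {name: head_logit_contribution(a, v, w, unembed_row)
              for name, (a, v, w) in head_inputs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy example: two heads, two source positions, d_model = d_head = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_heads = {
    "L0H0": ([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], identity),
    "L0H1": ([0.5, 0.5], [[0.0, 1.0], [0.0, 1.0]], identity),
}
target_unembed = [1.0, 0.0]  # hypothetical unembedding of the answer token
ranking, scores = rank_heads(toy_heads, target_unembed)
print(ranking, scores)
```

In this toy setup the first head writes along the answer token's unembedding direction and ranks first, regardless of which source token it attends to most; a head-ablation study like the one described above would then zero out the top-ranked heads and re-measure ROUGE-L.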

Long Context Evolution Evaluation and Benchmarking MuSiQue OLMo-3 Gemma-3-4B-IT Logit-Contribution Scoring NoLiMa Qwen3 BABI-Long

Related guides (2)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

6 · arXiv · cs.LG · Jun 12, 2026 · source ↗

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset, outperforming self-consistency, semantic entropy, and P(True) in cross-dataset robustness. At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts of +0.086–0.096, AUROC lifts of +0.092–0.164) at equal sampling cost. The results extend to frontier thinking models that use chain-of-thought decompositions.
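To make the check concrete, here is a minimal Python sketch of the agreement test described above: answer the question directly, answer it again by chaining sub-questions, and compare. It is only an illustration under assumptions. The `ask` callable stands in for an LLM call, the decomposition is passed in by hand rather than elicited from the model, exact string match is an assumed comparison rule, and the toy facts about "Ada" are invented.

```python
# Illustrative sketch only: a direct-vs-composed agreement check.
# The paper's exact OC procedure is not given in the summary; all names here
# are hypothetical.

def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())

def is_consistent(ask, question, sub_question_templates):
    """Return True if the direct answer matches the composed answer.

    ask: callable taking a question string and returning an answer string
    sub_question_templates: ordered sub-questions; '{prev}' is filled with the
        answer to the previous sub-question
    """
    direct = ask(question)
    prev = ""
    for template in sub_question_templates:
        prev = ask(template.format(prev=prev))
    return normalize(direct) == normalize(prev)

def make_toy_model(answers):
    """Build a lookup-table stand-in for an LLM."""
    return lambda q: answers.get(q, "unknown")

question = "What is the capital of the country where Ada was born?"
steps = ["In which country was Ada born?", "What is the capital of {prev}?"]
hops = {
    "In which country was Ada born?": "France",
    "What is the capital of France?": "Paris",
}

coherent_model = make_toy_model({**hops, question: "Paris"})
incoherent_model = make_toy_model({**hops, question: "Lyon"})

coherent = is_consistent(coherent_model, question, steps)
incoherent = is_consistent(incoherent_model, question, steps)
print(coherent, incoherent)
```

No gold label is consulted at any point, which is what makes a signal of this kind label-free: the second toy model is flagged only because its direct answer disagrees with its own hop-by-hop answer.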

Evaluation and Benchmarking AI Safety Research operadic consistency Chain-of-Thought Self-Consistency MuSiQue +6 more

5 · arXiv · cs.CL · May 29, 2026 · source ↗

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong is a long document translation agent that uses a 3E memory module (Essence-Exemplar-Entity) to store structured historical context, replacing passive full-context attention with RL-optimized adaptive context selection. The agent learns its context retrieval policy via reinforcement learning on self-sampled reasoning trajectories. Evaluations show average gains of up to 13.0 points across three metrics in English↔Chinese, German, and French translation directions, with strong generalization and robustness to noise in ultra-long documents.

Long Context Evolution Agent and Tool Ecosystem YutongWang1216 3E Memory Module Reinforcement Learning +3 more

4 · arXiv · cs.CL · Jun 15, 2026 · source ↗

LoSoNA benchmark evaluates LLM adaptation to implicit local social norms in group chats

Researchers introduce LoSoNA, a benchmark for testing whether LLM-based agents can infer and adapt to unstated local conversational norms in multi-party chat scenarios. Each scenario presents a group-chat transcript in which non-subject participants implicitly demonstrate a hidden norm, followed by an elicitor turn. Eight frontier and open-weight models are evaluated under four prompting conditions. Naive prompting performs poorly for most models, while explicit norm-aware prompting yields uneven gains: Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%. The work contributes to growing interest in evaluating LLM social and pragmatic capabilities beyond factual or reasoning tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemini 3.1 Pro Claude Fable 5 LoSoNA

6 · arXiv · cs.CL · May 29, 2026 · source ↗

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking Alignment and RLHF LoMo LLaVA-OneVision-1.5-8B Qwen3-4B +3 more

4 · arXiv · cs.CL · May 21, 2026 · source ↗

LexNeo-Bench: Probing LLM Knowledge of Lexical Borrowing in Luxembourgish via Knowledge-Graph Prompting

Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.

Evaluation and Benchmarking Agent and Tool Ecosystem LexNeo-Bench knowledge graph prompting LuxBorrow +2 more

5 · arXiv · cs.AI · 17h ago · source ↗

ReContext: Training-free recursive evidence replay improves LLM long-context reasoning

Researchers introduce RECONTEXT, a training-free inference-time method for improving long-context reasoning in LLMs. The approach uses model-internal relevance signals to build a query-conditioned evidence pool that is replayed before final generation, without modifying the original context, relying on external memory, or pruning the context. Experiments across eight long-context datasets at 128K context length show consistent improvements on Qwen3-4B, Qwen3-8B, and Llama3-8B. The authors ground the method in associative memory theory, framing attention as cue-trace association and replay as trace reactivation.
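The replay idea can be pictured with a small Python sketch: keep the full context untouched, pick the chunks with the highest relevance to the query, and repeat them just before the question. This is a single-round simplification under assumptions, not the authors' implementation. The relevance scores are supplied by hand here (the paper derives them from model-internal signals), and the function name, prompt layout, `top_k` parameter, and toy text are all hypothetical.

```python
# Illustrative sketch only: one round of evidence replay before the question.
# Relevance scores are given as input; how RECONTEXT computes them, and how it
# recurses, is not described in the summary.

def build_replay_prompt(chunks, relevance, query, top_k=2):
    """Return (prompt, evidence) with the top_k most relevant chunks replayed.

    chunks:    context split into text chunks, in document order
    relevance: one score per chunk (higher means more relevant to the query)
    """
    if len(chunks) != len(relevance):
        raise ValueError("need exactly one relevance score per chunk")
    top = sorted(range(len(chunks)), key=lambda i: relevance[i],
                 reverse=True)[:top_k]
    # Replay the selected chunks in their original document order.
    evidence = [chunks[i] for i in sorted(top)]
    parts = [
        "\n".join(chunks),           # original context, unmodified and unpruned
        "Evidence replay:",
        "\n".join(evidence),         # query-conditioned evidence pool
        "Question: " + query,
    ]
    return "\n\n".join(parts), evidence

toy_chunks = [
    "The meeting is on Tuesday.",
    "Lena keeps her bike in the shed.",
    "The shed is painted green.",
    "Tickets cost five euros.",
]
toy_relevance = [0.1, 0.8, 0.7, 0.05]  # hand-assigned stand-in scores
toy_query = "What color is the building where Lena keeps her bike?"

prompt, evidence = build_replay_prompt(toy_chunks, toy_relevance, toy_query)
print(prompt)
```

Because the evidence is appended rather than substituted, the model still sees the whole context; the replay only moves the likely-relevant chunks closer to the point of generation.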

Long Context Evolution Agent and Tool Ecosystem Llama3-8B Yanjun Zhao Qwen3-4B +1 more

6 · arXiv · cs.CL · Jun 16, 2026 · source ↗

LOGOS: A unified autoregressive foundation model for natural science tasks across domains

Researchers introduce LOGOS (Language Of Generative Objects in Science), a generative language model that encodes heterogeneous scientific objects and spatial interactions as discrete token sequences within a single autoregressive framework, avoiding explicit coordinates or geometric neural networks. Models are trained at 1B, 3B, and 8B parameter scales and consistently match or outperform domain-specific baselines across diverse scientific tasks. The work argues that AI for Science should converge on shared architectures and training paradigms with LLMs rather than maintaining a separate technical stack. Model weights are released publicly.

Frontier Model Releases Open Weights Progress Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences LOGOS

3 · arXiv · cs.CL · Jun 8, 2026 · source ↗

Supervised vs. in-context learning for Turkish multiword expression classification

A new arXiv paper evaluates Turkish idiomatic light verb construction (LVC) detection as a binary classification task, comparing a supervised BERTurk baseline against three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting. Results show that the LLMs have very low LVC recall in the zero-shot setting but improve substantially with demonstrations, though one-shot prompting can introduce strong model-specific biases. The supervised baseline remains competitive, while carefully constructed few-shot prompts allow GPT-OSS-20B and Qwen 2.5-14B to match or exceed it. The study highlights significant prompt sensitivity in Turkish metalinguistic classification tasks.

Evaluation and Benchmarking Qwen2.5-7B BERTurk gpt-oss-20b
