A new arXiv preprint introduces Logit-Contribution Scoring (LOCOS), a method for identifying attention heads responsible for non-literal retrieval in long-context LLMs — cases where models synthesize answers from meaning rather than copying tokens verbatim. Existing detectors fail at this task because they rely on a literal-copy criterion that misses the output-value (OV) circuit mechanism. Evaluated across Qwen3, Gemma-3, and OLMo-3.1, LOCOS outperforms prior attention-based detectors on the NoLiMa benchmark, with ablation of 50 heads on Qwen3-8B collapsing ROUGE-L from 0.401 to 0.000 while the best baseline retains 0.292. The identified heads are retrieval-specific, leaving parametric recall and arithmetic reasoning unaffected.
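The idea of scoring heads by their contribution to the answer logit can be illustrated with a minimal sketch. The summary does not give the paper's exact scoring rule, so this is an assumption: it uses a plain direct-logit-attribution dot product and ignores layer normalization. The function names and toy vectors are hypothetical.

```python
# Hypothetical sketch: the scoring rule and all names are assumptions,
# not the LOCOS implementation.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def logit_contributions(head_outputs, unembed_target):
    """Score each head by how much its output raises the target token's logit.

    head_outputs: dict mapping (layer, head) to the vector that head writes
        into the residual stream at the answer position (attention-weighted
        values passed through the head's output projection, i.e. the OV path).
    unembed_target: unembedding vector of the correct answer token.
    """
    return {head: dot(vec, unembed_target) for head, vec in head_outputs.items()}

def top_k_heads(scores, k):
    """Return the k highest-scoring heads, the candidates for ablation."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: four heads writing 3-dimensional vectors.
head_outputs = {
    (0, 0): [0.1, 0.0, 0.2],
    (0, 1): [0.9, 0.4, 0.0],
    (1, 0): [-0.3, 0.1, 0.0],
    (1, 1): [0.5, 0.5, 0.5],
}
unembed_target = [1.0, 0.5, 0.0]

scores = logit_contributions(head_outputs, unembed_target)
ablate = top_k_heads(scores, k=2)
```

In an ablation study of the kind described, the outputs of the heads in `ablate` would be zeroed or mean-replaced before re-running generation and measuring ROUGE-L.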
Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset, and it is more robust across datasets than self-consistency, semantic entropy, and P(True). At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts of +0.086–0.096, AUROC lifts of +0.092–0.164) at equal sampling cost, with results extending to frontier thinking models using chain-of-thought decompositions.
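The consistency check described above can be sketched as follows. This is a simplified illustration under stated assumptions: `ask` is a hypothetical callable standing in for the model, the decomposition is assumed to have been obtained from the model already as a chain of question templates with a `{prev}` slot, and agreement is tested by normalized exact match, which may be cruder than the paper's comparison.

```python
# Hypothetical sketch of a direct-vs-composed agreement check.
# `ask`, the {prev} template convention, and exact-match comparison are assumptions.

def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())

def operadic_consistent(ask, query, sub_questions):
    """Return True if the direct answer matches the answer composed step by step.

    sub_questions: the model's own decomposition, as templates in which
        "{prev}" is replaced by the previous step's answer.
    """
    direct = ask(query)
    composed = ""
    for template in sub_questions:
        composed = ask(template.format(prev=composed))
    return normalize(direct) == normalize(composed)

def oc_rate(flags):
    """Fraction of queries on which the model was consistent."""
    return sum(flags) / len(flags)

# Toy stand-in for a model: a lookup table.
answers = {
    "What is the capital of the country Lyon is in?": "Paris ",
    "Which country is Lyon in?": "France",
    "What is the capital of France?": "paris",
}
ask = answers.__getitem__

query = "What is the capital of the country Lyon is in?"
steps = ["Which country is Lyon in?", "What is the capital of {prev}?"]
consistent = operadic_consistent(ask, query, steps)
```

No gold label is consulted anywhere in the check, which is what makes the signal label-free; averaging the flags over a dataset with `oc_rate` gives a single score that can be compared with accuracy.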
Loong is a long document translation agent that uses a 3E memory module (Essence-Exemplar-Entity) to store structured historical context, replacing passive full-context attention with RL-optimized adaptive context selection. The agent learns its context retrieval policy via reinforcement learning on self-sampled reasoning trajectories. Evaluations show average gains of up to 13.0 points across three metrics in English↔Chinese, German, and French translation directions, with strong generalization and robustness to noise in ultra-long documents.
Researchers introduce LoSoNA, a benchmark for testing whether LLM-based agents can infer and adapt to unstated local conversational norms in multi-party chat scenarios. Each scenario presents a group-chat transcript where non-subject participants implicitly demonstrate a hidden norm, followed by an elicitor turn. Eight frontier and open-weight models are evaluated under four prompting conditions; naive prompting performs poorly for most models, while explicit norm-aware prompting yields uneven gains—Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%. The work contributes to growing interest in evaluating LLM social and pragmatic capabilities beyond factual or reasoning tasks.
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is lightweight and architecture-agnostic: it changes only the training data, not the model.
Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.
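Injecting an instance-specific subgraph into a prompt might look roughly like the sketch below. The triple format, the prompt wording, the label names, and the placeholder entries (`WORD_A`, `LANG_X`, and so on) are all hypothetical; the summary does not describe the benchmark's actual schema or label set.

```python
# Hypothetical sketch: triple schema, prompt layout, and labels are assumptions.

def instance_subgraph(triples, lemma):
    """Keep only the triples that mention the target lemma as subject or object."""
    return [(s, p, o) for (s, p, o) in triples if lemma in (s, o)]

def build_prompt(lemma, sentence, triples, labels):
    """Assemble a classification prompt that lists the lemma's graph facts."""
    facts = instance_subgraph(triples, lemma)
    lines = [
        f"Target word: {lemma}",
        f"Sentence: {sentence}",
        "Known lexical facts:",
    ]
    lines += [f"- {s} {p} {o}" for (s, p, o) in facts] or ["- (none)"]
    lines.append("Choose one label: " + ", ".join(labels))
    return "\n".join(lines)

# Placeholder graph; real entries would come from the linguistic knowledge graph.
graph = [
    ("WORD_A", "attested_in", "LANG_X"),
    ("WORD_B", "attested_in", "LANG_Y"),
    ("WORD_A", "first_seen", "YEAR_N"),
]
prompt = build_prompt("WORD_A", "SENTENCE_TEXT", graph, ["LABEL_1", "LABEL_2"])
```

Filtering to the target lemma keeps the prompt short and avoids leaking facts about unrelated words into the context.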
Researchers introduce RECONTEXT, a training-free inference-time method for improving long-context reasoning in LLMs. The approach uses model-internal relevance signals to build a query-conditioned evidence pool that is replayed before final generation, without modifying the original context, relying on external memory, or pruning the context. Experiments across eight long-context datasets at 128K context length show consistent improvements on Qwen3-4B, Qwen3-8B, and Llama3-8B. The authors ground the method in associative memory theory, framing attention as cue-trace association and replay as trace reactivation.
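The replay step can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the relevance scores are taken as given (in the method they come from model-internal signals, whose exact form the summary does not specify), the context is assumed to be pre-split into chunks, and the prompt layout and section headings are hypothetical.

```python
# Hypothetical sketch of evidence replay: chunking, scores, and prompt layout
# are assumptions.

def select_evidence(chunks, relevance, k):
    """Pick the k most relevant chunks, returned in their original order."""
    ranked = sorted(range(len(chunks)), key=lambda i: relevance[i], reverse=True)
    keep = sorted(ranked[:k])
    return [chunks[i] for i in keep]

def build_replay_prompt(chunks, relevance, query, k):
    """Leave the full context intact and repeat the evidence pool before the query."""
    evidence = select_evidence(chunks, relevance, k)
    parts = ["Context:", *chunks, "Relevant evidence (replayed):", *evidence,
             "Question: " + query]
    return "\n".join(parts)

chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
relevance = [0.1, 0.7, 0.2, 0.9]  # one query-conditioned score per chunk
prompt = build_replay_prompt(chunks, relevance, "QUERY_TEXT", k=2)
```

Because every chunk still appears in the `Context:` section, nothing is pruned; the selected chunks are simply shown a second time, close to the question.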
Researchers introduce LOGOS (Language Of Generative Objects in Science), a generative language model that encodes heterogeneous scientific objects and spatial interactions as discrete token sequences within a single autoregressive framework, avoiding explicit coordinates or geometric neural networks. Models are trained at 1B, 3B, and 8B parameter scales and consistently match or outperform domain-specific baselines across diverse scientific tasks. The work argues that AI for Science should converge on shared architectures and training paradigms with LLMs rather than maintaining a separate technical stack. Model weights are released publicly.
A new arXiv paper evaluates Turkish idiomatic light verb construction (LVC) detection as a binary classification task, comparing a supervised BERTurk baseline against three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting. Results show LLMs have very low LVC recall in zero-shot but improve substantially with demonstrations, though one-shot prompting can introduce strong model-specific biases. The supervised baseline remains competitive, while carefully constructed few-shot prompts allow GPT-OSS-20B and Qwen 2.5-14B to match or exceed it. The study highlights significant prompt sensitivity in Turkish metalinguistic classification tasks.