ACROS: Inducing Sense Representations as Interfaces into Frozen Pretrained LMs
ACROS introduces a method to add explicit sense representations (per-token meaning decompositions) to frozen pretrained decoder language models via a gated residual addition, without requiring sense-aware pretraining. On SmolLM2-360M, the induced sense pathway supports zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL), lexical steering across 5,161 CoInCo cases, and cross-lingual adaptation to four languages with near-perfect retrieval accuracy. The approach preserves base LM quality while enabling three distinct downstream uses from the same induced variables, positioning sense representations as a post-hoc modular interface for ordinary pretrained models.
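The summary above does not give the exact form of the gated residual addition, so the following is only a minimal sketch of the general idea under stated assumptions: a softmax-weighted mix of per-token sense vectors, scaled by a single scalar gate, is added to the frozen model's hidden state. The function name `add_sense_pathway`, the scalar gate, and the softmax weighting are hypothetical illustrations, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_sense_pathway(hidden, sense_vectors, sense_scores, gate):
    """Return (hidden + gate * weighted sense mix, sense weights).

    Assumption: the gate is one scalar and the senses are mixed by softmax.
    The frozen base model's hidden state itself is never modified.
    """
    weights = softmax(sense_scores)
    mixed = [
        sum(w * vec[i] for w, vec in zip(weights, sense_vectors))
        for i in range(len(hidden))
    ]
    return [h + gate * m for h, m in zip(hidden, mixed)], weights


# Toy values: one token with a 3-dimensional hidden state and two candidate senses.
hidden = [0.5, -1.0, 2.0]
senses = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores = [2.0, 0.0]

closed, _ = add_sense_pathway(hidden, senses, scores, gate=0.0)
opened, weights = add_sense_pathway(hidden, senses, scores, gate=1.0)
```

With the gate at zero the output equals the base hidden state, which is one way a pathway like this could leave base LM quality untouched. In this sketch the sense weights are also the natural quantity to read off for disambiguation (take the highest-weighted sense) or to overwrite for steering.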
Related events (8)
LOGOS: A unified autoregressive foundation model for natural science tasks across domains
Researchers introduce LOGOS (Language Of Generative Objects in Science), a generative language model that encodes heterogeneous scientific objects and spatial interactions as discrete token sequences within a single autoregressive framework, avoiding explicit coordinates or geometric neural networks. Models are trained at 1B, 3B, and 8B parameter scales and consistently match or outperform domain-specific baselines across diverse scientific tasks. The work argues that AI for Science should converge on shared architectures and training paradigms with LLMs rather than maintaining a separate technical stack. Model weights are released publicly.
CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup
Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14x–1.29x speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.
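To make the 'Backbone-as-Architect' split concrete, here is a small sketch of one decoding step. It assumes the predictor is a single linear layer that emits one logit per candidate span length and that the highest logit wins; the names `predict_span_length` and `decode_step`, the argmax rule, and the toy weights are illustrative assumptions rather than details from the paper.

```python
def predict_span_length(hidden, weight, bias):
    """Single linear layer: one logit per candidate span length (1, 2, 3, ...).

    Assumption: the predicted length is the argmax over those logits.
    """
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]
    best = max(range(len(logits)), key=logits.__getitem__)
    return best + 1  # index 0 corresponds to a span of length 1


def decode_step(backbone_token, mtp_tokens, span_length):
    """Emit the backbone's token first, then at most span_length - 1 MTP tokens."""
    extra = max(0, min(span_length - 1, len(mtp_tokens)))
    return [backbone_token] + mtp_tokens[:extra]


# Toy example: 2-dimensional hidden state, three candidate span lengths.
hidden = [1.0, 0.0]
weight = [[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]]
bias = [0.0, 0.0, 0.0]

length = predict_span_length(hidden, weight, bias)
emitted = decode_step("New", ["York", "City", "is"], length)
single = decode_step("New", ["York", "City", "is"], 1)
```

Because the first token always comes from the backbone LM head, a predicted length of 1 reduces to ordinary one-token decoding. A layer of this shape has `hidden_size * num_lengths + num_lengths` parameters, which is how a single linear layer can stay in the low thousands.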
SDXL in 4 Steps with Latent Consistency LoRAs
Hugging Face demonstrates combining Latent Consistency Models (LCMs) with LoRA adapters to enable high-quality image generation with Stable Diffusion XL in as few as 4 inference steps. This approach dramatically reduces the number of diffusion steps required compared to standard SDXL, lowering inference latency and compute cost. The technique leverages consistency distillation applied via lightweight LoRA weights, making it accessible without full model retraining.
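A rough sketch of what 4-step generation might look like with the `diffusers` library follows. The model identifiers, the fp16 settings, and the `generate` helper are assumptions that do not appear in the summary above, and the function needs `torch`, `diffusers`, a GPU, and a model download to actually run.

```python
NUM_INFERENCE_STEPS = 4
# Assumed Hugging Face Hub identifiers; the summary above does not name them.
BASE_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
LCM_LORA_ID = "latent-consistency/lcm-lora-sdxl"


def generate(prompt, device="cuda"):
    """Load SDXL, swap in the LCM scheduler, attach the LCM LoRA, sample in 4 steps."""
    # Imported lazily so the constants above can be inspected without a GPU stack.
    import torch
    from diffusers import DiffusionPipeline, LCMScheduler

    pipe = DiffusionPipeline.from_pretrained(
        BASE_MODEL_ID, variant="fp16", torch_dtype=torch.float16
    )
    # The LCM scheduler is what allows sampling with very few steps.
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    # The distilled consistency behaviour lives in the lightweight LoRA weights.
    pipe.load_lora_weights(LCM_LORA_ID)
    pipe.to(device)

    result = pipe(
        prompt=prompt,
        num_inference_steps=NUM_INFERENCE_STEPS,
        guidance_scale=1.0,  # a scale of 1 effectively turns off classifier-free guidance
    )
    return result.images[0]
```

Calling `generate("a photo of a lighthouse at dusk")` would return a PIL image. The base SDXL weights are loaded unchanged; only the scheduler and the LoRA adapter differ from a standard pipeline.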
Synthetic LLM-generated conversations improve ASR training for low-resource languages
Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.
Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions
This paper investigates whether language models can learn the semantics of rare English constructions (e.g., 'let alone', 'much less'), constructing a novel dataset to test form-meaning pairing understanding. Testing models across parameter counts, architectures, and pretraining dataset sizes, the authors find that modestly sized open-source models can grasp Paired-Focus construction semantics, while models trained on human-scale data fail. Training dynamics analysis reveals that semantic understanding of these constructions emerges later than syntactic knowledge and correlates with gains in world knowledge more broadly.
ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs
Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.
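The context-selection objective can be illustrated with a small sketch of how a group of sampled selections might be scored. It assumes a binary reward (1 when the policy picks the supporting context, 0 otherwise) and the usual GRPO-style normalization of rewards within a sampled group; the reward definition, the use of population standard deviation, and the function names are assumptions, not details taken from the paper.

```python
import statistics


def selection_reward(chosen_index, supporting_index):
    """Assumed binary reward: 1.0 if the supporting context was chosen, else 0.0."""
    return 1.0 if chosen_index == supporting_index else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled rollouts for one contrastive pair; context 0 is the supporting one.
choices = [0, 1, 0, 0]
rewards = [selection_reward(c, supporting_index=0) for c in choices]
advantages = group_relative_advantages(rewards)
```

Rollouts that select the supporting context receive a positive advantage and the one that selects the distractor receives a negative one. If every rollout in a group makes the same choice, all advantages are zero and that group contributes no learning signal.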
ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens
ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

