3arXiv cs.CL (Computation and Language)·2d ago

ROMEVA: Geometry-preserving vocabulary expansion for Roman Urdu language models

Researchers propose ROMEVA, a method combining sub-word-average initialization and PCA-guided anchor loss to stabilize embeddings when expanding mBERT's vocabulary for Roman Urdu, a morphologically inconsistent low-resource language with high sub-word fragmentation. The method is evaluated on a 36,130-comment corpus with 500 new tokens added to mBERT. A notable finding is that while ROMEVA best preserves the pretrained embedding geometry, naive fine-tuning outperforms it on downstream sentiment classification, revealing a disconnect between embedding stability and task performance.
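As a rough illustration of the sub-word-average initialization mentioned above, the sketch below sets a new token's vector to the element-wise mean of the vectors of the pieces the original tokenizer would have split it into. It shows only the generic averaging idea, not the ROMEVA implementation, and it omits the PCA-guided anchor loss, whose details the summary does not give. The function name `subword_average_init`, the toy table, and the split of "acha" into "ac" and "##ha" are all hypothetical.

```python
from statistics import fmean


def subword_average_init(pieces, embedding_table):
    """Return the element-wise mean of the embeddings of the given sub-word pieces."""
    vectors = [embedding_table[piece] for piece in pieces]
    dim = len(vectors[0])
    return [fmean(vector[i] for vector in vectors) for i in range(dim)]


# Toy 3-dimensional embedding table (made-up values, not mBERT weights).
toy_table = {
    "ac": [1.0, 0.0, 2.0],
    "##ha": [3.0, 2.0, 0.0],
}

# Hypothetical split of the Roman Urdu word "acha" into two existing pieces.
new_vector = subword_average_init(["ac", "##ha"], toy_table)
```

Starting the new row at the average of its pieces keeps it inside the region of the embedding space the pretrained model already uses, which is the usual motivation for this kind of initialization over random vectors.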

Open Weights Progress ROMEVA mBERT

Related guides (1)

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier


Related events (8)

3arXiv · cs.CL·5d ago·source ↗

IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus

Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.

Evaluation and Benchmarking Sepahr-Danesh RoBERTa PQuAD +3 more

5arXiv · cs.CL·46h ago·source ↗

Randomized YaRN improves LLM length generalization for long-context reasoning

Researchers propose Randomized YaRN, a training method that combines YaRN-based positional extrapolation with randomized positional encodings and a length curriculum to improve LLM generalization to long contexts. Models trained on sequences under 8K tokens show consistent reasoning improvements on context lengths from 16K to 128K on BABILong and MRCR benchmarks. The key insight is that exposing models to out-of-distribution positional representations during short-context training enables better generalization at far longer inference-time lengths.
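The idea of exposing a model to out-of-distribution positions during short-context training can be sketched with generic randomized position ids: for a short sequence, draw an ordered subset of ids from a much larger range, so that large position values appear in training. This is an assumption-laden illustration of randomized positional encodings in general. It does not include YaRN's frequency scaling or the length curriculum, and the function name and the choice of 131072 as the range are hypothetical.

```python
import random


def sample_random_positions(seq_len, max_position, rng):
    """Draw seq_len distinct position ids from [0, max_position), in increasing order."""
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    # Sorting preserves the relative order of tokens while stretching the gaps.
    return sorted(rng.sample(range(max_position), seq_len))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# An 8-token training sequence receives ids spread over a 128K-sized range.
positions = sample_random_positions(8, 131072, rng)
```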

Long Context Evolution Evaluation and Benchmarking BABILong Multi-Round Coreference Resolution YaRN +1 more

4arXiv · cs.CL·46h ago·source ↗

LLM embedding spaces partially recover expert-defined symptom structure in mental health language

A new arXiv preprint investigates whether LLM embedding geometry aligns with expert-defined symptom structure in mental health language, using 28 Reddit communities as a testbed. The authors compare pretrained and fine-tuned Qwen3 embeddings (0.6B and 4B) against an expert symptom matrix via representational similarity analysis, with controls for affective, stylistic, and topic confounds. Results show measurable but level-dependent alignment: fine-tuning strengthens it at fine-grained category levels, and larger scale improves both zero-shot alignment and fine-tuning gains. The paper argues that classification accuracy alone is insufficient to validate embedding geometry against domain knowledge.
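Representational similarity analysis, in its common form, compares two spaces through their pairwise dissimilarities rather than their raw coordinates. The sketch below assumes cosine distance for the dissimilarities and Spearman correlation for the comparison, neither of which the summary confirms for this paper, and it leaves out the confound controls entirely. All function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def rdm_upper(vectors):
    """Upper triangle of the pairwise dissimilarity matrix, as a flat list."""
    pairs = combinations(range(len(vectors)), 2)
    return [cosine_distance(vectors[i], vectors[j]) for i, j in pairs]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rsa_score(model_vectors, expert_vectors):
    """Spearman correlation between the two dissimilarity structures."""
    return pearson(average_ranks(rdm_upper(model_vectors)),
                   average_ranks(rdm_upper(expert_vectors)))


# Toy data: three communities, a 2-d model embedding and a 3-symptom expert row each.
toy_model = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_expert = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
score = rsa_score(toy_model, toy_expert)
```

Because only the dissimilarity structure is compared, the two spaces can have different dimensionality, which is what lets an embedding space be scored against an expert symptom matrix.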

Evaluation and Benchmarking Reddit Do LLM Embedding Spaces Recover Expert Structure? Qwen3

4arXiv · cs.CL·8d ago·source ↗

Transformer embeddings shown to intrinsically encode Russell's circumplex model of emotion geometry

A new arXiv paper investigates whether Transformer-based text and speech encoders (RoBERTa, wav2vec 2.0) recover the geometric structure of Russell's circumplex model of affect — a valence-arousal topology from psychology. Experiments on naturalistic datasets (MSP-Podcast) and LLM-generated stimuli show that multimodal fusion achieves perfect topological alignment with Russell's primary emotion ordering, and zero-shot generic text embeddings place fine-grained emotion terms near their human-mapped coordinates. The authors argue this structure is intrinsically encoded in the representations rather than being an artifact of labeling, bridging psychological theory and representation learning.

Evaluation and Benchmarking Multimodal Progress Data-Driven Decoding of Russell's Circumplex Model of Affect RoBERTa MSP-Podcast +1 more

4arXiv · cs.CL·12d ago·source ↗

Embedding interpolation study reveals structured benefits of mixed-language queries in multilingual dense retrieval

A ratio-controlled study on mMARCO evaluates how mixing proportions of parallel query translations via embedding-level interpolation affect multilingual dense retrieval performance. Using BGE-M3, the authors find that an optimal mixing ratio outperforms the best monolingual endpoint in 88 of 105 cases, with a clear asymmetry driven by English dominance. Mixing is uniformly beneficial for non-English document indices, while English-containing indices are best served by pure English queries, and mixing gains correlate negatively with typological distance when controlling for English dominance.
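Embedding-level interpolation of two parallel query translations can be sketched as a convex combination of their vectors, with the mixing ratio as the only parameter. The sketch assumes a linear mix followed by L2 normalization. The normalization step is an assumption rather than something the summary states, and the function name and toy vectors are hypothetical stand-ins for real BGE-M3 outputs.

```python
import math


def interpolate_queries(emb_a, emb_b, ratio):
    """Mix two query embeddings: ratio * emb_a + (1 - ratio) * emb_b, then L2-normalize."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    mixed = [ratio * a + (1.0 - ratio) * b for a, b in zip(emb_a, emb_b)]
    norm = math.sqrt(sum(x * x for x in mixed))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return mixed if norm == 0.0 else [x / norm for x in mixed]


# Toy 2-d embeddings standing in for the same query in two languages.
query_lang_a = [1.0, 0.0]
query_lang_b = [0.0, 1.0]
mixed_query = interpolate_queries(query_lang_a, query_lang_b, 0.5)
```

Setting `ratio` to 0.0 or 1.0 recovers the two monolingual endpoints, so a sweep over intermediate values is what makes a comparison against the best endpoint possible.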

Evaluation and Benchmarking BGE-M3 mMARCO When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval

4arXiv · cs.CL·15d ago·source ↗

Attention Expansion mechanism improves keyphrase extraction from long documents without full-context LLMs

Researchers propose an 'attention expansion' mechanism that augments pre-trained language model token representations with information from out-of-context chunks using static word embeddings, enabling more effective keyphrase extraction from long documents. The approach avoids the computational cost of full-document attention or LLM-based inference while expanding the effective contextual scope of PLM-based models. Evaluated across five PLM backbones and five benchmark corpora, the method consistently improves F1 scores over state-of-the-art baselines in both scientific and news domains.

Long Context Evolution Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

5arXiv · cs.CL·16d ago·source ↗

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking Multimodal Progress DOCCI MS COCO IIW +4 more

4arXiv · cs.CL·2d ago·source ↗

Variance-Calibrated Modulation (VCM): training-free decoding intervention to address LLM likelihood trap

Researchers propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding method that reshapes LLM probability distributions before truncation to combat repetitive degeneration and vocabulary dullness. VCM combines two mechanisms: Contextual Searchlight via PMI (suppressing stopwords, elevating context-relevant tokens) and Adaptive Self-Debiasing (scale-invariant penalization using real-time logit standard deviation). Evaluated across open-ended generation, factual QA, and mathematical reasoning, VCM improves diversity, coherence, and reasoning accuracy at higher temperatures with negligible overhead. The method is compatible with existing decoding strategies like Top-p and Min-p.

Evaluation and Benchmarking Inference Economics Adaptive Self-Debiasing Variance-Calibrated Modulation Contextual Searchlight via PMI