ROMEVA

technique · active · provisional · romeva-dd5251e9 · 1 event · first seen 2d ago

Aliases: ROMEVA

Recent events (1)

3 · arXiv · cs.CL · 2d ago

ROMEVA: Geometry-preserving vocabulary expansion for Roman Urdu language models

Researchers propose ROMEVA, a method combining sub-word-average initialization and PCA-guided anchor loss to stabilize embeddings when expanding mBERT's vocabulary for Roman Urdu, a morphologically inconsistent low-resource language with high sub-word fragmentation. The method is evaluated on a 36,130-comment corpus with 500 new tokens added to mBERT. A notable finding is that while ROMEVA best preserves the pretrained embedding geometry, naive fine-tuning outperforms it on downstream sentiment classification, revealing a disconnect between embedding stability and task performance.
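
The sub-word-average initialization named in the summary can be illustrated with a small sketch. This is not the authors' code: the function name `subword_average_init`, the toy embedding table, and the toy sub-word splits are all hypothetical, and the sketch assumes the usual reading of the technique, in which a newly added token starts as the mean of the vectors of the sub-words it previously fragmented into. The PCA-guided anchor loss is not shown because the summary does not specify its form.

```python
from typing import Callable, Dict, List

Vector = List[float]


def subword_average_init(
    new_tokens: List[str],
    tokenize: Callable[[str], List[str]],
    embeddings: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Initialise each new token as the mean of its existing sub-word vectors.

    Hypothetical helper for illustration only; `tokenize` stands in for the
    pretrained tokenizer and `embeddings` for the pretrained embedding table.
    """
    initialised: Dict[str, Vector] = {}
    for token in new_tokens:
        # Keep only the pieces that already have a pretrained vector.
        pieces = [p for p in tokenize(token) if p in embeddings]
        if not pieces:
            raise ValueError(f"no known sub-words for {token!r}")
        dim = len(embeddings[pieces[0]])
        # Element-wise mean over the sub-word vectors.
        initialised[token] = [
            sum(embeddings[p][i] for p in pieces) / len(pieces)
            for i in range(dim)
        ]
    return initialised


# Toy data (assumed, not from the paper): a 2-dimensional embedding table
# and a fixed sub-word split for one Roman Urdu word.
toy_embeddings = {"ach": [1.0, 0.0], "##ha": [0.0, 1.0]}
toy_splits = {"acha": ["ach", "##ha"]}

new_vectors = subword_average_init(["acha"], toy_splits.__getitem__, toy_embeddings)
print(new_vectors["acha"])  # [0.5, 0.5]
```

Starting a new token at the mean of its former pieces places it inside the region of the embedding space the model already uses, which is the kind of geometry preservation the summary attributes to the method.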