paper
Speaker Group Encoding in Self-supervised Speech Recognition Models
paper · active · provisional
speaker-group-encoding-in-self-supervised-speech-recognition-models-f18dba7b · 1 event · first seen 7d ago
Aliases: Speaker Group Encoding in Self-supervised Speech Recognition Models
More like this (12)
From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing
speaker-attribute classification
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
Sparse Autoencoders
Self-Supervised Pretraining
Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
Self-Supervised Learning
encoder-only language models
Beyond task performance: Decoding bioacoustic embeddings with speech features
Sparse Embedding Models
Recent events (1)
Study reveals how self-supervised speech models encode speaker group attributes across fine-tuning stages
Researchers investigate what self-supervised speech recognition models (S3Ms) learn about speaker group categories, including gender, age, dialect, ethnicity, and native-speaker status, across four model states: pretrained, fine-tuned for speaker identification (SID), fine-tuned for automatic speech recognition (ASR), and fairness-enhanced. They find that SID fine-tuning amplifies phonetically variant speaker group information, while ASR fine-tuning discards it but retains semantically variant information. Fairness-enhancing ASR algorithms mainly affect the encoding of phonetically variant speaker groups and have limited impact on semantically variant categories. The findings offer guidance for designing fairer ASR systems.