3 · arXiv cs.CL (Computation and Language) · 12h ago

DG^VoiC: Speaker clustering framework for fraud detection in call-centre audio

Researchers present DG^VoiC, a voice clustering framework designed to identify repeated speakers across anonymised call-centre recordings for insurance fraud investigation. The system combines anonymisation-aligned preprocessing, sliding-window speaker embeddings, and cosine-similarity clustering, and is evaluated on 121 real telephony recordings. On a curated 56-sample reference set, the best configuration achieves 96% adjusted mutual information (AMI), 95% adjusted Rand index (ARI), and 100% homogeneity, suggesting that speaker identity is a viable but underutilised signal for fraud detection workflows.
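
As a rough illustration of how sliding-window embeddings and cosine-similarity clustering can fit together, here is a minimal sketch. It is not the DG^VoiC implementation: the summary does not say which embedding model, pooling step, clustering rule, or threshold the authors use. The mean pooling, the single-linkage grouping, the 0.8 threshold, and the toy 2-D vectors below are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_embedding(windows):
    """Pool per-window embeddings into one vector per recording (assumed: plain mean)."""
    n = len(windows)
    return [sum(col) / n for col in zip(*windows)]


def cluster_by_cosine(embeddings, threshold=0.8):
    """Group recordings whose cosine similarity meets the threshold.

    Assumed rule: single linkage via union-find, so any chain of
    sufficiently similar recordings ends up in the same cluster.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel cluster roots as 0..k-1 in first-seen order.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(embeddings))]


# Hypothetical data: three recordings, two sliding windows each, 2-D embeddings.
recordings = [
    [[1.0, 0.1], [0.9, 0.0]],
    [[0.95, 0.05], [1.0, 0.2]],
    [[0.0, 1.0], [0.1, 0.9]],
]
labels = cluster_by_cosine([mean_embedding(w) for w in recordings])
print(labels)  # [0, 0, 1]
```

In this toy case the first two recordings point in nearly the same direction and are grouped as one speaker, while the third falls into its own cluster. A real system would take the window embeddings from a trained speaker-embedding model and tune the threshold on labelled data.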

Enterprise Deployment Patterns · DG^VoiC

Related events (8)

3 · arXiv · cs.CL · 12h ago

Multimodal NLP pipeline for insurance fraud detection at FNOL using synthetic dialogue and audio

A new arXiv preprint introduces a synthetic multimodal framework for insurance fraud detection at the First Notice of Loss (FNOL) stage, combining ASR, speaker diarisation, NER, regex extraction, LLM-RAG retrieval, and speaker embeddings into a rule-based risk scoring system. The framework generates synthetic agent-customer dialogue transcripts and two-speaker audio to address the scarcity of multimodal fraud datasets. Component-level evaluations show stability and transfer potential, offering a reproducible baseline for multimodal fraud detection research.

Multimodal Progress · Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection

3 · arXiv · cs.CL · 5d ago

First Turkish phone scam detection dataset evaluated across seven LLMs in multi-modal settings

Researchers introduce the first public multi-modal dataset of 100 aligned audio-transcript pairs of Turkish scam and benign phone calls, evaluating seven LLMs (Gemini 2.5 Flash/Flash-Lite/Pro, GPT-4o, Qwen Max/Plus/Turbo) under three input conditions. Transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. The work addresses a gap in low-resource language safety research and highlights the need for linguistically inclusive fraud detection systems.

AI Safety Research · Multimodal Progress · Google · GPT-4o · Gemini-2.5-Flash-Lite · +3 more

4 · Hugging Face Blog · 1mo ago

Voice Cloning with Consent

Hugging Face published a blog post addressing consent mechanisms for voice cloning technology. The post appears to discuss frameworks or tooling for ensuring user consent before voice data is used for cloning purposes. This touches on safety, ethics, and deployment patterns for voice synthesis models.

AI Safety Research · Enterprise Deployment Patterns · Hugging Face · consent gate · voice cloning

4 · arXiv · cs.CL · 20d ago

Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading

Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.

Multimodal Progress · Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

5 · arXiv · cs.AI · 19d ago

Explainability pipeline reveals divergent cues used by deepfake speech detectors

Researchers propose an audio-native explainability pipeline that applies Integrated Gradients to time-aligned self-supervised representations to localize decision evidence in deepfake speech detectors. Applied to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on the ASVspoof 5 benchmark, the method reveals that, despite similar performance, each detector relies on fundamentally different cues: environmental noise, phoneme artifacts, and word boundaries, respectively. The findings are validated by causal masking experiments, which confirm that performance degrades when a detector's primary cues are removed. The work advances the interpretability of audio deepfake detection, which is relevant to AI safety and media authenticity.

Evaluation and Benchmarking · AI Safety Research · CA-MHFA · Integrated Gradients · SLS · +4 more

5 · arXiv · cs.CL · 17d ago

ModeratorLM: Role-conditioned turn-taking for multi-party voice agents with 40%+ precision gains

Researchers introduce ModeratorLM, a voice agent system built on a streaming speech LLM that conditions turn-taking behavior on an explicitly assigned conversational role in multi-party settings. A reasoning-augmented variant adds chain-of-thought over the conversational context. Evaluated on real-world meeting data and the new RolePlayConv synthetic dataset, the system improves turn-taking precision by over 40% and recall by 70%, while reducing false-positive interruptions, compared with non-role-conditioned baselines.

Agent and Tool Ecosystem · Multimodal Progress · ModeratorLM · RolePlayConv · Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

3 · arXiv · cs.AI · 14d ago

MoE architecture improves self-supervised speech model robustness for anti-spoofing

Researchers propose converting a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization in synthetic speech detection. Feed-forward blocks in selected encoder layers are replaced by expert networks with a layer-wise gating mechanism, allowing complementary acoustic pattern capture while preserving pretrained representations. Evaluated across 14 spoofing datasets, the approach reduces macro Equal Error Rate from 5.46% to 4.81%, an 11.9% relative improvement over the baseline.
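
The relative-improvement figure follows directly from the two macro Equal Error Rates quoted above. The short check below reproduces the arithmetic; the function name is just a label chosen for this example.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Macro EER values as reported in the summary, in percent.
rel = relative_reduction(5.46, 4.81)
print(f"absolute drop: {5.46 - 4.81:.2f} points")  # absolute drop: 0.65 points
print(f"relative drop: {rel:.1%}")                 # relative drop: 11.9%
```

The absolute drop is 0.65 percentage points. Dividing by the 5.46% baseline gives the 11.9% relative figure.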

Evaluation and Benchmarking · Mixture of Experts · From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

3 · arXiv · cs.CL · 21d ago

KIT submission to IWSLT 2026 cross-lingual voice cloning track with language tag prompting and RL fine-tuning

Researchers from KIT describe their system for the IWSLT 2026 Cross-Lingual Voice Cloning shared task, which aims to synthesize speech in a target language while preserving source-speaker identity. The system builds on FishAudio-S2-Pro, a multilingual TTS model, and introduces language tag prompting to reduce accent leakage, RL fine-tuning for intelligibility, and a reference-conditioned lexical matching method for domain-specific pronunciation. Language prompting yields the largest gains; lexical matching provides consistent improvements on matched subsets.

Multimodal Progress · IWSLT 2026 Cross-Lingual Voice Cloning · FishAudio-S2-Pro · Karlsruhe Institute of Technology