3 · arXiv cs.CL (Computation and Language) · 13h ago

Multimodal NLP pipeline for insurance fraud detection at FNOL using synthetic dialogue and audio

A new arXiv preprint introduces a synthetic multimodal framework for insurance fraud detection at the First Notice of Loss (FNOL) stage. It combines automatic speech recognition (ASR), speaker diarisation, named-entity recognition (NER), regex extraction, LLM-based retrieval-augmented generation (RAG), and speaker embeddings into a rule-based risk scoring system. To address the scarcity of multimodal fraud datasets, the framework generates synthetic agent-customer dialogue transcripts and two-speaker audio. Component-level evaluations show stability and transfer potential, offering a reproducible baseline for multimodal fraud detection research.
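As a rough illustration of what a rule-based risk score over extracted signals can look like, here is a minimal Python sketch. The signal names, weights, and band thresholds are all hypothetical and chosen for illustration only; the summary does not state the preprint's actual rules.

```python
from dataclasses import dataclass


@dataclass
class ClaimSignals:
    """Hypothetical per-call signals; none of these names come from the preprint."""
    missing_policy_number: bool    # regex extraction found no policy number
    date_mismatch: bool            # extracted entities disagree on the incident date
    voice_seen_before: bool        # speaker embedding matches an earlier caller
    similar_claims_retrieved: int  # count of similar past claims from retrieval


def risk_score(signals: ClaimSignals) -> int:
    """Sum fixed, hand-picked weights for each triggered rule (weights are assumptions)."""
    score = 0
    if signals.missing_policy_number:
        score += 1
    if signals.date_mismatch:
        score += 2
    if signals.voice_seen_before:
        score += 3
    # Cap the retrieval contribution so one signal cannot dominate the score.
    score += min(signals.similar_claims_retrieved, 3)
    return score


def risk_band(score: int) -> str:
    """Map a numeric score to a coarse band using illustrative thresholds."""
    if score >= 5:
        return "high"
    if score >= 2:
        return "medium"
    return "low"


if __name__ == "__main__":
    call = ClaimSignals(
        missing_policy_number=False,
        date_mismatch=True,
        voice_seen_before=True,
        similar_claims_retrieved=1,
    )
    print(risk_score(call), risk_band(risk_score(call)))
```

A fixed-weight design like this keeps each contribution to the score inspectable, which is one common reason to prefer rules over a learned classifier when labelled fraud data is scarce.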

Multimodal Progress Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection

Related events (8)

3 · arXiv · cs.CL · 13h ago

DG^VoiC: Speaker clustering framework for fraud detection in call-centre audio

Researchers present DG^VoiC, a voice clustering framework designed to identify repeated speakers across anonymised call-centre recordings for insurance fraud investigation. The system combines anonymisation-aligned preprocessing, sliding-window speaker embeddings, and cosine similarity clustering, and is evaluated on 121 real telephony recordings. On a curated 56-sample reference set, the best configuration achieves 96% AMI, 95% ARI, and 100% homogeneity, suggesting that speaker identity is a viable but underutilised signal for fraud detection workflows.
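To make the idea of cosine similarity clustering over speaker embeddings concrete, the sketch below groups vectors greedily by a similarity threshold. This is not DG^VoiC's algorithm: the greedy assignment, the `cluster_by_cosine` helper, the 0.8 threshold, and the toy 2-D vectors are all assumptions for illustration, and real speaker embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_by_cosine(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar existing cluster, else start a new one.

    Each cluster is represented by its first member (a simplifying assumption).
    Returns one integer cluster label per input embedding.
    """
    representatives = []
    labels = []
    for emb in embeddings:
        best_label, best_sim = -1, threshold
        for label, rep in enumerate(representatives):
            sim = cosine(emb, rep)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label == -1:
            # No existing cluster is similar enough, so open a new one.
            representatives.append(emb)
            best_label = len(representatives) - 1
        labels.append(best_label)
    return labels


if __name__ == "__main__":
    # Toy vectors standing in for per-recording speaker embeddings.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
    print(cluster_by_cosine(toy))
```

With labels like these and a hand-labelled reference set, agreement scores such as AMI, ARI, and homogeneity can then be computed with a standard metrics library.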

Enterprise Deployment Patterns DG^VoiC

3 · arXiv · cs.CL · 5d ago

First Turkish phone scam detection dataset evaluated across seven LLMs in multi-modal settings

Researchers introduce the first public multi-modal dataset of 100 aligned audio-transcript pairs of Turkish scam and benign phone calls, evaluating seven LLMs (Gemini 2.5 Flash/Flash-Lite/Pro, GPT-4o, Qwen Max/Plus/Turbo) under three input conditions. Transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. The work addresses a gap in low-resource language safety research and highlights the need for linguistically inclusive fraud detection systems.

AI Safety Research Multimodal Progress Google GPT-4o Gemini-2.5-Flash-Lite

5 · arXiv · cs.CL · 26d ago

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking FastConformer-Large Efficient ASR Training with Conversations that Never Happened BEA-Dialogue

4 · arXiv · cs.CL · 20d ago

Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading

Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.

Multimodal Progress Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

5 · arXiv · cs.AI · 19d ago

Explainability pipeline reveals divergent cues used by deepfake speech detectors

Researchers propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence in deepfake speech detectors. Applied to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on the ASVspoof 5 benchmark, the method reveals that despite similar performance, each detector relies on fundamentally different cues: environmental noise, phoneme artifacts, and word boundaries respectively. Findings are validated via causal masking experiments that confirm performance degrades when primary cues are removed. The work advances interpretability of audio deepfake detection, relevant to AI safety and media authenticity.

Evaluation and Benchmarking AI Safety Research CA-MHFA Integrated Gradients SLS

4 · arXiv · cs.CL · 13h ago

Multi-stage explainability framework translates transformer speech models into clinical cognitive impairment narratives

A new arXiv preprint proposes a framework for making transformer-based detection of cognitive impairment from speech clinically interpretable. It combines SHAP token attribution, linguistic feature analysis, and a four-stage LLM reasoning pipeline using LLaMA-3.1-70B-Instruct. The system is built on the SpeechCARE-Adaptive Gating Network multimodal model (F1=72.11% on NIA PREPARE) and maps outputs to four cognitive-linguistic dimensions. Physician evaluation on 70 samples showed strong alignment with clinical profiles and a System Usability Scale score of 82/100, suggesting potential for integration into clinical workflows.

Evaluation and Benchmarking AI Safety Research NIA PREPARE Llama 3.3 70B Instruct SpeechCARE-Adaptive Gating Network

3 · arXiv · cs.CL · 4d ago

End-to-end speech-to-speech conversational system for Algerian Dialect using modular NLP pipeline

Researchers present Dziri Voicebot, a modular speech-to-speech conversational system targeting Algerian Dialect, a low-resource language with code-switching and orthographic challenges. The pipeline integrates Whisper-based ASR, transformer NLU, retrieval-augmented generation, and neural TTS, with dedicated datasets constructed for the telecom domain. The system reports low word error rate, high intent classification scores, and stable TTS quality, offering a reproducible baseline for low-resource dialectal conversational AI.

Multimodal Progress Bechiri and Lanasri [2026] Whisper Dziri Voicebot

6 · arXiv · cs.CL · 14d ago

BayLing-Duplex: Native full-duplex speech dialogue using a single autoregressive LLM

Researchers introduce BayLing-Duplex, a speech language model that achieves native full-duplex interaction (simultaneous listening and speaking) using a single autoregressive LLM with no auxiliary VAD or turn-taking module. Built by fine-tuning GLM-4-Voice on 400K samples plus a lightweight DPO stage, it reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, and improves speech-response quality substantially over Moshi. The approach adds only special tokens to the standard vocabulary, making it portable across LLMs without architectural changes.

Frontier Model Releases Multimodal Progress BayLing-Duplex InstructS2S-Eval Direct Preference Optimization (DPO)