3 · arXiv cs.CL (Computation and Language) · 5d ago

First Turkish phone scam detection dataset evaluated across seven LLMs in multi-modal settings

Researchers introduce the first public multimodal dataset of Turkish phone calls: 100 aligned audio-transcript pairs covering scam and benign calls. They evaluate seven LLMs (Gemini 2.5 Flash/Flash-Lite/Pro, GPT-4o, Qwen Max/Plus/Turbo) under three input conditions. Transcript-based inputs consistently outperform direct audio processing, and human-corrected and uncorrected transcripts perform comparably. The work addresses a gap in low-resource language safety research and highlights the need for linguistically inclusive fraud detection systems.

AI Safety Research · Multimodal Progress · Google · GPT-4o · Gemini-2.5-Flash-Lite · Qwen · Gemini-2.5-Pro · OpenAI
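
The comparison across input conditions described above can be illustrated with a minimal sketch. The condition names (`audio`, `raw_transcript`, `corrected_transcript`), the label strings, and the toy records are assumptions made for illustration; they are not the paper's data, labels, or results.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Compute accuracy per input condition.

    records: iterable of (condition, gold_label, predicted_label) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, gold, pred in records:
        totals[condition] += 1
        hits[condition] += int(gold == pred)
    return {c: hits[c] / totals[c] for c in totals}


# Made-up toy records, NOT results from the paper.
records = [
    ("audio", "scam", "benign"),
    ("audio", "benign", "benign"),
    ("raw_transcript", "scam", "scam"),
    ("raw_transcript", "benign", "benign"),
    ("corrected_transcript", "scam", "scam"),
    ("corrected_transcript", "benign", "benign"),
]

scores = accuracy_by_condition(records)
for condition, acc in sorted(scores.items()):
    print(f"{condition}: {acc:.2f}")
```

Keeping the condition as an explicit field on every record makes it straightforward to run the same model over the same calls and compare conditions side by side.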

Related events (8)

3 · arXiv · cs.CL · 13h ago

Multimodal NLP pipeline for insurance fraud detection at FNOL using synthetic dialogue and audio

A new arXiv preprint introduces a synthetic multimodal framework for insurance fraud detection at the First Notice of Loss (FNOL) stage, combining ASR, speaker diarisation, NER, regex extraction, LLM-RAG retrieval, and speaker embeddings into a rule-based risk scoring system. The framework generates synthetic agent-customer dialogue transcripts and two-speaker audio to address the scarcity of multimodal fraud datasets. Component-level evaluations show stability and transfer potential, offering a reproducible baseline for multimodal fraud detection research.

Multimodal Progress · Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection

3 · arXiv · cs.CL · 13h ago

DG^VoiC: Speaker clustering framework for fraud detection in call-centre audio

Researchers present DG^VoiC, a voice clustering framework designed to identify repeated speakers across anonymised call-centre recordings for insurance fraud investigation. The system combines anonymisation-aligned preprocessing, sliding-window speaker embeddings, and cosine similarity clustering, and is evaluated on 121 real telephony recordings. On a curated 56-sample reference set, the best configuration achieves 96% AMI, 95% ARI, and 100% homogeneity, suggesting that speaker identity is a viable but underutilised signal for fraud detection workflows.

Enterprise Deployment Patterns · DG^VoiC
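
The combination of sliding-window embeddings and cosine similarity clustering mentioned above can be sketched roughly as follows. This is a minimal illustration, not the DG^VoiC implementation: the toy vectors stand in for embeddings from a real speaker model, averaging the windows is one assumed pooling choice, and the single-link grouping and the `threshold` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(windows):
    """Pool per-window embeddings into one vector per recording."""
    n = len(windows)
    return [sum(col) / n for col in zip(*windows)]


def cluster_by_cosine(embeddings, threshold=0.8):
    """Single-link grouping: merge recordings whose similarity >= threshold.

    Returns one cluster label (the index of a representative) per recording.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(embeddings))]


# Toy per-window vectors (assumed); a real system would use a speaker encoder.
recordings = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # recording 0
    [[0.8, 0.2, 0.0]],                   # recording 1, similar voice
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],  # recording 2, different voice
]
pooled = [mean_vector(w) for w in recordings]
labels = cluster_by_cosine(pooled)
print(labels)
```

Recordings that share a label are candidates for the same speaker; the threshold trades off merging distinct speakers against splitting one speaker across clusters.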

5 · arXiv · cs.CL · 26d ago

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking · FastConformer-Large · Efficient ASR Training with Conversations that Never Happened · BEA-Dialogue

3 · arXiv · cs.CL · 24d ago

First Komi-Yazva–Russian parallel corpus and LLM translation evaluation protocol for endangered low-resource language

Researchers introduce the first Komi-Yazva–Russian parallel corpus of 457 aligned sentence pairs from 74 narrative texts, paired with a rigorous evaluation protocol for studying LLM translation under extreme data scarcity. The protocol includes story-level cross-validation, deterministic retrieval-based few-shot prompting, and both reference-based and judge-based metrics to ensure leakage-aware, reproducible evaluation. Results show LLMs produce non-trivial translations but performance varies strongly by model family; retrieval-based few-shot prompting consistently outperforms zero-shot, though gains plateau quickly. The work frames the corpus as both a dataset contribution and a reproducible testbed for endangered-language machine translation research.

Evaluation and Benchmarking · A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation · Komi-Yazva–Russian Parallel Corpus
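
A leakage-aware, deterministic few-shot selection of the kind described above might look like the sketch below. The Jaccard token-overlap retriever, the field names (`story_id`, `src`, `tgt`), and the placeholder sentences are assumptions for illustration; the summary does not specify the paper's retrieval method, and no real corpus data is shown.

```python
def jaccard(a, b):
    """Token-level Jaccard overlap between two whitespace-tokenised strings."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_shots(query, pool, k=2):
    """Pick k demonstrations for a query sentence.

    Examples from the query's own story are excluded (story-level hold-out),
    and ties are broken by story_id and source text so that the selection
    is fully deterministic.
    """
    candidates = [ex for ex in pool if ex["story_id"] != query["story_id"]]
    ranked = sorted(
        candidates,
        key=lambda ex: (-jaccard(query["src"], ex["src"]), ex["story_id"], ex["src"]),
    )
    return ranked[:k]


# Placeholder tokens only; these are not sentences from the corpus.
pool = [
    {"story_id": 1, "src": "a b c", "tgt": "A B C"},
    {"story_id": 1, "src": "a b d", "tgt": "A B D"},
    {"story_id": 2, "src": "a b e", "tgt": "A B E"},
    {"story_id": 3, "src": "x y z", "tgt": "X Y Z"},
    {"story_id": 3, "src": "a y z", "tgt": "A Y Z"},
]
query = {"story_id": 1, "src": "a b c"}

shots = select_shots(query, pool, k=2)
prompt = "\n".join(f"{ex['src']} => {ex['tgt']}" for ex in shots)
prompt += f"\n{query['src']} =>"
print(prompt)
```

Excluding same-story examples keeps near-duplicate sentences from the held-out narrative out of the prompt, which is the point of evaluating at the story level.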

3 · arXiv · cs.CL · 21d ago

Supervised vs. in-context learning for Turkish multiword expression classification

A new arXiv paper evaluates Turkish idiomatic light verb construction (LVC) detection as a binary classification task, comparing a supervised BERTurk baseline against three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting. Results show LLMs have very low LVC recall in zero-shot but improve substantially with demonstrations, though one-shot prompting can introduce strong model-specific biases. The supervised baseline remains competitive, while carefully constructed few-shot prompts allow GPT-OSS-20B and Qwen 2.5-14B to match or exceed it. The study highlights significant prompt sensitivity in Turkish metalinguistic classification tasks.

Evaluation and Benchmarking · Qwen2.5-7B · BERTurk · gpt-oss-20b

6 · arXiv · cs.CL · 26d ago

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking · AI Safety Research · Qwen3-4B · Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

4 · arXiv · cs.CL · 11d ago

IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages

Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.

Evaluation and Benchmarking · Multimodal Progress · IndicContextEval

4 · arXiv · cs.CL · 5d ago

ParaPairAudioBench: Pairwise benchmark reveals large gaps in LALM paralinguistic judgment

Researchers introduce ParaPairAudioBench, a pairwise audio benchmark of 5,175 audio pairs spanning five paralinguistic dimensions (Style, Rate, Emphasis, Age, Gender) designed to evaluate Large Audio-Language Models as judges. Experiments show current LALMs lag human judgment by 32 percentage points on average and exhibit severe calibration failures, especially in ambiguous 'Tie' cases. The benchmark includes same-transcript and cross-transcript conditions to disentangle lexical from acoustic reliance, enabling more rigorous assessment of LALM reliability for speech evaluation.

Evaluation and Benchmarking · Multimodal Progress · ParaPairAudioBench