4 · arXiv cs.CL (Computation and Language) · 14h ago

SIMAX framework generates annotated synthetic clinician-patient dialogues for AI communication coding evaluation

Researchers introduce SIMAX, a framework for generating controlled, annotated synthetic clinician-patient dialogues to support the development and evaluation of AI-driven clinical communication coding systems. The framework produces dialogues with reference behavioral annotations based on two codebooks (Global and WISER); the authors generated 3,388 simulated dialogues across three medical specialties with varied personas and accent conditions. Evaluation shows reasonable speech naturalness and high transcription fidelity, and downstream testing shows that the framework can expose sensitivity gaps in communication coding systems. The work addresses a data scarcity bottleneck in deploying ambient AI scribes in clinical settings.

Evaluation and Benchmarking SIMAX UTMOS WISER Codebook WV-MOS

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 4d ago

Conceptual framework for analyzing dialogue dynamics in human-AI and multi-agent collaborative problem-solving

A new arXiv preprint proposes a hierarchical two-layer coding scheme for analyzing dialogue in collaborative problem-solving, integrating cognitive and metacognitive dimensions. The framework is validated across nine datasets spanning multiple domains and is positioned to apply to both human-AI and multi-agent collaboration contexts. A key finding is that metacognitive regulation is a strong discriminator of deeper collaboration quality.

Evaluation and Benchmarking Agent and Tool Ecosystem Bridging Talk and Thought: Understanding Dialogue Dynamics Across Collaborative Problem-Solving Contexts

4 · arXiv · cs.CL · 14h ago

DialogPII: Multilingual synthetic dialog dataset for PII detection in conversational data

Researchers introduce DialogPII, a multilingual dataset of synthetic dialog transcripts designed to support development and evaluation of automatic de-identification systems. The dataset covers 8 interaction scenarios (including healthcare, emergency calls, and therapy sessions), 19 PII entity types, and 11 languages, with dialogs generated semi-automatically using LLMs, then manually curated and localized. Speech versions were produced via TTS, transcribed with Whisper, and annotated through automatic projection plus manual correction. Baseline multilingual NER models are released alongside the dataset.

Evaluation and Benchmarking AI Safety Research DialogPII Whisper

4 · arXiv · cs.CL · 38h ago

Multi-stage explainability framework translates transformer speech models into clinical cognitive impairment narratives

A new arXiv preprint proposes a framework for making transformer-based detection of cognitive impairment from speech clinically interpretable by combining SHAP token attribution, linguistic feature analysis, and a four-stage LLM reasoning pipeline using LLaMA-3.1-70B-Instruct. The system is built on the SpeechCARE-Adaptive Gating Network multimodal model (F1=72.11% on NIA PREPARE) and maps outputs to four cognitive-linguistic dimensions. Physician evaluation on 70 samples showed strong alignment with clinical profiles and a System Usability Scale score of 82/100, suggesting potential for practical integration into clinical workflows.

Evaluation and Benchmarking AI Safety Research NIA PREPARE Llama 3.3 70B Instruct SpeechCARE-Adaptive Gating Network

3 · arXiv · cs.CL · 38h ago

Multimodal NLP pipeline for insurance fraud detection at FNOL using synthetic dialogue and audio

A new arXiv preprint introduces a synthetic multimodal framework for insurance fraud detection at the First Notice of Loss (FNOL) stage, combining ASR, speaker diarisation, NER, regex extraction, LLM-RAG retrieval, and speaker embeddings into a rule-based risk scoring system. The framework generates synthetic agent-customer dialogue transcripts and two-speaker audio to address the scarcity of multimodal fraud datasets. Component-level evaluations show stability and transfer potential, offering a reproducible baseline for multimodal fraud detection research.

Multimodal Progress Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection

5 · arXiv · cs.CL · 27d ago

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking FastConformer-Large Efficient ASR Training with Conversations that Never Happened BEA-Dialogue

6 · arXiv · cs.AI · 20h ago

EMPATH: Multilingual multi-turn safety benchmark for emotional-support chatbots reveals score inflation and run-to-run reliability failures

EMPATH is a new arXiv benchmark for evaluating the safety of emotional-support chatbots, using an auditor model to generate multi-turn crisis conversations and a calibrated judge model to score transcripts across 19 metrics in five dimensions. Built for Mexican Spanish and US English, the benchmark surfaces score inflation on 10 of 19 metrics under uncalibrated rubrics and finds that run-to-run reliability is a per-model safety property: one model swings 2–10 points on a crisis metric across identical reruns, and DeepSeek V4 Pro produces different conversations at temperature 0. Evaluation of three frontier models shows aggregate scores within 0.74 points but per-metric divergences up to six points, with rankings stable across a cross-family judge at 93% within ±1.

Evaluation and Benchmarking AI Safety Research EMPATH DeepSeek V4

5 · arXiv · cs.CL · 13d ago

Study identifies 'synthetic lived experience paradox' in peer-like AI caregiver support

Researchers examine how LLMs prompted to sound peer-like generate language implying lived experience they cannot authentically possess, studying this in the context of family caregivers of Alzheimer's/ADRD patients. Using caregiver support exchanges from online communities and responses from LLaMA, GPT-4o-mini, and MedGemma, the study finds a 'narrative authenticity gap': AI captures the emotional work of peer support but can fabricate experiential grounding. Psycholinguistic analysis shows that human peers use significantly more first-person and past-focused language than AI. The authors argue that caregiver-support AI needs mechanisms to distinguish supportive framing from fabricated lived experience.

AI Safety Research Alignment and RLHF GPT-4o mini Google Llama

5 · arXiv · cs.CL · 18d ago

ArogyaSutra: Multi-agent framework for multimodal medical reasoning in Indic languages

Researchers introduce ArogyaSutra, an actor-critic-based multi-agent framework for multilingual multimodal medical reasoning targeting Indic languages, alongside ArogyaBodha, a large-scale dataset spanning 31 body systems, six imaging modalities, and 21 clinical domains across English and seven Indian languages. The framework integrates tool grounding with dual-memory mechanisms and uses actor-critic simulation trajectories for distillation. The work addresses a critical gap in AI healthcare access for low-resource, multilingual settings such as rural India, where English-centric MLLMs fall short.

Agent and Tool Ecosystem Multimodal Progress ArogyaSutra IIT Patna ArogyaBodha