Almanac
4 · arXiv cs.CL (Computation and Language) · 15h ago

DialogPII: Multilingual synthetic dialog dataset for PII detection in conversational data

Researchers introduce DialogPII, a multilingual dataset of synthetic dialog transcripts designed to support development and evaluation of automatic de-identification systems. The dataset covers 8 interaction scenarios (including healthcare, emergency calls, and therapy sessions), 19 PII entity types, and 11 languages, with dialogs generated semi-automatically using LLMs, then manually curated and localized. Speech versions were produced via TTS, transcribed with Whisper, and annotated through automatic projection plus manual correction. Baseline multilingual NER models are released alongside the dataset.
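
The "automatic projection" step can be pictured as carrying PII labels from the original dialog text onto its ASR transcript. The summary does not say how DialogPII does this, so the following is only a minimal sketch under assumptions: it aligns tokens with Python's standard `difflib.SequenceMatcher`, and the function name `project_spans`, the half-open token-span format, and the example sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def project_spans(src_tokens, src_spans, tgt_tokens):
    """Project (start, end, label) token spans from src_tokens onto tgt_tokens.

    Spans are half-open token index ranges. Only source tokens that align
    exactly (case-insensitively) with a target token carry their label over.
    Spans with no aligned token are returned separately for manual review.
    """
    a = [t.lower() for t in src_tokens]
    b = [t.lower() for t in tgt_tokens]
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    # Map each aligned source token index to its target token index.
    src_to_tgt = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            src_to_tgt[block.a + k] = block.b + k

    projected, dropped = [], []
    for start, end, label in src_spans:
        hits = [src_to_tgt[i] for i in range(start, end) if i in src_to_tgt]
        if hits:
            projected.append((min(hits), max(hits) + 1, label))
        else:
            dropped.append((start, end, label))
    return projected, dropped


# Hypothetical example: the ASR output adds a filler word and misspells a surname.
source = "my name is Anna Kowalska and I live in Gdansk".split()
spans = [(3, 5, "PERSON"), (9, 10, "CITY")]
asr = "uh my name is anna kovalska and i live in gdansk".split()

projected, dropped = project_spans(source, spans, asr)
print(projected)  # [(4, 5, 'PERSON'), (10, 11, 'CITY')]
print(dropped)    # []
```

In this example the PERSON span shrinks to the first name because the misrecognized surname does not align, which is the kind of error a manual correction pass would then repair.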

Related events (8)

4 · arXiv · cs.CL · 15h ago

SIMAX framework generates annotated synthetic clinician-patient dialogues for AI communication coding evaluation

Researchers introduce SIMAX, a framework for generating controlled, annotated synthetic clinician-patient dialogues to support development and evaluation of AI-driven clinical communication coding systems. The framework produces dialogues with reference behavioral annotations using two codebooks (Global and WISER), generating 3,388 simulated dialogues across three medical specialties with varied personas and accent conditions. Evaluation shows reasonable speech naturalness and high transcription fidelity, with downstream testing revealing the framework can expose sensitivity gaps in communication coding systems. The work addresses a data scarcity bottleneck in deploying ambient AI scribes in clinical settings.

3 · arXiv · cs.CL · 39h ago

Multimodal NLP pipeline for insurance fraud detection at FNOL using synthetic dialogue and audio

A new arXiv preprint introduces a synthetic multimodal framework for insurance fraud detection at the First Notice of Loss (FNOL) stage, combining ASR, speaker diarisation, NER, regex extraction, LLM-RAG retrieval, and speaker embeddings into a rule-based risk scoring system. The framework generates synthetic agent-customer dialogue transcripts and two-speaker audio to address the scarcity of multimodal fraud datasets. Component-level evaluations show stability and transfer potential, offering a reproducible baseline for multimodal fraud detection research.

5 · Hugging Face Blog · 1mo ago

Experimenting with Automatic PII Detection on the Hub using Presidio

Hugging Face describes an experiment integrating Microsoft's Presidio library for automatic personally identifiable information (PII) detection across datasets hosted on the Hub. The effort aims to flag or redact sensitive data before it can be used in model training pipelines. This represents a practical infrastructure-level approach to data governance and privacy compliance for open ML datasets.
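
To make "flag or redact" concrete, here is a minimal sketch that uses plain regular expressions as a stand-in. It is not Presidio's API and not the Hub's implementation. The two patterns, the label names, and the functions `find_pii` and `redact` are hypothetical and far less thorough than a real detector.

```python
import re

# Hypothetical, deliberately simple patterns. Real detectors cover many more
# entity types and use context, checksums, and NER models.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_pii(text):
    """Return sorted (start, end, label) character spans for every pattern match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)


def redact(text):
    """Replace each detected span with a <LABEL> placeholder.

    Spans are replaced from right to left so earlier offsets stay valid.
    Overlapping matches are not handled in this sketch.
    """
    out = text
    for start, end, label in sorted(find_pii(text), reverse=True):
        out = out[:start] + f"<{label}>" + out[end:]
    return out


sample = "Contact jane.doe@example.com or +1 555 010 9999."
print(find_pii(sample))  # [(8, 28, 'EMAIL'), (32, 47, 'PHONE')]
print(redact(sample))    # Contact <EMAIL> or <PHONE>.
```

Flagging corresponds to reporting the spans from `find_pii`, while redaction rewrites the text before it reaches a training pipeline.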

5 · arXiv · cs.CL · 27d ago

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

3 · arXiv · cs.CL · 6d ago

First Turkish phone scam detection dataset evaluated across seven LLMs in multi-modal settings

Researchers introduce the first public multi-modal dataset of 100 aligned audio-transcript pairs of Turkish scam and benign phone calls, evaluating seven LLMs (Gemini 2.5 Flash/Flash-Lite/Pro, GPT-4o, Qwen Max/Plus/Turbo) under three input conditions. Transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. The work addresses a gap in low-resource language safety research and highlights the need for linguistically inclusive fraud detection systems.

4 · Hugging Face Blog · 1mo ago

Nemotron-Personas-India: Synthesized Data for Sovereign AI

NVIDIA and Hugging Face have released Nemotron-Personas-India, a synthetic dataset designed to support sovereign AI development in India. The dataset consists of synthesized persona data intended to improve AI model performance for Indian languages, cultures, and contexts. This release reflects growing interest in localized, culturally-grounded training data as a foundation for regional AI sovereignty initiatives.

4 · Hugging Face Blog · 1mo ago

Nemotron-Personas-Japan: Synthetic Dataset for Sovereign AI

NVIDIA has released Nemotron-Personas-Japan, a synthetic dataset hosted on Hugging Face designed to support sovereign AI development in Japan. The dataset appears to consist of persona-based synthetic data in Japanese, likely intended for fine-tuning or alignment of Japanese-language models. This release is part of NVIDIA's broader Nemotron data and model family, extending it to non-English sovereign AI use cases.

3 · arXiv · cs.CL · 5d ago

End-to-end speech-to-speech conversational system for Algerian Dialect using modular NLP pipeline

Researchers present Dziri Voicebot, a modular speech-to-speech conversational system targeting Algerian Dialect, a low-resource language with code-switching and orthographic challenges. The pipeline integrates Whisper-based ASR, transformer NLU, retrieval-augmented generation, and neural TTS, with dedicated datasets constructed for the telecom domain. The system reports low word error rate, high intent classification scores, and stable TTS quality, offering a reproducible baseline for low-resource dialectal conversational AI.