dataset
DialogPII
DialogPII: Multilingual synthetic dialog dataset for PII detection in conversational data
Researchers introduce DialogPII, a multilingual dataset of synthetic dialog transcripts designed to support the development and evaluation of automatic de-identification systems. The dataset covers 8 interaction scenarios (including healthcare, emergency calls, and therapy sessions), 19 types of personally identifiable information (PII), and 11 languages. Dialogs were generated semi-automatically using LLMs, then manually curated and localized. Speech versions were produced via text-to-speech (TTS), transcribed with Whisper, and annotated through automatic projection plus manual correction. Baseline multilingual named-entity recognition (NER) models are released alongside the dataset.
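
The summary does not say how the automatic projection step works. As a rough illustration only, one common approach is to align the tokens of the original annotated text with the tokens of the speech transcript and copy labels across the aligned positions, leaving everything else for manual correction. The sketch below assumes token-level labels and uses Python's standard `difflib` for the alignment; the function name `project_labels` and the label names `NAME` and `CITY` are hypothetical and are not taken from the dataset.

```python
from difflib import SequenceMatcher


def project_labels(src_tokens, src_labels, asr_tokens):
    """Copy per-token labels from annotated source text onto a transcript.

    Hypothetical helper: an assumed alignment-based projection, not the
    method used by the DialogPII authors.
    """
    def norm(tokens):
        # Compare case-insensitively and ignore trailing punctuation,
        # since transcripts often differ in casing and punctuation.
        return [t.lower().strip(".,!?") for t in tokens]

    matcher = SequenceMatcher(a=norm(src_tokens), b=norm(asr_tokens), autojunk=False)
    projected = ["O"] * len(asr_tokens)  # "O" marks a non-PII token

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Copy labels for identical runs, and for substituted runs of equal
        # length (e.g. a misrecognized name). Insertions, deletions, and
        # uneven substitutions stay "O" and need manual review.
        if tag == "equal" or (tag == "replace" and same_length):
            for k in range(i2 - i1):
                projected[j1 + k] = src_labels[i1 + k]
    return projected


# Illustrative input: the transcript misspells the first name.
source = "My name is Anna Kowalski and I live in Gdansk".split()
labels = ["O", "O", "O", "NAME", "NAME", "O", "O", "O", "O", "CITY"]
transcript = "my name is Ana Kowalski and i live in Gdansk".split()
print(project_labels(source, labels, transcript))
```

A token-level alignment like this is simple but brittle when the transcript splits or merges words, which is one reason a manual correction pass, as the summary describes, remains necessary.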