
DialogPII

dataset · active · provisional · dialogpii-ac6df75a · 1 event · first seen 14h ago

Aliases: DialogPII

Recent events (1)

arXiv · cs.CL · 14h ago

DialogPII: Multilingual synthetic dialog dataset for PII detection in conversational data

Researchers introduce DialogPII, a multilingual dataset of synthetic dialog transcripts designed to support development and evaluation of automatic de-identification systems. The dataset covers 8 interaction scenarios (including healthcare, emergency calls, and therapy sessions), 19 PII entity types, and 11 languages, with dialogs generated semi-automatically using LLMs, then manually curated and localized. Speech versions were produced via TTS, transcribed with Whisper, and annotated through automatic projection plus manual correction. Baseline multilingual NER models are released alongside the dataset.