Task
Conversational diagnosis prediction aims to infer a patient’s likely condition during clinical conversations and make early yet reliable diagnostic decisions for time-critical care. Unlike traditional EHR-based diagnosis prediction, this task requires reasoning over incomplete streaming conversational evidence, updating predictions turn by turn, and deciding when to commit versus defer.
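The turn-by-turn update and commit-versus-defer decision described above can be sketched as a simple loop. This is a minimal illustration, not the paper's method: `predict_diagnosis` is a hypothetical model call returning a label and a confidence score, and the commit threshold is an arbitrary illustrative value.

```python
# Minimal sketch of turn-by-turn conversational diagnosis prediction
# with a commit-vs-defer decision. `predict_diagnosis` is a hypothetical
# model interface; the threshold value is illustrative only.

from typing import Callable, List, Optional, Tuple

def run_conversation(
    turns: List[str],
    predict_diagnosis: Callable[[List[str]], Tuple[str, float]],
    commit_threshold: float = 0.9,
) -> Tuple[Optional[str], int]:
    """Update the prediction after each turn; commit once confidence
    clears the threshold, otherwise defer to the end of the dialogue."""
    history: List[str] = []
    for i, turn in enumerate(turns, start=1):
        history.append(turn)          # streaming evidence grows turn by turn
        label, confidence = predict_diagnosis(history)
        if confidence >= commit_threshold:
            return label, i           # early commit at turn i
    return None, len(turns)           # deferred: never confident enough
```

The return value pairs the committed diagnosis (or `None` if the model deferred) with the turn index at which the decision was made, which is what a timeliness metric would consume.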
Model Limitation
As shown in Figure 1, off-the-shelf zero-shot LLMs, although a plausible choice as conversational diagnostic agents, often fail to produce reliable, stable predictions in dynamic, turn-by-turn settings. In particular, they (i) commit early to incorrect yet highly confident guesses while evidence is still sparse, and (ii) exhibit high prediction volatility, frequently switching their outputs as new information arrives instead of converging to a consistent diagnosis.
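The two failure modes can be made concrete with simple per-conversation metrics. The function names, cutoff, and threshold below are our own illustrative choices, not quantities defined in the paper.

```python
# Illustrative metrics for the two failure modes: (i) premature confident
# errors and (ii) prediction volatility. Names and thresholds are
# assumptions for the sketch, not the paper's definitions.

from typing import List, Tuple

def switch_rate(predictions: List[str]) -> float:
    """Fraction of consecutive turns where the predicted label changes."""
    if len(predictions) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(predictions, predictions[1:]))
    return switches / (len(predictions) - 1)

def premature_confident_errors(
    preds: List[Tuple[str, float]],
    gold: str,
    early_cutoff: int = 3,
    conf_threshold: float = 0.9,
) -> int:
    """Count early turns with a highly confident but wrong prediction."""
    return sum(
        1 for label, conf in preds[:early_cutoff]
        if conf >= conf_threshold and label != gold
    )
```

A stable, well-calibrated agent should drive both numbers toward zero as the conversation progresses; the volatility described above shows up directly as a high `switch_rate`.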
Data Limitation
Existing medical dialogue resources remain limited for this setting. As shown in Table 1, many online doctor–patient dialogue datasets are asynchronous and dyadic, making them less suitable for modeling real-time operational medical workflows. EHR-grounded human role-play datasets improve case realism, but they still typically assume only two speakers and often do not provide task-aligned diagnosis annotations. Synthetic medical dialogue datasets generated by rules or large language models are scalable, but they often overlook realistic topic flow, multi-party coordination, and the rich structured annotations needed for downstream tasks such as conversational diagnosis prediction.
Contributions
- We propose a scalable, EHR-grounded, multi-agent pipeline for synthetic multi-party dialogue generation, ensuring realism and factuality via independent rule-based concept and topic-flow checkers and an iterative critique-and-refine loop.
- We introduce EMSDialog, an EMS-specific synthetic dataset of 4,414 realistic multi-party conversations, generated from a real-world ePCR dataset and annotated with 43 diagnoses, turn-level speaker roles, and topics. Human expert and LLM-based evaluations show strong quality at both the utterance level (realism, safety, role accuracy, groundedness) and the conversation level (logical flow, factuality, diversity). Datasets and code will be publicly released upon publication.
- We demonstrate the downstream utility of EMSDialog by training models of different sizes for conversational diagnosis prediction and evaluating them on real-world EMS conversations. Experiments show that EMSDialog-augmented training improves prediction accuracy, timeliness, and stability, and that combining synthetic with real data yields the strongest overall performance.