EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

University of Virginia | ACL Findings 2026

Overview

Figure 1. Qwen3-series model performance on conversational diagnosis prediction for a cardiac arrest scenario.
Table 1. Comparison of EMSDialog with existing medical dialogue datasets.

Task Conversational diagnosis prediction aims to infer a patient's likely condition during a clinical conversation and to make early yet reliable diagnostic decisions for time-critical care. Unlike traditional EHR-based diagnosis prediction, this task requires reasoning over incomplete, streaming conversational evidence, updating the prediction turn by turn, and deciding when to commit to a diagnosis versus defer.
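As a concrete illustration of the commit-versus-defer decision, the sketch below replaces the diagnostic LLM with a toy keyword scorer. The `predict_with_confidence` function, the dialogue turns, and the 0.8 confidence threshold are all hypothetical stand-ins for illustration, not part of the EMSDialog pipeline:

```python
# Toy sketch of turn-by-turn conversational diagnosis prediction with a
# commit-vs-defer rule. The predictor, threshold, and dialogue are
# hypothetical placeholders, not the paper's actual models or data.

def predict_with_confidence(turns):
    """Toy stand-in for an LLM: guesses a diagnosis and a confidence
    score from the evidence seen so far (simple keyword counts)."""
    evidence = " ".join(turns).lower()
    scores = {
        "cardiac arrest": evidence.count("no pulse") + evidence.count("cpr"),
        "opioid overdose": evidence.count("pinpoint") + evidence.count("narcan"),
    }
    diagnosis = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[diagnosis] / total if total else 0.0
    return diagnosis, confidence

def run_dialogue(turns, commit_threshold=0.8):
    """Update the prediction after every turn; commit once confidence
    clears the threshold, otherwise defer and keep listening."""
    history = []
    for i, turn in enumerate(turns, start=1):
        history.append(turn)
        diagnosis, conf = predict_with_confidence(history)
        if conf >= commit_threshold:
            return i, diagnosis  # committed early
    return len(turns), diagnosis  # forced to commit at the final turn

turns = [
    "Dispatcher: unresponsive male, bystander on scene.",
    "EMT: patient has no pulse, starting CPR.",
    "Medic: continue CPR, attaching the monitor.",
]
committed_at, dx = run_dialogue(turns)  # commits at turn 2
```

A real system would replace the keyword scorer with an LLM's calibrated confidence; the loop structure, however, captures the streaming, turn-level nature of the task.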

Model Limitation As shown in Figure 1, off-the-shelf zero-shot LLMs, although a plausible choice as conversational diagnostic agents, often fail to produce reliable, stable predictions in dynamic, turn-by-turn conversational settings. In particular, they (i) make incorrect, highly confident guesses early, while evidence is still sparse, and (ii) exhibit high prediction volatility, frequently switching their outputs as new information arrives instead of converging to a consistent diagnosis.

Data Limitation Existing medical dialogue resources remain limited for this setting. As shown in Table 1, many online doctor–patient dialogue datasets are asynchronous and dyadic, making them ill-suited to modeling real-time operational medical workflows. EHR-grounded human role-play datasets improve case realism, but they typically still assume only two speakers and often lack task-aligned diagnosis annotations. Synthetic medical dialogue datasets generated by rules or large language models are scalable, but they often overlook realistic topic flow, multi-party coordination, and the rich structured annotations needed for downstream tasks such as conversational diagnosis prediction.
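The volatility failure mode in (ii) can be quantified with an edit-style metric that counts how often consecutive turn-level predictions differ. The function and prediction sequences below are illustrative assumptions, not the paper's exact metric definition:

```python
# Hypothetical volatility metric: count prediction switches across turns
# (an "edit overhead"-style measure). The prediction sequences are made
# up for illustration.

def prediction_switches(preds):
    """Number of times consecutive turn-level predictions differ."""
    return sum(a != b for a, b in zip(preds, preds[1:]))

# A stable predictor revises once and then converges...
stable = ["unknown", "cardiac arrest", "cardiac arrest", "cardiac arrest"]
# ...while a volatile one keeps flip-flopping as turns arrive.
volatile = ["stroke", "overdose", "cardiac arrest", "stroke", "cardiac arrest"]

assert prediction_switches(stable) == 1
assert prediction_switches(volatile) == 4
```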

Contributions

  1. We propose a scalable, EHR-grounded, multi-agent pipeline for synthetic multi-party dialogue generation, ensuring realism and factuality via independent rule-based concept and topic-flow checkers and an iterative critique-and-refine loop.
  2. We introduce EMSDialog, an EMS-specific synthetic dataset of 4,414 realistic multi-party conversations, generated from a real-world ePCR dataset and annotated with 43 diagnoses, turn-level speaker roles, and topics. Human expert and LLM-based evaluations show strong quality at both the utterance level (realism, safety, role accuracy, groundedness) and the conversation level (logical flow, factuality, diversity). Datasets and code will be publicly released upon publication.
  3. We demonstrate the downstream utility of EMSDialog by training models of different sizes for conversational diagnosis prediction and evaluating them on real-world EMS conversations. Experiments show that EMSDialog-augmented training improves prediction accuracy, timeliness, and stability, and that combining synthetic with real data yields the strongest overall performance.

Approach

Figure 2. EMSDialog synthetic dialogue generation pipeline. Starting from an input ePCR, the pipeline first extracts key medical concepts and structured clinical features. At the core of the framework is a set of independent Checkers that explicitly verify concept coverage, topic-flow validity, and dialogue style. These checkers provide structured feedback to three LLM agents: a Planner, which proposes a topic-guided dialogue plan grounded in ePCR evidence; a Generator, which produces an initial multi-party dialogue draft; and a Refiner, which improves coherence, realism, and natural phrasing while preserving factual grounding. The Planner and Generator are revised repeatedly until all hard constraints from the concept and topic-flow checkers are satisfied, and the Refiner then improves stylistic quality based on checker feedback. The checkers thus serve as the main mechanism for enforcing realism, logical structure, and factual consistency throughout the pipeline.
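The checker-driven critique-and-refine loop can be sketched as follows. Everything here is a simplified stand-in: in the actual pipeline the Generator is an LLM agent and the checkers are rule-based verifiers over extracted ePCR concepts; the concept set, generator behavior, and round limit below are assumptions for illustration only:

```python
# Minimal sketch of a checker-driven critique-and-refine loop. The
# checker rule, generator, and concepts are hypothetical stand-ins for
# the pipeline's LLM agents and rule-based checkers.

REQUIRED_CONCEPTS = {"chest pain", "aspirin"}  # hypothetical ePCR concepts

def concept_checker(dialogue):
    """Hard constraint: every extracted ePCR concept must appear.
    Returns the set of missing concepts (empty set == pass)."""
    text = dialogue.lower()
    return {c for c in REQUIRED_CONCEPTS if c not in text}

def generator(plan, feedback=None):
    """Toy Generator: drafts a dialogue and folds in checker feedback."""
    draft = "Medic: Can you describe the chest pain?"
    if feedback:  # a real LLM would revise based on the feedback text
        draft += " Medic: I'm giving you aspirin now."
    return draft

def generate_dialogue(plan, max_rounds=3):
    """Regenerate until all hard checker constraints are satisfied."""
    feedback = None
    for _ in range(max_rounds):
        draft = generator(plan, feedback)
        missing = concept_checker(draft)
        if not missing:
            return draft  # would then pass to the Refiner for style
        feedback = f"missing concepts: {sorted(missing)}"
    raise RuntimeError("hard constraints unsatisfied after max rounds")

dialogue = generate_dialogue(plan="chest-pain scenario")
```

The key design point mirrored here is that the checkers sit outside the generating agent: they return structured, machine-checkable feedback, so satisfying the hard constraints is verified independently rather than trusted to the LLM's self-assessment.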

Results

Table 2. Intrinsic evaluation: conversation- and utterance-level performance of synthetic dialogue generation methods. H*: human evaluation on a 43-scenario subset.
Table 3. Conversational diagnosis prediction results of Qwen3-4B and Qwen3-32B models.
Figure 3. Ablation results. (a–b) Downstream prediction performance: final accuracy and edit overhead. (c–d) Conversation- and utterance-level evaluation.
Table 4. Factual error source decomposition on 43 manually annotated scenarios.

Poster

BibTeX