EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

University of Virginia | ACL Findings 2026

Overview

Figure 1. Qwen3-series model performance on conversational diagnosis prediction for a cardiac arrest scenario.
Table 1. Comparison of EMSDialog with existing medical dialogue datasets.

Task Conversational diagnosis prediction aims to infer a patient's likely condition during a clinical conversation and to make early yet reliable diagnostic decisions for time-critical care. Unlike traditional EHR-based diagnosis prediction, this task requires reasoning over incomplete, streaming conversational evidence, updating the prediction turn by turn, and deciding when to commit to a diagnosis versus defer.
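As a concrete illustration of the commit-versus-defer decision, the sketch below replaces the diagnostic LLM with a toy keyword scorer. The `predict_with_confidence` function, the dialogue turns, and the 0.8 confidence threshold are all hypothetical stand-ins for illustration, not part of the EMSDialog pipeline:

```python
# Toy sketch of turn-by-turn conversational diagnosis prediction with a
# commit-vs-defer rule. The predictor, threshold, and dialogue are
# hypothetical placeholders, not the paper's actual models or data.

def predict_with_confidence(turns):
    """Toy stand-in for an LLM: guesses a diagnosis and a confidence
    score from the evidence seen so far (simple keyword counts)."""
    evidence = " ".join(turns).lower()
    scores = {
        "cardiac arrest": evidence.count("no pulse") + evidence.count("cpr"),
        "opioid overdose": evidence.count("pinpoint") + evidence.count("narcan"),
    }
    diagnosis = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[diagnosis] / total if total else 0.0
    return diagnosis, confidence

def run_dialogue(turns, commit_threshold=0.8):
    """Update the prediction after every turn; commit once confidence
    clears the threshold, otherwise defer and keep listening."""
    history = []
    for i, turn in enumerate(turns, start=1):
        history.append(turn)
        diagnosis, conf = predict_with_confidence(history)
        if conf >= commit_threshold:
            return i, diagnosis  # committed early
    return len(turns), diagnosis  # forced to commit at the final turn

turns = [
    "Dispatcher: unresponsive male, bystander on scene.",
    "EMT: patient has no pulse, starting CPR.",
    "Medic: continue CPR, attaching the monitor.",
]
committed_at, dx = run_dialogue(turns)  # commits at turn 2
```

A real system would replace the keyword scorer with an LLM's calibrated confidence; the loop structure, however, captures the streaming, turn-level nature of the task.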

Model Limitation As shown in Figure 1, off-the-shelf zero-shot LLMs, although a plausible choice as conversational diagnostic agents, often fail to produce reliable, stable predictions in dynamic, turn-by-turn conversational settings. In particular, they (i) make incorrect, highly confident guesses early, while evidence is still sparse, and (ii) exhibit high prediction volatility, frequently switching their outputs as new information arrives instead of converging to a consistent diagnosis.

Data Limitation Existing medical dialogue resources remain limited for this setting. As shown in Table 1, many online doctor–patient dialogue datasets are asynchronous and dyadic, making them ill-suited to modeling real-time operational medical workflows. EHR-grounded human role-play datasets improve case realism, but they typically still assume only two speakers and often lack task-aligned diagnosis annotations. Synthetic medical dialogue datasets generated by rules or large language models are scalable, but they often overlook realistic topic flow, multi-party coordination, and the rich structured annotations needed for downstream tasks such as conversational diagnosis prediction.
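The volatility failure mode in (ii) can be quantified with an edit-style metric that counts how often consecutive turn-level predictions differ. The function and prediction sequences below are illustrative assumptions, not the paper's exact metric definition:

```python
# Hypothetical volatility metric: count prediction switches across turns
# (an "edit overhead"-style measure). The prediction sequences are made
# up for illustration.

def prediction_switches(preds):
    """Number of times consecutive turn-level predictions differ."""
    return sum(a != b for a, b in zip(preds, preds[1:]))

# A stable predictor revises once and then converges...
stable = ["unknown", "cardiac arrest", "cardiac arrest", "cardiac arrest"]
# ...while a volatile one keeps flip-flopping as turns arrive.
volatile = ["stroke", "overdose", "cardiac arrest", "stroke", "cardiac arrest"]

assert prediction_switches(stable) == 1
assert prediction_switches(volatile) == 4
```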

Contributions

  1. We propose a scalable, EHR-grounded, multi-agent pipeline for synthetic multi-party dialogue generation, ensuring realism and factuality via independent rule-based concept and topic-flow checkers and an iterative critique-and-refine loop.
  2. We introduce EMSDialog, an EMS-specific synthetic dataset of 4,414 realistic multi-party conversations, generated from a real-world ePCR dataset and annotated with 43 diagnoses, turn-level speaker roles, and topics. Human expert and LLM-based evaluations show strong quality at both the utterance level (realism, safety, role accuracy, groundedness) and the conversation level (logical flow, factuality, diversity). Datasets and code will be publicly released upon publication.
  3. We demonstrate the downstream utility of EMSDialog by training models of different sizes for conversational diagnosis prediction and evaluating them on real-world EMS conversations. Experiments show that EMSDialog-augmented training improves prediction accuracy, timeliness, and stability, and that combining synthetic with real data yields the strongest overall performance.

Approach

Figure 2. EMSDialog synthetic dialogue generation pipeline. Starting from an input ePCR, the pipeline first extracts key medical concepts and structured clinical features. At the core of the framework is a set of independent Checkers that explicitly verify concept coverage, topic-flow validity, and dialogue style. These checkers provide structured feedback to three LLM agents: a Planner, which proposes a topic-guided dialogue plan grounded in ePCR evidence; a Generator, which produces an initial multi-party dialogue draft; and a Refiner, which improves coherence, realism, and natural phrasing while preserving factual grounding. The Planner and Generator are revised repeatedly until all hard constraints from the concept and topic-flow checkers are satisfied, and the Refiner then improves stylistic quality based on checker feedback. The checkers thus serve as the main mechanism for enforcing realism, logical structure, and factual consistency throughout the pipeline.
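The checker-driven critique-and-refine loop can be sketched as follows. Everything here is a simplified stand-in: in the actual pipeline the Generator is an LLM agent and the checkers are rule-based verifiers over extracted ePCR concepts; the concept set, generator behavior, and round limit below are assumptions for illustration only:

```python
# Minimal sketch of a checker-driven critique-and-refine loop. The
# checker rule, generator, and concepts are hypothetical stand-ins for
# the pipeline's LLM agents and rule-based checkers.

REQUIRED_CONCEPTS = {"chest pain", "aspirin"}  # hypothetical ePCR concepts

def concept_checker(dialogue):
    """Hard constraint: every extracted ePCR concept must appear.
    Returns the set of missing concepts (empty set == pass)."""
    text = dialogue.lower()
    return {c for c in REQUIRED_CONCEPTS if c not in text}

def generator(plan, feedback=None):
    """Toy Generator: drafts a dialogue and folds in checker feedback."""
    draft = "Medic: Can you describe the chest pain?"
    if feedback:  # a real LLM would revise based on the feedback text
        draft += " Medic: I'm giving you aspirin now."
    return draft

def generate_dialogue(plan, max_rounds=3):
    """Regenerate until all hard checker constraints are satisfied."""
    feedback = None
    for _ in range(max_rounds):
        draft = generator(plan, feedback)
        missing = concept_checker(draft)
        if not missing:
            return draft  # would then pass to the Refiner for style
        feedback = f"missing concepts: {sorted(missing)}"
    raise RuntimeError("hard constraints unsatisfied after max rounds")

dialogue = generate_dialogue(plan="chest-pain scenario")
```

The key design point mirrored here is that the checkers sit outside the generating agent: they return structured, machine-checkable feedback, so satisfying the hard constraints is verified independently rather than trusted to the LLM's self-assessment.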

Results

Table 2. Intrinsic evaluation: conversation- and utterance-level performance of synthetic dialogue generation methods. H*: human evaluation on a 43-scenario subset.
Table 3. Conversational diagnosis prediction results of Qwen3-4B and Qwen3-32B models.
Figure 3. Ablation results. (a–b) Downstream prediction performance: final accuracy and edit overhead. (c–d) Conversation- and utterance-level evaluation.
Table 4. Factual error source decomposition on 43 manually annotated scenarios.

Poster

BibTeX