GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

Abstract

While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images, and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework that extracts complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction data generation that produces high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show that GEM significantly improves predictive performance (CSN +7.4%↑), explainability (+22.7%↑), and grounding (+25.3%↑), making it a promising approach for real-world clinical applications.

Figure 1. GEM offers superior granularity in ECG interpretation compared to state-of-the-art models and human-written reports. GEM's core capabilities: 1. Feature-Grounded Analysis: findings are precisely linked to measurable ECG features (e.g., QRS/RR intervals). 2. Evidence-Driven Diagnosis: conclusions are supported by clear and logical reasoning directly linked to ECG findings. 3. Realistic Interpretation Process: mimics how a clinician analyzes ECGs and arrives at a diagnosis.

Model: GEM's Architecture

Figure 2. GEM's Architecture. Multimodal Encoding: Separate encoders process ECG time series and images to generate modality-specific representations, enabling a holistic analysis of ECG data. Cross-modal Alignment Learning: Time series and image representations are first aligned and then mapped to a textual space using a shared projector, ensuring coherent understanding for the LLM. Knowledge Guided Instruction Data Generation: Physiological features extracted from all 12 leads are sequenced and structured using a diagnosis guider, which prompts GPT-4o with domain-specific instructions to generate high-granularity instructional data. GEM is the first multimodal framework to synergistically integrate raw ECG time series signals, 12-lead plots, and textual instructions, leveraging their complementary strengths to advance grounded ECG understanding.
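
The encoding and alignment steps in Figure 2 can be pictured with a small shape-level sketch. This is not GEM's implementation: the encoders are replaced by random feature matrices, every dimension is a hypothetical value chosen for illustration, and the per-modality linear maps are only a stand-in for whatever alignment objective the model actually uses. The sketch shows one reading of "first aligned, then mapped to a textual space using a shared projector".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_TS_TOKENS, D_TS = 8, 32      # time series encoder output
N_IMG_TOKENS, D_IMG = 16, 48   # image encoder output
D_SHARED, D_LLM = 24, 64       # aligned width, LLM embedding width


def linear(d_in, d_out):
    """Random weight matrix standing in for a trained linear layer."""
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)


# Stand-ins for the outputs of the two modality-specific encoders.
ts_feats = rng.standard_normal((N_TS_TOKENS, D_TS))
img_feats = rng.standard_normal((N_IMG_TOKENS, D_IMG))

# Step 1 (assumed form of alignment): bring both modalities to a common width.
align_ts = linear(D_TS, D_SHARED)
align_img = linear(D_IMG, D_SHARED)
aligned = np.concatenate([ts_feats @ align_ts, img_feats @ align_img], axis=0)

# Step 2: a single shared projector maps every aligned token to the LLM space.
shared_projector = linear(D_SHARED, D_LLM)
ecg_tokens = aligned @ shared_projector

print(ecg_tokens.shape)  # (24, 64): 8 + 16 tokens, each of width 64
```

The point of the shared projector in this reading is that time series and image tokens reach the LLM through the same mapping, so neither modality gets its own private route into the textual space.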

Dataset: ECG-Grounding

Figure 3. Comparison of ECG-Instruct and our ECG-Grounding. ECG-Grounding provides more accurate, holistic, and evidence-driven interpretations with diagnoses grounded in measurable ECG features. Currently, it contains 30,000 instruction pairs annotated with heartbeat-level physiological features. This is the first high-granularity ECG grounding dataset, enabling evidence-based diagnosis and improving the trustworthiness of medical AI. We will continue to release more ECG-Grounding data and associated beat-level features progressively.

Benchmark: Grounded ECG Understanding

To evaluate whether a model achieves clinically grounded ECG interpretation comparable to that of practicing cardiologists, we introduce the Grounded ECG Understanding benchmark. This clinically motivated benchmark evaluates the MLLM's ability to identify detailed clues in ECG analysis, requiring it to provide specific details and relevant domain knowledge to support its interpretation. The benchmark covers the following aspects of ECG interpretation:

Criterion	Description
DiagnosisAccuracy	Evaluates whether the generated diagnosis is correct, specific, and supported by ECG findings. Results are expressed as a percentage, indicating the average accuracy across identified key diagnoses.
AnalysisCompleteness	Checks if all key ECG components (e.g., rhythm, intervals, waveforms, lead-specific findings) are discussed. Results are provided in absolute terms, indicating the average number of correctly addressed key ECG features for each sample.
AnalysisRelevance	Assesses whether each explanation directly supports the diagnosis, with results showing on average how many points support the diagnosis with clear ECG evidence for each sample.
LeadAssessmentCoverage	Evaluates how many of the 12 ECG leads are analyzed. Results indicate the average number of leads analyzed per sample, providing insight into the comprehensiveness of the ECG review.
LeadAssessmentAccuracy	Verifies the accuracy of described lead findings (e.g., QRS complexes, ST segments, T waves, amplitudes, intervals) against the ground truth interpretation.
ECGFeatureGrounding	Determines if the interpretation references actual ECG features (e.g., QRS amplitude, PR interval) instead of generic terms. Results are scaled from 0 to 100.
EvidenceBasedReasoning	Evaluates whether the diagnosis follows logical, evidence-supported steps. Results range from 0 to 100.
ClinicalDiagnosticFidelity	Assesses if the model mimics how a clinician interprets ECG data, considering all relevant factors. Results are scaled from 0 to 100.
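
Since most criteria above are reported as per-sample averages, the aggregation step can be sketched as follows. The `SampleScores` container, the `aggregate` helper, and the example values are all hypothetical and exist only to illustrate the averaging; how each per-sample score is produced is outside the scope of this sketch.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass
class SampleScores:
    """Hypothetical per-sample scores, one field per benchmark criterion."""
    diagnosis_accuracy: float            # percentage
    analysis_completeness: float         # count of key ECG features addressed
    analysis_relevance: float            # count of points supporting the diagnosis
    lead_assessment_coverage: float      # leads analyzed
    lead_assessment_accuracy: float      # accuracy of described lead findings
    ecg_feature_grounding: float         # 0 to 100
    evidence_based_reasoning: float      # 0 to 100
    clinical_diagnostic_fidelity: float  # 0 to 100


def aggregate(samples):
    """Average every criterion over a non-empty list of SampleScores."""
    if not samples:
        raise ValueError("need at least one scored sample")
    return {
        f.name: mean(getattr(s, f.name) for s in samples)
        for f in fields(SampleScores)
    }


# Made-up scores for two samples, purely for illustration.
scored = [
    SampleScores(100.0, 4, 5, 12, 50.0, 80.0, 70.0, 90.0),
    SampleScores(50.0, 2, 3, 6, 30.0, 60.0, 80.0, 70.0),
]
report = aggregate(scored)
print(report["diagnosis_accuracy"], report["analysis_completeness"])
```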

Results Overview

Task 1: Grounded ECG Understanding

Dataset / Models	Diagnosis Accuracy	Analysis Completeness	Analysis Relevance	Lead Assessment Coverage	Lead Assessment Accuracy	ECG Feature Grounding	Evidence-Based Reasoning	Clinical Diagnostic Fidelity
MIMIC-IV-ECG (in-domain)
PULSE	81.14	2.37	2.39	7.11	2.95	50.18	52.40	51.63
GEM SFT LLaVA	87.24	4.41	5.01	71.07	46.44	75.48	75.09	75.28
GEM SFT PULSE	86.49	4.43	4.91	69.80	45.33	74.95	74.70	74.87
PTB-XL (out-of-domain)
PULSE	59.24	2.20	2.06	11.20	6.27	52.52	55.48	53.85
GEM SFT LLaVA	73.53	4.19	2.96	79.54	49.01	74.48	74.61	73.84
GEM SFT PULSE	73.59	4.19	3.00	78.86	47.96	74.97	75.41	74.24

Table 1: Grounded ECG Understanding results on MIMIC-IV-ECG and PTB-XL.
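
The explainability (+22.7) and grounding (+25.3) gains quoted in the abstract appear to match the absolute point differences between GEM SFT LLaVA and PULSE on the MIMIC-IV-ECG rows of Table 1, in the Evidence-Based Reasoning and ECG Feature Grounding columns respectively. That mapping is an inference from the numbers, not something the page states. A short check, with the rows copied from Table 1:

```python
COLUMNS = [
    "Diagnosis Accuracy",
    "Analysis Completeness",
    "Analysis Relevance",
    "Lead Assessment Coverage",
    "Lead Assessment Accuracy",
    "ECG Feature Grounding",
    "Evidence-Based Reasoning",
    "Clinical Diagnostic Fidelity",
]

# MIMIC-IV-ECG (in-domain) rows from Table 1.
pulse = [81.14, 2.37, 2.39, 7.11, 2.95, 50.18, 52.40, 51.63]
gem_llava = [87.24, 4.41, 5.01, 71.07, 46.44, 75.48, 75.09, 75.28]

# Absolute difference per column (GEM minus PULSE).
deltas = {name: g - p for name, p, g in zip(COLUMNS, pulse, gem_llava)}

for name in ("ECG Feature Grounding", "Evidence-Based Reasoning"):
    print(f"{name}: +{deltas[name]:.1f}")
```

Read this way, the gains are differences in score points on a 0 to 100 scale rather than relative percentage changes.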

Task 2: ECG-Bench (Abnormality Detection)

Models	PTB-XL Super AUC	PTB-XL Super F1	PTB-XL Super HL	CODE-15% AUC	CODE-15% F1	CODE-15% HL	CPSC 2018 AUC	CPSC 2018 F1	CPSC 2018 HL	CSN Accuracy	G12EC Accuracy
Random	50.3	33.2	50.1	48.8	15.0	32.1	51.2	15.1	28.8	11.6	12.1
GPT-4o	55.6	28.3	26.2	59.9	24.9	15.7	50.9	10.6	18.2	57.5	49.2
PULSE	82.4	74.8	11.0	90.7	85.4	5.0	76.9	57.6	8.6	85.2	78.2
GEM SFT LLaVA	81.8	73.6	11.6	90.5	84.8	5.1	74.1	52.0	9.0	92.6	81.8
GEM SFT PULSE	83.4	75.8	11.0	91.5	86.4	4.7	79.1	61.1	8.1	86.2	80.5
Ablations
GEM TS only	81.2	72.5	11.9	90.8	84.9	5.0	76.3	54.0	8.5	91.6	81.4
GEM TS+IMG	82.7	74.8	11.1	91.3	86.3	4.6	74.4	51.5	8.8	90.1	81.1

Table 2: ECG-Bench abnormality detection results.
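
Unlike AUC and F1, the HL columns in Table 2 are better when lower. Assuming HL denotes Hamming loss, the usual multi-label metric reported as a percentage, it is the fraction of individual label decisions that are wrong. The helper and the toy labels below are illustrative only and are not taken from the benchmark.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots where prediction and ground truth disagree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of samples")
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each sample needs the same number of labels")
        wrong += sum(t != p for t, p in zip(true_row, pred_row))
        total += len(true_row)
    if total == 0:
        raise ValueError("no labels to score")
    return wrong / total


# Toy example: 2 ECGs, 4 candidate abnormality labels each.
y_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
y_pred = [[1, 0, 1, 1], [0, 0, 0, 0]]

loss = hamming_loss(y_true, y_pred)
print(f"{100 * loss:.1f}")  # 2 wrong out of 8 label slots -> 25.0
```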

Task 3: ECG-Bench (Report Generation & ECG-QA)

Models	PTB-XL Report Score	ECG-QA Accuracy
Random	0	16.2
GPT-4o	50.2	35.2
PULSE	61.3	73.8
GEM SFT LLaVA	65.0	71.0
GEM SFT PULSE	67.1	73.6

Table 3: ECG-Bench report generation and QA results.

Demonstrations

Case 1: Bradycardia

Case 2: Atrial Fibrillation

Case 3: Tachycardia

BibTeX

@misc{lan2025gemempoweringmllmgrounded,
      title={GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images}, 
      author={Xiang Lan and Feng Wu and Kai He and Qinghao Zhao and Shenda Hong and Mengling Feng},
      year={2025},
      eprint={2503.06073},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.06073}, 
}