GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

1 Saw Swee Hock School of Public Health and Institute of Data Science, National University of Singapore
2 Department of Cardiology, Peking University People's Hospital
3 National Institute of Health Data Science, Peking University
4 Institute for Artificial Intelligence, Peking University

Abstract

While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR Intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN +7.4%↑), explainability (+22.7%↑), and grounding (+24.8%↑), making it more suitable for real-world clinical applications.

Figure 1. GEM offers superior granularity in ECG interpretation compared to state-of-the-art models and human-written reports. GEM's core capabilities: 1. Feature-Grounded Analysis: findings are precisely linked to measurable ECG features (e.g., QRS/RR intervals). 2. Evidence-Driven Diagnosis: conclusions are supported by clear and logical reasoning directly linked to ECG findings. 3. Realistic Interpretation Process: mimics how a clinician analyzes ECGs and arrives at a diagnosis.

Model: GEM's Architecture

Figure 2. GEM's Architecture. Multimodal Encoding: Separate encoders process ECG time series and images to generate modality-specific representations, enabling a holistic analysis of ECG data. Cross-modal Alignment Learning: Time series and image representations are first aligned and then mapped to a textual space using a shared projector, ensuring coherent understanding for the LLM. Knowledge Guided Instruction Generation: Physiological features extracted from all 12 leads are sequenced and structured using a diagnosis guider, which prompts GPT-4o with domain-specific instructions to generate high-granularity instructional data. GEM is the first multimodal framework to synergistically integrate raw ECG time series signals, 12-lead plots, and textual instructions, leveraging their complementary strengths to advance grounded ECG understanding.
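The knowledge-guided instruction generation step can be illustrated with a small sketch. It only shows the general idea of sequencing per-lead features in a fixed order and wrapping them in a domain-specific instruction. The function name `build_guider_prompt`, the feature names (`pr_interval_ms`, `qrs_duration_ms`), and the instruction wording are all hypothetical; this is not the actual diagnosis guider or the prompt sent to GPT-4o.

```python
from typing import Dict, List

# Conventional 12-lead order. The feature names used below are illustrative
# assumptions, not the feature set used by GEM.
LEADS: List[str] = ["I", "II", "III", "aVR", "aVL", "aVF",
                    "V1", "V2", "V3", "V4", "V5", "V6"]


def build_guider_prompt(features: Dict[str, Dict[str, float]]) -> str:
    """Serialize per-lead features in a fixed lead order and prepend a
    (hypothetical) domain-specific instruction for a language model."""
    lines = []
    for lead in LEADS:
        if lead not in features:
            continue  # skip leads with no extracted features
        parts = [f"{name}={value:g}"
                 for name, value in sorted(features[lead].items())]
        lines.append(f"Lead {lead}: " + ", ".join(parts))
    header = ("Using only the measurements below, write an ECG interpretation "
              "that cites the specific features supporting each finding.")
    return header + "\n" + "\n".join(lines)


example = {
    "V1": {"qrs_duration_ms": 95},
    "II": {"pr_interval_ms": 160, "qrs_duration_ms": 92},
}
prompt = build_guider_prompt(example)
print(prompt)
```

Fixing the lead order and sorting feature names keeps the prompt deterministic for a given ECG, which makes generated instruction data easier to reproduce and audit.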

Dataset: ECG-Grounding

Figure 3. Comparison of ECG-Instruct and our ECG-Grounding. ECG-Grounding provides more accurate, holistic, and evidence-driven interpretations with diagnoses grounded in measurable ECG features. Currently, it contains 30,000 instruction pairs annotated with heartbeat-level physiological features. This is the first high-granularity ECG grounding dataset, enabling evidence-based diagnosis and improving the trustworthiness of medical AI. We will continue to release more ECG-Grounding data and associated beat-level features progressively.

Benchmark: Grounded ECG Understanding

To evaluate whether a model achieves clinically grounded ECG interpretation comparable to that of cardiologists, we introduce the Grounded ECG Understanding benchmark. This clinically motivated benchmark evaluates the MLLM's ability to identify detailed clues in ECG analysis, requiring it to provide specific details and relevant domain knowledge to support its interpretation. The benchmark evaluates the following aspects of ECG interpretation:

| Criterion | Description |
| --- | --- |
| Diagnosis Accuracy | Evaluates whether the generated diagnosis is correct, specific, and supported by ECG findings. Results are expressed as a percentage, indicating the average accuracy across identified key diagnoses. |
| Analysis Completeness | Checks if all key ECG components (e.g., rhythm, intervals, waveforms, lead-specific findings) are discussed. Results are provided in absolute terms, indicating the average number of correctly addressed key ECG features for each sample. |
| Analysis Relevance | Assesses whether each explanation directly supports the diagnosis, with results showing on average how many points support the diagnosis with clear ECG evidence for each sample. |
| Lead Assessment Coverage | Evaluates how many of the 12 ECG leads are analyzed. Results indicate the average number of leads analyzed per sample, providing insight into the comprehensiveness of the ECG review. |
| Lead Assessment Accuracy | Verifies the accuracy of described lead findings (e.g., QRS, ST, T waves, amplitude, intervals, ST segments) against the ground truth interpretation. |
| ECG Feature Grounding | Determines if the interpretation references actual ECG features (e.g., QRS amplitude, PR interval) instead of generic terms. Results are scaled from 0 to 100. |
| Evidence-Based Reasoning | Evaluates whether the diagnosis follows logical, evidence-supported steps. Results range from 0 to 100. |
| Clinical Diagnostic Fidelity | Assesses if the model mimics how a clinician interprets ECG data, considering all relevant factors. Results are scaled from 0 to 100. |
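As a rough illustration of how two of these criteria aggregate per-sample judgments into a single number, the sketch below averages a per-sample lead count and a per-sample diagnosis accuracy. The function names and input layout are hypothetical, and averaging per sample first (rather than pooling all diagnoses) is an assumption. How the per-sample judgments themselves are produced is outside the scope of this sketch.

```python
from statistics import mean
from typing import List, Set


def lead_assessment_coverage(leads_per_sample: List[Set[str]]) -> float:
    """Average number of distinct leads analyzed per sample (0 to 12)."""
    return float(mean(len(leads) for leads in leads_per_sample))


def diagnosis_accuracy(correct_flags_per_sample: List[List[bool]]) -> float:
    """Percentage of key diagnoses judged correct, computed per sample and
    then averaged over samples. Samples with no key diagnoses are skipped."""
    per_sample = [100.0 * sum(flags) / len(flags)
                  for flags in correct_flags_per_sample if flags]
    return float(mean(per_sample))


# Two toy samples: one interpretation discusses three leads, the other one.
coverage = lead_assessment_coverage([{"I", "II", "V1"}, {"II"}])

# Sample 1 gets both key diagnoses right, sample 2 gets one of two right.
accuracy = diagnosis_accuracy([[True, True], [True, False]])

print(coverage, accuracy)
```

Count-based criteria such as Lead Assessment Coverage stay in absolute units, while Diagnosis Accuracy is a percentage, so the two are not directly comparable.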

Results Overview

Task 1: Grounded ECG Understanding

| Models | Diagnosis Accuracy | Analysis Completeness | Analysis Relevance | Lead Assessment Coverage | Lead Assessment Accuracy |
| --- | --- | --- | --- | --- | --- |
| PULSE | 75.94 | 2.37 | 2.39 | 5.84 | 2.00 |
| GEM (SFT LLaVA-7B) | 86.34 | 4.41 | 5.01 | 51.60 | 33.27 |
| GEM (SFT PULSE-7B) | 86.21 | 4.43 | 4.91 | 51.60 | 33.07 |

| Models | ECG Feature Grounding | Evidence-Based Reasoning | Clinical Diagnostic Fidelity | Average |
| --- | --- | --- | --- | --- |
| PULSE | 50.18 | 52.40 | 51.63 | 51.40 |
| GEM (SFT LLaVA-7B) | 74.48 | 75.09 | 75.28 | 75.28 |
| GEM (SFT PULSE-7B) | 74.95 | 74.70 | 74.87 | 74.84 |

Table 1: Grounded ECG Understanding results.
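The Average column in Table 1 is not defined on this page. For the PULSE and GEM (SFT PULSE-7B) rows, the reported value equals the arithmetic mean of the three 0-to-100 scores (ECG Feature Grounding, Evidence-Based Reasoning, Clinical Diagnostic Fidelity), which the short check below reproduces. Reading Average as that mean is an assumption based on these two rows, and the helper name is hypothetical.

```python
from statistics import mean

# (ECG Feature Grounding, Evidence-Based Reasoning,
#  Clinical Diagnostic Fidelity, reported Average), copied from Table 1.
rows = {
    "PULSE": (50.18, 52.40, 51.63, 51.40),
    "GEM (SFT PULSE-7B)": (74.95, 74.70, 74.87, 74.84),
}


def mean_of_scaled_scores(grounding: float, reasoning: float,
                          fidelity: float) -> float:
    """Arithmetic mean of the three 0-100 scores, rounded to two decimals."""
    return round(mean([grounding, reasoning, fidelity]), 2)


recomputed = {name: mean_of_scaled_scores(*scores[:3])
              for name, scores in rows.items()}

for name, scores in rows.items():
    print(f"{name}: reported {scores[3]:.2f}, recomputed {recomputed[name]:.2f}")
```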

Task 2: ECG-Bench (Abnormality Detection)

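| Models | PTB-XL Super AUC | PTB-XL Super F1 | PTB-XL Super HL | CODE-15% AUC | CODE-15% F1 | CODE-15% HL | CPSC 2018 AUC | CPSC 2018 F1 | CPSC 2018 HL | CSN Accuracy | G12EC Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 50.3 | 33.2 | 50.1 | 48.8 | 15.0 | 32.1 | 51.2 | 15.1 | 28.8 | 11.6 | 12.1 |
| GPT-4o | 55.6 | 28.3 | 26.2 | 59.9 | 24.9 | 15.7 | 50.9 | 10.6 | 18.2 | 57.5 | 49.2 |
| PULSE | 82.4 | 74.8 | 11.0 | 90.7 | 85.4 | 5.0 | 76.9 | 57.6 | 8.6 | 85.2 | 78.2 |
| GEM (SFT LLaVA-7B) | 81.8 | 73.6 | 11.6 | 90.5 | 84.8 | 5.1 | 74.1 | 52.0 | 9.0 | 92.6 | 81.8 |
| GEM (SFT PULSE-7B) | 83.4 | 75.8 | 11.0 | 91.5 | 86.4 | 4.7 | 79.1 | 61.1 | 8.1 | 86.2 | 80.5 |
| Ablations | | | | | | | | | | | |
| GEM (TS only) | 81.2 | 72.5 | 11.9 | 90.8 | 84.9 | 5.0 | 76.3 | 54.0 | 8.5 | 91.6 | 81.4 |
| GEM (TS+IMG) | 82.7 | 74.8 | 11.1 | 91.3 | 86.3 | 4.6 | 74.4 | 51.5 | 8.8 | 90.1 | 81.1 |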

Table 2: ECG-Bench abnormality detection results.
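For readers unfamiliar with the multi-label metrics in Table 2, the sketch below computes two of them on toy data. It assumes that HL stands for Hamming loss (lower is better) and that F1 is macro-averaged over labels. Neither is stated on this page, and the benchmark's own evaluation code may differ.

```python
from typing import List

Labels = List[List[int]]  # one row per ECG, one 0/1 entry per label


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Percentage of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return 100.0 * wrong / total


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Unweighted mean of per-label F1 scores, as a percentage."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Define F1 as 0 when a label has no positives and no predictions.
        scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(scores) / n_labels


# Two ECGs, three candidate abnormalities each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

hl = hamming_loss(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"HL={hl:.2f}  macro-F1={f1:.2f}")
```

In this toy case, two of six label slots are wrong and the third label is never predicted correctly, which is why the Hamming loss is nonzero and the macro F1 falls below 100.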

Task 3: ECG-Bench (Report Generation & ECG-QA)

| Models | PTB-XL Report (Report Score) | ECG-QA (Accuracy) |
| --- | --- | --- |
| Random | 0 | 16.2 |
| GPT-4o | 50.2 | 35.2 |
| PULSE | 61.3 | 73.8 |
| GEM (SFT LLaVA-7B) | 65.0 | 71.0 |
| GEM (SFT PULSE-7B) | 67.1 | 73.6 |

Table 3: ECG-Bench report generation and QA results.

Demonstrations

Case 1: Bradycardia

Case Study 1

Case 2: Atrial Fibrillation

Case Study 2

Case 3: Tachycardia

Case Study 3

BibTeX

@misc{lan2025gemempoweringmllmgrounded,
      title={GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images}, 
      author={Xiang Lan and Feng Wu and Kai He and Qinghao Zhao and Shenda Hong and Mengling Feng},
      year={2025},
      eprint={2503.06073},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.06073}, 
}