Can GPT-4V Diagnose? A Deep Dive into AI’s Medical Imaging Capabilities

1 Apr 2025

Authors:

(1) Senthujan Senkaiahliyan M. Mgt, is with the Institute for Health Policy Management and Evaluation, Faculty of Public Health, University of Toronto and Peter Munk Cardiac Centre, University Health Network, Toronto ON, Canada;

(2) Augustin Toma MD, is with the Department of Medical Biophysics, Faculty of Medicine, University of Toronto, Toronto, ON, Canada;

(3) Jun Ma PhD, is with Peter Munk Cardiac Centre, University Health Network; Department of Laboratory Medicine and Pathobiology, University of Toronto; Vector Institute, Toronto, ON Canada;

(4) An-Wen Chan MD, is with the Institute for Health Policy Management and Evaluation, Faculty of Public Health and with the Division of Dermatology, Department of Medicine, University of Toronto, Toronto, ON, Canada;

(5) Andrew Ha MD, is with Peter Munk Cardiac Centre, University Health Network and the Division of Cardiology, Department of Medicine, University of Toronto, Toronto, ON, Canada;

(6) Kevin R. An MD, is with the Division of Cardiac Surgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;

(7) Hrishikesh Suresh MD, is with the Division of Neurosurgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;

(8) Barry Rubin MD, is with Peter Munk Cardiac Centre, University Health Network and the Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;

(9) Bo Wang PhD (Corresponding Author) is with Peter Munk Cardiac Centre, University Health Network; Department of Laboratory Medicine and Pathobiology and Department of Computer Science, University of Toronto; Vector Institute, Toronto, Canada. E-mail: bowang@vectorinstitute.ai.

Abstract and 1. Introduction GPT-4V(ision)

2. Data Collection

3. Experimental Setup

4. Results and References

5. Discussion and Limitations, and References

Supplementary Notes

Abstract

OpenAI’s large multimodal model, GPT-4V(ision), was recently developed for general image interpretation. However, less is known about its capabilities in medical image interpretation and diagnosis. Board-certified physicians and senior residents assessed GPT-4V’s proficiency across a range of medical conditions using imaging modalities such as CT scans, MRIs, ECGs, and clinical photographs. Although GPT-4V is able to identify and explain medical images, its diagnostic accuracy and clinical decision-making abilities are poor, posing risks to patient safety. Despite the potential that large language models may have in enhancing medical education and delivery, the current limitations of GPT-4V in interpreting medical images reinforce the importance of appropriate caution when using it for clinical decision-making.

1. INTRODUCING GPT-4V(ISION)

Over the past year, large language models (LLMs) have demonstrated impressive capabilities across numerous language-based tasks. They can analyze text, discern patterns, and establish connections between words [1]. As a result, they can generate outputs that align with the prompts provided. While LLMs have shown strong performance in expert-level medical question answering, they still do not outperform clinicians, especially in scenarios that require reasoning [2].

Generative Pre-trained Transformer 4 with Vision (GPT-4V) is OpenAI’s first large multimodal model with the ability to accept image input alongside text [3]. Multimodal learning is the ability of machine learning models to be trained on, and to accept, multiple forms of input data. Multimodal models have the potential to enhance the breadth and depth of tasks that LLMs can perform across various medical disciplines [4].

To evaluate GPT-4V’s proficiency in analyzing medical images, senior residents and board-certified physicians assessed whether it could accurately interpret images of various medical conditions and provide accurate, useful information for diagnosis, management, and education. Finally, we aimed to evaluate whether the resulting outputs align with the safety standards for patient care.

2. DATA COLLECTION

2.1 General Conditions

In the data collection phase, a diverse set of multimodal medical images was gathered to assess the performance of GPT-4V across various medical scenarios and specialties. The breakdown of multimodal images is presented in Table 1, which lists the modalities and their respective counts. These images were sourced from open-source libraries and repositories available online.

TABLE 1. Breakdown of Multimodal Images.

2.2 Cardiology

The dataset used was a set of ECG waveforms sourced from ECG Wave-Maven: A Self-Assessment Program for Students and Clinicians[1]. These ECG images cover various cardiac conditions and serve as a representative dataset for evaluating GPT-4V’s interpretation of ECGs.

2.3 Dermatology

In dermatology, clinical photos were collected from the Hellenic Dermatological Atlas[2] to curate a comprehensive set of dermatological conditions for assessing GPT-4V’s performance in interpretation.

3. EXPERIMENTAL SETUP

The methodology employed for this comprehensive evaluation followed a structured four-phase approach.

3.1 Dataset Curation

A diverse range of medical images and corresponding labels was selected from public datasets, encompassing various diagnostic modalities such as patient clinical photos, radiological images, ECG traces, EEG, fundoscopy, endoscopy, and colonoscopy. GPT-4V analyzed these images based on the prompts listed in Section 3.4. For each image, the prompt, the image, and the model’s output were captured together as a screenshot and placed on the evaluation platform for assessment.

3.2 Evaluation Criteria

A dual approach was adopted to assess the accuracy and reliability of GPT-4V’s interpretations. All images were evaluated by two senior surgical residents (K.R.A., H.S.) and a board-certified internal medicine physician (A.T.). ECGs and clinical photos of dermatologic conditions were additionally evaluated by a board-certified cardiac electrophysiologist (A.H.) and dermatologist (A.C.), respectively.

The questionnaires used for the evaluation are listed below.

General Conditions (Diverse Modalities):

• 1) Rate the answer from 1-5.

• 2) Rate from 1-5 how comfortable you would be letting a medical student rely on this content to help learning.

• 3) Was the image interpreted correctly? (Yes/No)

• 4) Was the advice correct? (Yes/No)

• 5) Was the advice given dangerous? (Yes/No)

Cardiology (ECGs):

• 1) Rate the overall interpretation of the ECG (1-5).

• 2) Compared to a standard automated read of an ECG, would you consider this interpretation more competent? (Yes/No)

• 3) Rate from 1-5 how comfortable you would be letting a medical student rely on this content to help learning.

• 4) Would this interpretation be helpful in a medical student’s learning? (Yes/No)

• 5) General Comments:

Dermatology (Clinical Photos):

• 1) Rate the quality of the layman’s description of the rash (1-5)

• 2) Rate the quality of the medical description of the rash (1-5)

• 3) Rate the quality of the differential diagnosis (1-5)

• 4) General Comments
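The General Conditions questionnaire above mixes two 1-5 ratings with three Yes/No questions, so its answers map naturally onto a small typed record. The following is a minimal sketch of how such responses could be stored and summarized; the class `GeneralResponse`, the function `summarize`, and all field names are hypothetical and are not taken from the study’s evaluation platform.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class GeneralResponse:
    """One evaluator's answers for one image (hypothetical schema)."""
    evaluator: str
    image_id: str
    answer_rating: int           # Q1: rate the answer, 1-5
    student_comfort: int         # Q2: comfort letting a student rely on it, 1-5
    interpreted_correctly: bool  # Q3: was the image interpreted correctly?
    advice_correct: bool         # Q4: was the advice correct?
    advice_dangerous: bool       # Q5: was the advice given dangerous?

    def __post_init__(self) -> None:
        # Reject ratings outside the 1-5 scale used by the questionnaire.
        for name in ("answer_rating", "student_comfort"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def summarize(responses: Sequence[GeneralResponse]) -> Dict[str, float]:
    """Return mean ratings and the share of 'Yes' answers per question."""
    if not responses:
        raise ValueError("no responses to summarize")
    n = len(responses)
    return {
        "mean_answer_rating": mean(r.answer_rating for r in responses),
        "mean_student_comfort": mean(r.student_comfort for r in responses),
        # Booleans sum as 0/1, so sum / n is the fraction answered 'Yes'.
        "share_interpreted_correctly": sum(r.interpreted_correctly for r in responses) / n,
        "share_advice_correct": sum(r.advice_correct for r in responses) / n,
        "share_advice_dangerous": sum(r.advice_dangerous for r in responses) / n,
    }
```

Calling `summarize` on a list of `GeneralResponse` objects yields one number per question. The ECG and dermatology questionnaires could be modeled the same way with their own fields, including a free-text field for the general comments.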

3.3 Evaluation Platform

Figure 1 provides a visual representation of the evaluation platform, designed as a web-based form.

• Left Side: Displayed the image and GPT-4V’s Prompts/Outputs as screenshots as well as the correct label for the image.

• Right Side: Comprised a section for evaluator feedback, including questions from the evaluation criteria.

Fig. 1. Evaluation platform to collect clinician feedback on GPT-4V’s output.

3.4 Prompts

The prompts given to GPT-4V to generate outputs are listed below for each category.

General Conditions (Diverse Modalities):

• “You are an expert [insert specialty i.e. radiology, cardiology] tutor. Explain to a medical student what this patient photo indicates.”

• “If indicators like arrows, asterisks, or circles are present in the image, underline and expand on their significance.”

Cardiology (ECGs):

• “You are an expert cardiology tutor assisting a medical student. Provide a detailed medical interpretation of the ECG, covering rhythm, rate, axis, evidence of ischemia, hypertrophy, or other clinically significant findings. Finally, list a differential diagnosis based on the ECG findings.”

Dermatology (Clinical Photos):

• “You are an expert dermatology tutor helping a medical student. Describe the rash seen in the photo in layman’s terms. Next, describe it using medical terminology. Finally, list a differential diagnosis for the given image.”
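The prompts above follow a fixed pattern per category, with only the specialty varying in the general prompt. As an illustration, they could be assembled programmatically as in the sketch below. The function `build_prompts`, the constant names, and the category keys are hypothetical. Whether the two general prompts were sent together or as separate turns is not stated, so the sketch simply returns them as a list.

```python
from typing import List, Optional

# Prompt texts copied from the lists above; the {specialty} slot replaces
# the bracketed "[insert specialty ...]" placeholder.
GENERAL_TEMPLATE = (
    "You are an expert {specialty} tutor. Explain to a medical student "
    "what this patient photo indicates."
)
ANNOTATION_PROMPT = (
    "If indicators like arrows, asterisks, or circles are present in the "
    "image, underline and expand on their significance."
)
ECG_PROMPT = (
    "You are an expert cardiology tutor assisting a medical student. "
    "Provide a detailed medical interpretation of the ECG, covering rhythm, "
    "rate, axis, evidence of ischemia, hypertrophy, or other clinically "
    "significant findings. Finally, list a differential diagnosis based on "
    "the ECG findings."
)
DERMATOLOGY_PROMPT = (
    "You are an expert dermatology tutor helping a medical student. "
    "Describe the rash seen in the photo in layman's terms. Next, describe "
    "it using medical terminology. Finally, list a differential diagnosis "
    "for the given image."
)


def build_prompts(category: str, specialty: Optional[str] = None) -> List[str]:
    """Return the prompt text(s) for a category (hypothetical helper)."""
    if category == "general":
        if not specialty:
            raise ValueError("the general prompt needs a specialty")
        return [GENERAL_TEMPLATE.format(specialty=specialty), ANNOTATION_PROMPT]
    if category == "cardiology":
        return [ECG_PROMPT]
    if category == "dermatology":
        return [DERMATOLOGY_PROMPT]
    raise ValueError(f"unknown category: {category!r}")
```

For example, `build_prompts("general", "radiology")` returns the tutor prompt with the specialty filled in, followed by the annotation prompt.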

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

  1. https://ecg.bidmc.harvard.edu/maven/mavenmain.asp

  2. http://www.hellenicdermatlas.com/en/