2024 Scientific Sessions

OR01-1
Extracting Data from Catheterization Reports Using Generative AI

Presenter

Yuval Barak-Corren, Children's Hospital of Philadelphia, Wynnewood, PA
Yuval Barak-Corren1, Jessica Tang, M.D.2, Christopher L Smith, MD, PhD2, Ryan M. Callahan, M.D., FSCAI3, Yoav Dori, MD, PhD2, Matthew J. Gillespie, M.D., FSCAI2, Jonathan J. Rome, M.D.2 and Michael L O’Byrne, MD, MSCE2, (1)Children's Hospital of Philadelphia, Wynnewood, PA, (2)Children's Hospital of Philadelphia, Philadelphia, PA, (3)Children's Hospital of Philadelphia, Wayne, PA

Keywords: Cath Lab Administration, Congenital Heart Disease (CHD) and Quality

Background:
In interventional cardiology, critical information often resides in the unstructured text of catheterization reports. This poses a challenge for chart review, especially in multi-site studies where reporting styles and medical terminology vary. Generative AI tools such as ChatGPT could streamline this process, improving uniformity and efficiency; however, the reliability and accuracy of the data they extract remain uncertain. Assessing this was the focus of our study.

Methods:
The Children’s Hospital of Philadelphia Cardiac Center Data Warehouse was queried for all cardiac catheterization text reports between 06/27/2016 and 12/31/2022 (6.5 years). Only the ‘Hemodynamics’ section was extracted and used for analysis. Two ChatGPT models were used to extract pulmonary (Qp) and systemic (Qs) flow data: ChatGPT Plus (GPT-4, release of 05/03/2023) and a fine-tuned GPT-3 model. The Plus model received detailed instructions and 15 few-shot examples; the GPT-3 model was fine-tuned on 300 report-output pairs. Both models returned their output in table format, which was used for further analysis. A random sample of 100 cases was used to manually evaluate each model’s accuracy.
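The abstract does not include the prompt or code used; the following is a minimal sketch of the few-shot extraction step described above, assuming the OpenAI Python SDK. The model name, system prompt, and example report are illustrative placeholders, not the study's actual materials.

```python
# Minimal sketch of the few-shot extraction step (illustrative only).
# Assumes the OpenAI Python SDK (openai>=1.0); the prompt, model name, and
# example below are hypothetical and not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You extract hemodynamic data from cardiac catheterization reports. "
    "Return a table with the columns: Qp, Qs, Qp:Qs. "
    "Use 'NA' when a value is not reported."
)

# A few-shot example (report text -> expected table row);
# the study supplied 15 such examples with detailed instructions.
FEW_SHOT = [
    {"role": "user", "content": "Hemodynamics: Qp 4.2 L/min/m2, Qs 3.1 L/min/m2."},
    {"role": "assistant", "content": "Qp | Qs | Qp:Qs\n4.2 | 3.1 | 1.35"},
]

def extract_flows(hemodynamics_text: str) -> str:
    """Send one 'Hemodynamics' section to the model and return its table output."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
                {"role": "user", "content": hemodynamics_text}]
    response = client.chat.completions.create(
        model="gpt-4",     # stand-in for the GPT-4 model accessed via ChatGPT Plus
        temperature=0,     # deterministic output for an extraction task
        messages=messages,
    )
    return response.choices[0].message.content
```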

Results:
A total of 3,351 reports were identified, of which 2,699 (81%) mentioned either Qp or Qs in the text. Both models successfully processed all reports. In the 100 sampled cases, ChatGPT Plus (GPT-4) had a 14% error rate, versus 11% for the fine-tuned GPT-3 model. Using GPT-4 to validate and correct its own results did not improve accuracy. The two models agreed on Qp, Qs, and Qp:Qs in 84%, 92%, and 76% of cases, respectively. In a random sample of 100 cases in which both models returned identical extractions, accuracy was 97%; the 3 errors comprised a typo in the original report and 2 missed values.
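The agreement figures above suggest a simple concordance filter: accept an extraction only when both models return the same values, and flag the remainder for manual review. Below is a minimal sketch of such a check, assuming each model's output has already been parsed into a pandas DataFrame keyed by report ID; the column and key names are hypothetical.

```python
# Minimal sketch of the cross-model agreement check (illustrative only).
# Assumes each model's extractions are in a pandas DataFrame keyed by
# report_id with columns Qp, Qs, and Qp_Qs; names are hypothetical.
import pandas as pd

def concordant_extractions(gpt4_df: pd.DataFrame, gpt3_df: pd.DataFrame) -> pd.DataFrame:
    """Keep only reports where both models returned identical Qp, Qs, and Qp:Qs values."""
    merged = gpt4_df.merge(gpt3_df, on="report_id", suffixes=("_gpt4", "_gpt3"))
    agree = (
        (merged["Qp_gpt4"] == merged["Qp_gpt3"])
        & (merged["Qs_gpt4"] == merged["Qs_gpt3"])
        & (merged["Qp_Qs_gpt4"] == merged["Qp_Qs_gpt3"])
    )
    return merged[agree]
```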

Conclusions:
ChatGPT offers an exciting opportunity for chart review and for analyzing large amounts of clinical data. Nevertheless, it is not error-free, with an error rate of 11-14% in our study. Using two different models, each with a different approach, and accepting only concordant extractions can reduce the error rate to 2-3%. As these models improve, and as hospitals deploy in-house versions that can handle protected health information, this could become a powerful tool for future research studies.