- Study found GPT-4-generated messages to patients were acceptable without any additional physician editing 58% of the time and provided more detailed educational information than those written by physicians
- AI-generated messages had shortcomings, including 7% of responses being deemed unsafe if left unedited
- Generative AI may promote efficiency and patient education, but requires a “doctor in the loop” and a cautious approach as hospitals integrate algorithms into electronic health records
A new study by investigators from Mass General Brigham demonstrates that large language models (LLMs), a type of generative AI, may help reduce physician workload and improve patient education when used to draft replies to patient messages. The study also found limitations to LLMs that may affect patient safety, suggesting that vigilant oversight of LLM-generated communications is essential for safe usage. The findings, published in The Lancet Digital Health, emphasize the need for a measured approach to LLM implementation.
Rising administrative and documentation responsibilities have contributed to increases in physician burnout. To help streamline and automate physician workflows, electronic health record (EHR) vendors have adopted generative AI algorithms to aid clinicians in drafting messages to patients; however, the efficiency, safety and clinical impact of their use had been unknown.
“Generative AI has the potential to provide a ‘best of both worlds’ scenario of reducing burden on the clinician and better educating the patient in the process,” said corresponding author Danielle Bitterman, MD, a faculty member in the Artificial Intelligence in Medicine (AIM) Program at Mass General Brigham and a physician in the Department of Radiation Oncology at Brigham and Women’s Hospital. “However, based on our team’s experience working with LLMs, we have concerns about the potential risks associated with integrating LLMs into messaging systems. With LLM-integration into EHRs becoming increasingly common, our goal in this study was to identify relevant benefits and shortcomings.”
For the study, the researchers used OpenAI’s GPT-4, a foundational LLM, to generate 100 scenarios about patients with cancer and an accompanying patient question. No questions from actual patients were used for the study. Six radiation oncologists manually responded to the queries; then, GPT-4 generated responses to the questions. Finally, the same radiation oncologists were provided with the LLM-generated responses for review and editing. The radiation oncologists did not know whether GPT-4 or a human had written the responses, and in 31% of cases, believed that an LLM-generated response had been written by a human.
On average, physician-drafted responses were shorter than the LLM-generated responses. GPT-4 tended to include more educational background for patients but was less directive in its instructions. The physicians reported that LLM-assistance improved their perceived efficiency and deemed the LLM-generated responses to be safe in 82.1 percent of cases and acceptable to send to a patient without any further editing in 58.3 percent of cases. The researchers also identified some shortcomings: If left unedited, 7.1 percent of LLM-generated responses could pose a risk to the patient and 0.6 percent of responses could pose a risk of death, most often because GPT-4’s response failed to urgently instruct the patient to seek immediate medical care.
Notably, LLM-generated/physician-edited responses were more similar in length and content to LLM-generated responses than to the manual responses. In many cases, physicians retained LLM-generated educational content, suggesting that they perceived it to be valuable. While this may promote patient education, the researchers emphasize that overreliance on LLMs may also pose risks, given their demonstrated shortcomings.
The emergence of AI tools in health care has the potential to positively reshape the continuum of care, and it is imperative to balance their innovative potential with a commitment to safety and quality. Mass General Brigham is leading the way in responsible use of AI, conducting rigorous research on new and emerging technologies to inform the incorporation of AI into care delivery, workforce support and administrative processes. Mass General Brigham is currently leading a pilot integrating generative AI into the electronic health record to draft replies to patient portal messages, testing the technology in a set of ambulatory practices across the health system.
Going forward, the study’s authors are investigating how patients perceive LLM-based communications and how patients’ racial and demographic characteristics influence LLM-generated responses, based on known algorithmic biases in LLMs.
“Keeping a human in the loop is an essential safety step when it comes to using AI in medicine, but it isn’t a single solution,” Bitterman said. “As providers rely more on LLMs, we could miss errors that could lead to patient harm. This study demonstrates the need for systems to monitor the quality of LLMs, training for clinicians to appropriately supervise LLM output, more AI literacy for both patients and clinicians, and on a fundamental level, a better understanding of how to address the errors that LLMs make.”
Authorship: Mass General Brigham co-authors include first author Shan Chen, MS, and Marco Guevara, Frank Hoebers, Benjamin Kann, Hugo Aerts and Raymond Mak of the AIM Program at Mass General Brigham and the Department of Radiation Oncology at Brigham and Women’s Hospital/Dana-Farber Cancer Institute, and Shalini Moningi, Hesham Elhalawani, Fallon Chipidza, and Jonathan Leeman (Brigham and Women’s Hospital). Additional co-authors include Timothy Miller, Guergana Savova, Jack Gallifant, Leo Celi, Maryam Lustberg, and Majid Afshar.
Disclosures: Bitterman is an Associate Editor of Radiation Oncology, HemOnc.org and receives funding from the American Association for Cancer Research. A complete list of disclosures is included in the paper.
Funding: Bitterman received financial support for this work from the National Institutes of Health (U54CA274516-01A1). Bitterman also received financial support from the Woods Foundation. A complete list of funding sources is included in the paper.
Paper cited: Chen, S et al. “The effect of using a large language model to respond to patient messages” Lancet Digital Health DOI: 10.1016/S2589-7500(24)00060-8
###
About Mass General Brigham
Mass General Brigham is an integrated academic health care system, uniting great minds to solve the hardest problems in medicine for our communities and the world. Mass General Brigham connects a full continuum of care across a system of academic medical centers, community and specialty hospitals, a health insurance plan, physician networks, community health centers, home care, and long-term care services. Mass General Brigham is a nonprofit organization committed to patient care, research, teaching, and service to the community. In addition, Mass General Brigham is one of the nation’s leading biomedical research organizations with several Harvard Medical School teaching hospitals. For more information, please visit massgeneralbrigham.org.
Journal
The Lancet Digital Health
Method of Research
Computational simulation/modeling
Subject of Research
People
Article Title
The effect of using a large language model to respond to patient messages
Article Publication Date
24-Apr-2024