More physicians are turning to AI-generated clinical notes to assist with documentation. AI scribes can save providers significant time and improve practice efficiency. But are clinical notes produced by large language models like ChatGPT-4 accurate enough to be trusted?
A 2024 study published in the Journal of Medical Internet Research sought to answer this question. Researchers fed ChatGPT-4 transcripts from 14 simulated patient-provider encounters and prompted the model to generate a SOAP note for each one. They then evaluated the outputs using the Physician Documentation Quality Instrument (PDQI-9).
This article reviews this recent research and discusses its implications for healthcare providers considering AI scribes to assist with clinical documentation.
Evaluating AI-Generated Medical Notes for Accuracy
The PDQI-9 assesses the quality of clinical documentation across nine criteria: up-to-date, accurate, thorough, useful, organized, comprehensible, succinct, synthesized, and consistent. Evaluators rate each criterion on a 5-point scale, so a perfect score of 45 points indicates a note that meets every criterion “extremely” well.
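To make the scoring concrete, here is a minimal sketch in Python of how a PDQI-9 total is tallied. The function name and example ratings are invented for illustration; only the nine criterion names come from the instrument itself.

```python
# Minimal sketch of PDQI-9 scoring: each of the nine criteria is rated
# on a 1-5 scale, so totals range from 9 (worst) to 45 (best).

PDQI9_CRITERIA = [
    "up-to-date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "consistent",
]

def pdqi9_total(ratings: dict) -> int:
    """Sum the nine 1-5 criterion ratings into a total score (9-45)."""
    for criterion in PDQI9_CRITERIA:
        if criterion not in ratings:
            raise ValueError(f"Missing rating for {criterion!r}")
        if not 1 <= ratings[criterion] <= 5:
            raise ValueError(f"{criterion!r} must be rated 1-5")
    return sum(ratings[c] for c in PDQI9_CRITERIA)

# Hypothetical example: a note rated 4 of 5 on most criteria but weak
# on thoroughness, the kind of omission-driven result described below.
ratings = {c: 4 for c in PDQI9_CRITERIA}
ratings["thorough"] = 2
print(pdqi9_total(ratings))  # 34 out of 45
```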
In the study, researchers found that while ChatGPT-4 consistently produced notes in the requested SOAP format, note accuracy varied widely. Some notes received “good” or “excellent” scores, but the mean accuracy was “bad” (scoring just 30 out of 45 points).
Researchers categorized ChatGPT-4’s errors into three types:
- Errors of omission
- Incorrect facts
- Additions (sometimes called “hallucinations”)
The most common error was omission, meaning the AI model left out critical details from the patient encounter.
The study also found an inverse correlation between note accuracy and transcript length: as the volume and complexity of information in the patient encounter increased, the accuracy of the AI-generated notes decreased. This finding suggests that ChatGPT-4 can handle simple medical cases reasonably well but may struggle with more complex scenarios.
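For readers who want to see what such a relationship looks like numerically, here is a brief sketch using made-up data; the study’s actual measurements and statistical method are not reproduced here, and a rank correlation is just one reasonable way to test the trend.

```python
# Hypothetical data for illustration only: transcript lengths (in words)
# paired with PDQI-9 scores for the resulting AI-generated notes.
from scipy.stats import spearmanr

transcript_words = [480, 620, 750, 910, 1100, 1340, 1580, 1820]
pdqi_scores = [41, 38, 36, 35, 31, 30, 27, 24]

# A negative coefficient means longer transcripts tend to earn lower scores.
rho, p_value = spearmanr(transcript_words, pdqi_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```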
Finally, the study noted considerable variance in note quality. When researchers presented the same transcript and prompted ChatGPT-4 to generate a second and third SOAP note, some versions were much more accurate than others.
Should Doctors Use AI-Generated Clinical Notes?
The study’s results offer two takeaways for doctors considering AI tools for clinical documentation.
1. Use AI software designed for medicine.
It’s important to note that the study cited above used the standard ChatGPT-4 model, whereas most doctors who adopt AI documentation software choose a version designed for medicine.
Leading ambient clinical intelligence tools such as Conveyor AI and Augmedix use generative AI technology similar to ChatGPT’s but outperform generic models in clinical settings. These tailored tools are fine-tuned on vast datasets of medical language and patient encounters, making them more reliable for clinical documentation.
For instance, AI tools built for medicine can interpret the nuances of patient-provider conversations, extract relevant medical details, and generate highly structured outputs that fit standard documentation formats like SOAP notes. Developers also refine these models to handle the unique challenges of healthcare interactions, including medical jargon, diagnoses, and treatments.
Reputable AI medical scribes also adhere to HIPAA standards, helping ensure that patient data stays secure. HIPAA compliance is critical for any AI tool used in clinical settings.
These differences highlight why physicians should opt for AI software built specifically for healthcare rather than relying on general-purpose models like ChatGPT.
2. Always review AI-generated notes for accuracy.
Physicians should always review AI-generated notes for accuracy, even when using tools designed for medicine. The study cited above demonstrates that while AI scribes can automate much of the note-taking process, they often produce errors.
It’s helpful to think of AI-generated clinical notes as analogous to medical dictation. Research from 2018 found that automated speech recognition software used by doctors produced an error rate of 7.4%. After review by a transcriptionist and physician, however, the error rate dropped to 0.3%, a roughly 25-fold reduction. Keep in mind that speech-to-text technology had only recently gone mainstream in 2018.
In the same way that we’ve become accustomed to reviewing dictated notes for errors, doctors should always review AI-generated clinical notes before saving them to the EMR. Specialized tools are much more accurate than ChatGPT, but physician review remains an essential final step.
What Research Shows About AI Clinical Documentation
AI tools offer a powerful solution for doctors who want to spend less time on the computer and more time treating patients. Leading AI medical scribes can complete 80% of your EMR notes simply by listening to a patient encounter. However, providers need to write the final 20% to ensure completeness.
Notes written by generative AI can eliminate hours of manual documentation time by allowing doctors to review and correct a drafted note rather than writing from scratch. However, the study summarized here emphasizes that physicians must remain vigilant about reviewing these notes for accuracy.
Additionally, the variance in quality and accuracy seen in large language models like ChatGPT-4 underscores the importance of using AI tools built for healthcare. Physicians who adopt HIPAA-compliant medical AI scribe software will benefit from more accurate, structured, and context-aware notes.
The possibilities for AI in clinical documentation are exciting. As the technology improves, AI-generated notes will become increasingly accurate and efficient, and these tools may eventually eliminate the need for manual EMR documentation altogether. For now, however, physicians must continue playing an active role by reviewing every note.