How large language models are entering the clinic

Healthcare organizations are using large language models (LLMs) to help physicians make decisions, streamline clinical documentation, support the interpretation of medical images, and much more. As these artificial intelligence tools gain traction, it’s important that providers understand what LLMs are and how to use them ethically.

We’ve written a lot about LLMs when discussing key applications of AI in healthcare. For example, we’ve covered how clinicians use AI to streamline clinical documentation and how AI is transforming healthcare behind the scenes.

This article explains how to avoid pitfalls when incorporating LLMs into clinical practice. Our recommendations are based on a 2024 review of LLMs in medicine conducted by researchers at Stanford University.

What Are Large Language Models?

Large language models (LLMs) are artificial intelligence models trained on vast text datasets to generate humanlike outputs. LLMs captured public attention in 2022 with the release of ChatGPT, an AI-powered chatbot that can answer questions and engage in humanlike conversations. Since ChatGPT’s launch, many industries—including healthcare—have raced to adopt LLM solutions.

While large language models are a type of generative AI, not all generative AI systems are LLMs. LLMs focus on generating and understanding language, whereas generative AI is a broader category of technologies that can produce text, images, music, and even videos.

LLM Training 101

A basic understanding of LLMs can help clinicians think critically about how to use them at work. Model developers “train” LLMs using three general steps:

  • Pre-training: The first step is to feed the model vast amounts of text so it learns meaningful statistical relationships between words while filtering out noise. Training text can be general, publicly available data (internet text, Wikipedia, social media posts, etc.) or specialized datasets such as scientific articles. After this initial training, a model can generate fluent language but lacks the capacity for nuanced tasks.
  • Fine-tuning: Next, developers train the model on smaller, task-specific datasets, such as medical transcripts for a healthcare application. This stage adapts the base model, embedding predefined rules or principles so it can perform particular tasks with more controlled outputs.
  • Prompting: After fine-tuning, most models are ready for general use. However, carefully designed prompts written by someone with specialized knowledge can further tailor a model to perform specialized tasks, such as those required in clinical applications.

This basic LLM development process has produced many models, including several pre-trained or fine-tuned using medical literature. Unsurprisingly, researchers have found that LLMs trained on data specific to medicine are better at healthcare-related tasks. However, developers have successfully adapted general models like GPT for medical applications.
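To make the fine-tuning step more concrete, here is a minimal sketch using the Hugging Face transformers library. The base model name and the training file (a hypothetical collection of de-identified medical transcripts) are placeholders for illustration, not recommendations for clinical use.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "gpt2"  # placeholder: any pre-trained causal language model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical fine-tuning corpus: de-identified medical transcripts,
# one example per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "medical_transcripts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-clinical-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    # mlm=False -> standard next-token (causal) language modeling objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, clinical fine-tuning also involves careful dataset curation, evaluation against held-out cases, and review by domain experts before any deployment.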

How Are Large Language Models Used in Medicine?

Healthcare practitioners are adopting large language models for all sorts of tasks. A 2024 review by Stanford University researchers described four key categories of LLM use in medicine:

  • Administrative: Summarizing medical notes, writing prior authorization letters, and drafting patient communication
  • Augmenting knowledge: Answering diagnostic questions, providing medical management advice, and translating patient education material
  • Medical education: Writing recommendation letters, creating exam questions, and summarizing medical text for students
  • Medical research: Generating research ideas and writing grants and academic papers

Each of these applications has the potential to make medicine much more efficient. For example, providers are using LLM-powered AI medical scribes to automatically draft a structured clinical note by listening to the patient-provider conversation during a visit.

However, with great power comes great responsibility. Understanding the potential pitfalls of LLMs in medicine empowers clinicians to implement appropriate mitigation strategies and ensure they adopt this technology thoughtfully.

Potential Pitfalls of LLMs in Medicine

While researchers and practitioners agree that large language models are powerful medical tools, studies also document important limitations. The 2024 Stanford review identified three potential pitfalls that clinicians using this technology should be aware of.

1. Accuracy Issues and Bias

LLMs are only as reliable as the data they are trained on. Since these models pull from vast datasets—containing verified and unverified sources—they can generate medically inaccurate information. Sometimes, LLMs fabricate references or generate plausible but incorrect medical advice.

Bias is another major issue. If LLM training data reflects existing disparities in healthcare, it may reinforce those biases. For example, studies have shown that some LLMs promote race-based medicine by recommending different pain management treatments for Black and White patients—a practice that is not supported by modern medical research.

2. Inconsistent Inputs and Outputs

LLMs rely heavily on input quality, meaning how you phrase a prompt significantly influences the response. Small changes in wording can produce drastically different outputs.
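One common way teams reduce this sensitivity is to standardize prompts behind a shared template rather than letting each user phrase requests ad hoc. The sketch below assumes a hypothetical note-summarization task; the template wording is illustrative, not a validated clinical prompt.

```python
# A fixed prompt template so every request to the model has the same
# structure; only the transcript changes between requests.
SUMMARY_TEMPLATE = (
    "You are assisting with clinical documentation.\n"
    "Summarize the following visit transcript as a structured SOAP note.\n"
    "Transcript:\n{transcript}\n"
)

def build_summary_prompt(transcript: str) -> str:
    """Return the same prompt structure for every transcript."""
    return SUMMARY_TEMPLATE.format(transcript=transcript.strip())

print(build_summary_prompt("Patient reports two weeks of intermittent chest pain..."))
```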

Another challenge is model drift, where an LLM’s outputs evolve unpredictably over time. Since these models are periodically updated, responses may differ from one version to the next, making it difficult for clinicians to rely on consistent results.

3. Privacy and Ethical Concerns

As with any digital tool that touches protected health information, LLM-based systems must comply with HIPAA. However, they also introduce some unique risks that practitioners should be aware of.

Because LLMs are trained on vast datasets, they can inadvertently memorize and reproduce sensitive patient details when generating text. Even when patient data is de-identified, AI models can sometimes cross-reference information to infer identities. To mitigate these risks, LLMs used in healthcare require strict access controls, data anonymization techniques, and policies that maintain patient privacy.
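As one illustration of data minimization, some teams scrub obvious identifiers from text before it ever reaches an LLM. The sketch below uses a few regular expressions as a rough example only; real de-identification pipelines are far more thorough and should be validated against HIPAA’s de-identification standards.

```python
import re

# Illustrative patterns only; real de-identification covers many more
# identifier types (names, addresses, MRNs, rare diagnoses, etc.).
REDACTION_PATTERNS = {
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN]",              # US Social Security numbers
    r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b": "[PHONE]",   # US phone numbers
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b": "[EMAIL]",       # email addresses
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b": "[DATE]",        # slash-formatted dates
}

def redact(text: str) -> str:
    """Replace matched identifiers with placeholder tokens."""
    for pattern, placeholder in REDACTION_PATTERNS.items():
        text = re.sub(pattern, placeholder, text)
    return text

note = "Pt called 555-867-5309 on 03/14/2024 about lab results."
print(redact(note))  # -> "Pt called [PHONE] on [DATE] about lab results."
```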

LLMs also raise ethical concerns around plagiarism and authorship. Generative AI is so good at mimicking existing text that distinguishing machine-generated text from original work is often challenging. To maintain integrity, clinicians should use LLMs as assistive tools, not as replacements for their own writing and judgment. Always follow your organization’s guidelines for responsible AI use.

Large Language Models: Looking Ahead

As LLMs become more advanced, their uses in medicine will continue to expand. Researchers are already developing multimodal AI models, which integrate text, images, and other data types to improve clinical decision-making. Such models may soon analyze lab results, medical imaging, and genomic data alongside textual information.

To ensure that LLMs enhance healthcare safely, developers need to work with policymakers and clinicians to establish guidelines that protect patient privacy, reduce bias, and maintain accuracy. Physicians who understand and use AI tools have an essential role to play in developing regulations that harness the power of LLMs while minimizing risks.
