Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models.
| Authors | |
| Keywords | |
| Abstract | OBJECTIVE: Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered by information being buried in unstructured notes. This study aims to evaluate the capabilities of advanced open-sourced and proprietary large language models (LLMs) in extracting medications and classifying their medication status from EHR notes, focusing on their scalability for medication information extraction without human annotation. STUDY DESIGN AND SETTING: We collected three EHR datasets from diverse sources to build the evaluation benchmark: one publicly available dataset (Re-CASI), one we annotated based on public MIMIC notes (MIV-Med), and one internally annotated on clinical notes from Mass General Brigham (MGB-Med). We evaluated 12 advanced LLMs, including general-domain open-sourced models (e.g., Llama-3.1-70B-Instruct, Qwen2.5-72B-Instruct), medical-specific models (e.g., MeLLaMA-70B-chat), and a proprietary model (GPT-4o). We explored multiple LLM prompting strategies, including zero-shot, 5-shot, and Chain-of-Thought (CoT) approaches. Performance on medication extraction, medication status classification, and their joint task (extraction then classification) was systematically compared across all experiments. RESULTS: LLMs showed promising performance on medication extraction, while discontinuation classification and the joint task were more challenging. GPT-4o consistently achieved the highest average F1 scores in all tasks under the zero-shot setting: 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely, with Llama-3.1-70B-Instruct achieving the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs demonstrated lower performance compared to advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains. Notably, open-sourced models occasionally surpassed GPT-4o, underscoring their potential in privacy-sensitive clinical research. CONCLUSION: LLMs demonstrate strong potential for medication extraction and discontinuation identification in EHR notes, with open-sourced models offering scalable alternatives to proprietary systems and few-shot learning further improving LLMs' capability. PLAIN LANGUAGE SUMMARY: Stopping a medicine can affect safety and treatment decisions, yet this detail is often buried in long electronic health record notes. We evaluated whether large language models, which read and summarize text, can automatically find medication names and decide whether each medicine is still being taken, has been stopped, or neither. We tested 12 models, including open-source options suitable for secure hospital use, on three collections of clinical notes and compared three simple instruction styles: giving no examples, showing a few examples, and asking for step-by-step reasoning. All models produced usable results. The strongest systems scored about 94 for finding medication names and about 78 for deciding continued or stopped status, on a standard 0-to-100 measure that balances completeness and correctness. Showing a few examples usually helped more than step-by-step prompts, and several open-source models performed close to a leading proprietary system. These tools could help hospitals and researchers monitor medications at scale to support drug-safety studies, adherence tracking, and clinical decision support, with local validation and safeguards before clinical use. (A minimal prompting sketch follows the record below.) |
| Year of Publication | 2025 |
| Journal | Journal of clinical epidemiology |
| Pages | 112049 |
| Date Published | 11/2025 |
| ISSN | 1878-5921 |
| DOI | 10.1016/j.jclinepi.2025.112049 |
| PubMed ID | 41232578 |
| Links | |
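
The sketch below illustrates the kind of zero-shot prompting setup the abstract describes: one LLM call per clinical note that extracts medication names and labels each as continued, discontinued, or neither. The prompt wording, the JSON output format, and the use of the OpenAI Python client are illustrative assumptions, not the authors' published prompts or pipeline.

```python
# Hypothetical zero-shot sketch of the joint task described in the abstract
# (medication extraction followed by status classification). Prompt text,
# label names, and output schema are assumptions for illustration only.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ZERO_SHOT_PROMPT = (
    "You are a clinical NLP assistant. From the clinical note below, list every "
    "medication mentioned and classify its status as 'continued', 'discontinued', "
    "or 'neither'. Respond only with a JSON array of objects with keys "
    "'medication' and 'status'.\n\nNote:\n{note}"
)


def extract_medications(note: str, model: str = "gpt-4o") -> list[dict]:
    """Run the joint extraction + status classification task on a single note."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output is preferable for evaluation
        messages=[{"role": "user", "content": ZERO_SHOT_PROMPT.format(note=note)}],
    )
    # The model is asked to return JSON; fall back to an empty list if parsing fails.
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return []


if __name__ == "__main__":
    example_note = "Patient advised to stop lisinopril; continue metformin 500 mg daily."
    print(extract_medications(example_note))
```

A 5-shot variant would prepend a few annotated note/answer pairs to the prompt, and a CoT variant would ask the model to reason step by step before emitting the JSON; swapping `model` for a locally hosted open-source model (e.g., Llama-3.1-70B-Instruct behind an OpenAI-compatible endpoint) is one way such a pipeline could be kept inside a privacy-sensitive environment.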