Extracting Social Determinants of Health From Electronic Health Records: Development and Comparison of Rule-Based and Large Language Model Methods.
| Abstract | BACKGROUND: Social determinants of health (SDoH) are critical drivers of health outcomes but are often underdocumented in structured electronic health record (EHR) data. Instead, SDoH are more commonly recorded in unstructured clinical notes, and unlocking this information could have far-reaching implications for advancing population health research and informing clinical decision-making. OBJECTIVE: This study develops and systematically evaluates cost-efficient methods for extracting SDoH information from unstructured clinical text using rule-based natural language processing (NLP) and large language model (LLM)-based approaches. METHODS: We constructed a gold-standard annotated corpus comprising clinical text segments from 171 patients in the Mass General Brigham Research Patient Data Registry, covering 7 SDoH domain categories and 23 subcategories. A rule-based system (RBS) was developed and evaluated alongside 7 OpenAI GPT models (GPT-4o, 4.1, 4.1-mini, o4-mini, GPT-5, GPT-5-mini, and o3) under zero-shot and few-shot settings using multiple prompting strategies. We additionally implemented late-fusion ensemble approaches that combined outputs from rule- and LLM-based methods. Performance was assessed using precision, recall, and F-score, alongside qualitative error analysis. RESULTS: The RBS achieved high precision for SDoH domain categories (0.96) but substantially lower recall (0.68). GPT-based models consistently outperformed the RBS in overall recall and F-scores. The best domain-level performance was observed for GPT-5 and GPT-5-mini in few-shot settings (F-score=0.89), while o4-mini achieved the highest subcategory-level performance (F-score=0.88). A late-fusion ensemble integrating RBS and GPT outputs further improved domain-level performance (F-score=0.92), with balanced precision (0.93) and recall (0.90), but did not improve subcategory-level performance. CONCLUSIONS: Recent GPT models with advanced reasoning capabilities, including the newly released mini models (eg, o4-mini and GPT-5-mini), demonstrated strong performance for SDoH extraction without task-specific fine-tuning and consistently outperformed the rule-based NLP system. Integrating rule- and LLM-based methods via late fusion further enhanced domain-level extraction performance. Our results demonstrate a cost-efficient framework for the accurate identification of SDoH from clinical text, facilitating downstream population health research and clinical informatics applications. |
| Year of Publication | 2026 |
| Journal | JMIR medical informatics |
| Volume | 14 |
| Pages | e89534 |
| Date Published | 05/2026 |
| ISSN | 2291-9694 |
| DOI | 10.2196/89534 |
| PubMed ID | 42155986 |
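
The abstract reports precision, recall, and F-score and describes a late-fusion ensemble of rule-based and LLM outputs, without specifying the fusion rule. As a minimal sketch of how such an evaluation might look, the Python below scores sets of (segment, domain) predictions against a gold set and fuses two systems by taking the union of their labels. The union rule, the balanced F1 form of the F-score, the function names, and the toy labels are all assumptions made for illustration; none of them are taken from the paper.

```python
from typing import Set, Tuple

# A prediction is a (segment_id, sdoh_domain) pair. Both the segment ids and
# the domain names used below are hypothetical, not the paper's label set.
Label = Tuple[str, str]


def precision_recall_f1(predicted: Set[Label], gold: Set[Label]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted labels against gold labels."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def late_fusion_union(rule_based: Set[Label], llm: Set[Label]) -> Set[Label]:
    """Assumed fusion rule: keep every label emitted by either system."""
    return rule_based | llm


# Toy data (hypothetical): the rule-based output is precise but misses labels,
# the LLM output finds more but includes one wrong domain.
gold = {("s1", "housing"), ("s2", "employment"), ("s3", "food"), ("s4", "transportation")}
rbs_output = {("s1", "housing"), ("s2", "employment")}
llm_output = {("s2", "employment"), ("s3", "food"), ("s4", "social_support")}
fused_output = late_fusion_union(rbs_output, llm_output)

if __name__ == "__main__":
    for name, output in [("RBS", rbs_output), ("LLM", llm_output), ("Fused", fused_output)]:
        p, r, f = precision_recall_f1(output, gold)
        print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case the union recovers labels that either system alone missed, at the cost of inheriting the LLM's false positive. Other fusion rules (for example, accepting an LLM label only when a rule also fires) would trade recall for precision instead.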