NLP Methods for Extraction of Symptoms from Unstructured Data for Use in Prognostic COVID-19 Analytic Models | Journal of Artificial Intelligence Research


Published: Oct 14, 2021

DOI: https://doi.org/10.1613/jair.1.12631

Keywords:

Natural language processing, COVID-19, Information extraction, UMLS

Greg M. Silverman

Himanshu S. Sahoo

NLP/IE Program, Department of Electrical and Computer Engineering, University of Minnesota

Nicholas E. Ingraham

Division of Pulmonary, Allergy, Critical Care, and Sleep Medicine, University of Minnesota

Monica Lupei

Division of Critical Care, Department of Anesthesiology, University of Minnesota

Michael A. Puskarich

Department of Emergency Medicine, University of Minnesota

Michael Usher

Department of Medicine, University of Minnesota

James Dries

University of Minnesota

Raymond L. Finzel

NLP/IE Program, College of Pharmacy, University of Minnesota

Eric Murray

Information Technology, M Health Fairview

John Sartori

Department of Electrical and Computer Engineering, University of Minnesota

Gyorgy Simon

Institute for Health Informatics, University of Minnesota

Rui Zhang

Genevieve B. Melton

NLP/IE Program, Department of Surgery, and Institute for Health Informatics, University of Minnesota, Fairview Health Services, Information Technology

Christopher J. Tignanelli

NLP/IE Program, Department of Surgery, University of Minnesota

Serguei VS Pakhomov

NLP/IE Program, College of Pharmacy, University of Minnesota

Abstract

Statistical modeling of outcomes based on a patient's presenting symptoms (symptomatology) can help deliver high quality care and allocate essential resources, which is especially important during the COVID-19 pandemic. Patient symptoms are typically found in unstructured notes, and thus not readily available for clinical decision making. In an attempt to fill this gap, this study compared two methods for symptom extraction from Emergency Department (ED) admission notes. Both methods utilized a lexicon derived by expanding the Centers for Disease Control and Prevention's (CDC) Symptoms of Coronavirus list. The first method utilized a word2vec model to expand the lexicon using a dictionary mapping to the Unified Medical Language System (UMLS). The second method utilized the expanded lexicon as a rule-based gazetteer and the UMLS. These methods were evaluated against a manually annotated reference (F1-score of 0.87 for the UMLS-based ensemble and 0.85 for the rule-based gazetteer with UMLS). Through analyses of associations of extracted symptoms used as features against various outcomes, salient risks among the population of COVID-19 patients, including increased risk of in-hospital mortality (OR 1.85, p-value < 0.001), were identified for patients presenting with dyspnea. Disparities between English and non-English speaking patients were also identified, the most salient being a concerning finding of opposing risk signals between fatigue and in-hospital mortality (non-English: OR 1.95, p-value = 0.02; English: OR 0.63, p-value = 0.01). While use of symptomatology for modeling of outcomes is not unique, unlike previous studies this study showed that models built using symptoms with the outcome of in-hospital mortality were not significantly different from models using data collected during an in-patient encounter (AUC of 0.9 with 95% CI of [0.88, 0.91] using only vital signs; AUC of 0.87 with 95% CI of [0.85, 0.88] using only symptoms).
These findings indicate that prognostic models based on symptomatology could aid in extending COVID-19 patient care through telemedicine, replacing the need for in-person options. The methods presented in this study have potential for use in development of symptomatology-based models for other diseases, including for the study of Post-Acute Sequelae of COVID-19 (PASC).
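
The rule-based gazetteer idea and the F1-score evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the lexicon entries, the function names (`build_pattern`, `extract_symptoms`, `f1_score`), and the sample note are all hypothetical, and the sketch omits UMLS mapping, word2vec expansion, and negation handling (so "denies fever" would still match "fever").

```python
import re

# Hypothetical toy lexicon: surface form -> normalized symptom label.
# A real lexicon would be far larger and derived from curated sources.
LEXICON = {
    "shortness of breath": "dyspnea",
    "dyspnea": "dyspnea",
    "sob": "dyspnea",
    "fatigue": "fatigue",
    "tired": "fatigue",
    "fever": "fever",
    "cough": "cough",
}


def build_pattern(lexicon):
    """Compile one case-insensitive regex that matches any lexicon term."""
    # Longest terms first so multi-word phrases win over shorter overlaps.
    terms = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def extract_symptoms(note, lexicon=LEXICON):
    """Return the set of normalized symptom labels mentioned in a note."""
    pattern = build_pattern(lexicon)
    return {lexicon[match.group(1).lower()] for match in pattern.finditer(note)}


def f1_score(predicted, reference):
    """F1 between a predicted label set and a manually annotated set."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    note = "Pt reports SOB and dry cough, feels tired."  # invented example note
    predicted = extract_symptoms(note)
    reference = {"dyspnea", "cough", "fever"}  # invented annotation
    print(sorted(predicted), f1_score(predicted, reference))
```

In this sketch the extracted label sets could then serve as binary features for an outcome model; how the study actually encoded features is not stated in the abstract.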

Issue

Vol. 72 (2021)

Section

Articles
