Knowledge Discovery through Linking Multiple Heterogeneous, Unstructured Data Streams: A Case of Clinical Notes Mining
[Thesis]
Alodadi, Mohammad Saad S.
Janeja, Vandana P.
University of Maryland, Baltimore County
2020
184
Ph.D.
The focus of this dissertation is the discovery of patterns of significant relationships across unstructured, heterogeneous data streams. One example domain is Electronic Health Records (EHR), in particular the text of clinical notes describing treatments, tests, and diagnoses. While large-scale automated analysis of unstructured data can be much more complicated than analysis of structured data, it can potentially provide better support and information for decision making. From the health care provider's perspective, the large number of medical tests per patient has further increased the complexity of determining an accurate diagnosis. In addition, the EHR should be able to support Clinical Decision Support (CDS) systems that provide health professionals with the most recent and relevant biomedical literature, improving decision making for a given patient's record. However, current EHR systems lack this ability, which creates a gap between advances in the biomedical domain and day-to-day practice within EHR systems. In this dissertation, we propose several methods to overcome these challenges. To extract knowledge from heterogeneous unstructured data, we propose a weighted association rule mining method that extracts significant entities from the unstructured data and generates weighted association rules among them. We also expand our data using ontology-based expansion. The discovered rules reveal non-trivial interdependencies that can help support practitioners' decisions, such as during clinical interventions. Our frequency-based methods generate rules that receive higher interestingness and relatedness ratings from a health provider.
Furthermore, in a temporal EHR use case, our method shows an increase in the number of rules in later days of a hospital admission, which mirrors the phenomenon of secondary diagnoses: conditions that coexist at the time of admission or that develop subsequently to the principal diagnosis. Our preliminary results with the proposed weighted transactional item representation are promising for identifying strongly related entities, for example medical entities (diagnoses, tests, treatments). To improve literature search for professionals, we propose and evaluate multiple query expansion and re-ranking methods. The query expansion methods rely on vocabularies and different embedding models. The re-ranking method relies on dual embedding retrieval indexes that focus on latent features, in contrast to standard explicit terms. We experiment with these methods on publicly available ad hoc information retrieval tasks in biomedical literature. The results show that combining latent features and explicit features increases the precision of the results. We also performed an extrinsic evaluation on multiple datasets (a biomedical literature database and health forum data) and found that our embedding model can be generalized to represent medical textual data such as biomedical literature, clinical notes, or medical social media data.
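
As a rough illustration of the weighted transactional item representation and weighted association rules described in the abstract, the following minimal sketch may help. It is not the dissertation's implementation. The toy notes, the entity weights, the choice of the minimum item weight as an itemset's weight within a note, and the function names `weighted_support` and `weighted_rules` are all assumptions made for this example.

```python
from itertools import combinations

# Hypothetical toy data: each transaction stands for one clinical note and maps
# an extracted entity to an assumed weight in [0, 1] (e.g. a normalized frequency).
notes = [
    {"diabetes": 1.0, "glucose test": 0.8, "insulin": 0.6},
    {"diabetes": 0.5, "insulin": 1.0},
    {"pneumonia": 1.0, "chest x-ray": 0.9},
    {"diabetes": 0.9, "glucose test": 0.7},
]


def weighted_support(itemset, transactions):
    """Weighted support of an itemset (assumed definition for this sketch).

    Each transaction that contains every item contributes the smallest weight
    among those items; the sum is divided by the number of transactions.
    """
    total = 0.0
    for t in transactions:
        if all(item in t for item in itemset):
            total += min(t[item] for item in itemset)
    return total / len(transactions)


def weighted_rules(transactions, min_support=0.2, min_confidence=0.6):
    """Return pairwise rules as (lhs, rhs, weighted support, weighted confidence)."""
    items = sorted({item for t in transactions for item in t})
    rules = []
    for a, b in combinations(items, 2):
        pair_support = weighted_support((a, b), transactions)
        if pair_support < min_support:
            continue  # prune pairs whose weighted support is too low
        for lhs, rhs in ((a, b), (b, a)):
            # Weighted confidence: support of the pair over support of the antecedent.
            confidence = pair_support / weighted_support((lhs,), transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


if __name__ == "__main__":
    for lhs, rhs, support, confidence in weighted_rules(notes):
        print(f"{lhs} -> {rhs}  support={support:.3f}  confidence={confidence:.3f}")
```

Under this assumed definition, an entity that appears only weakly in a note contributes less to a rule than one that dominates the note, which is one simple way that weighting can favor strongly related entities over merely co-occurring ones.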