Efficient and Interpretable Machine Learning Algorithms for Predictive Analyses in Metagenomic Data
[Thesis]
Rahman, Mohammad Arifur
Rangwala, Huzefa
George Mason University
2020
164 p.
Ph.D.
George Mason University
2020
Advancements in DNA sequencing technologies have enabled the direct investigation of the microbiome. The microbiome refers to all the microorganisms, e.g., bacteria and viruses, present as a community in a host. Researchers and clinicians have begun studying the role of these microorganisms in human health and disease. Most existing approaches first estimate the microbial abundance in a sample using sequence databases of known microorganisms and then use the abundance values as features for predicting diseases such as liver cirrhosis and type 2 diabetes. Taxonomic profiling and abundance quantification are computationally expensive, bias subsequent predictions, and ignore a large amount of the data produced by Next Generation Sequencing (NGS) technologies. Moreover, most microbes have not been laboratory-cultured and thus remain unknown, and existing approaches do not account for these novel and unknown microorganisms. The lack of efficient analytical methods that overcome these limitations impedes the identification of the presence and functions of microbial organisms within different clinical and environmental samples. Hence, there is a need to develop scalable analytical algorithms for large-scale DNA sequence data, i.e., metagenomic data, to discover the microbiome, perform taxonomic profiling, quantify species abundance and predict diseases. In this thesis, I develop Multiple Instance Learning (MIL) based algorithms to predict diseases from large-scale metagenomic data. MIL is a supervised classification approach that treats a single sample as a group of related data instances rather than as one single instance. In addition to predicting diseases, the proposed approaches can identify the individual microbial DNA sequences that are indicative of the diseases.
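The MIL framing described above can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes the standard MIL rule that a sample (bag) is positive when at least one of its reads (instances) scores positive, and it uses k-mer counts as hypothetical instance features. The names `kmer_features`, `instance_score`, and `predict_bag`, the k-mer length, and the hand-set weights are all illustrative assumptions.

```python
from itertools import product

K = 2  # k-mer length, chosen small for illustration (assumption)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_features(read):
    """Count every k-mer in one read; each read is one MIL instance."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
    return [counts[k] for k in KMERS]


def instance_score(features, weights, bias):
    """Linear score for a single read."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_bag(reads, weights, bias):
    """Max-pooling MIL: the bag score is the highest instance score.

    Returns (bag_label, index_of_top_scoring_read), so the read that
    drives the prediction can be inspected.
    """
    scores = [instance_score(kmer_features(r), weights, bias) for r in reads]
    top = max(range(len(scores)), key=scores.__getitem__)
    return scores[top] > 0, top


# Hypothetical hand-set model: only the k-mer "GG" carries weight.
weights = [1.0 if k == "GG" else 0.0 for k in KMERS]
bias = -1.5

sample = ["ACGT", "GGGG", "ATAT"]  # toy bag of reads from one sample
label, top_read = predict_bag(sample, weights, bias)
```

In this toy bag the second read contains "GG" three times, so it alone makes the sample positive; returning its index mirrors the idea of pointing to the individual sequences that are indicative of a disease. In practice the weights would be learned from labeled samples rather than set by hand.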
We hypothesize that an optimized solution to the MIL formulation of the problem will predict diseases more accurately than existing approaches by utilizing all of the available DNA sequence data and avoiding the inherent bias of the microbial profiling process. To ensure that the proposed algorithms scale to the large volume of input sequences obtained from a metagenomic sample, we propose efficient canopy-based clustering solutions that can be integrated within the prediction pipeline. We evaluate the proposed algorithms on several clinical benchmarks and show improved performance in predicting clinical phenotypes, while reporting results that clinicians can interpret and keeping the implementations scalable.
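The canopy-based clustering mentioned above can be sketched with the generic two-threshold canopy scheme. This is an illustration under stated assumptions, not the thesis's implementation: points stand in for per-read feature vectors, Manhattan distance stands in for a cheap distance measure, and the function names and thresholds `t1` (loose) and `t2` (tight) are hypothetical.

```python
def manhattan(a, b):
    """Cheap distance between two equal-length numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy_clustering(points, t1, t2, distance=manhattan):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate list); t1 must exceed t2.
    Returns a list of (center_index, member_indices) pairs.
    """
    if not t1 > t2 >= 0:
        raise ValueError("thresholds must satisfy t1 > t2 >= 0")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]  # take the next unassigned point as a center
        dists = {i: distance(points[center], points[i]) for i in remaining}
        # Every candidate within the loose threshold joins this canopy.
        members = [i for i in remaining if dists[i] <= t1]
        canopies.append((center, members))
        # Candidates within the tight threshold can no longer seed canopies.
        remaining = [i for i in remaining if dists[i] > t2]
    return canopies


# Toy 2-D points standing in for per-read feature vectors (assumption).
points = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
canopies = canopy_clustering(points, t1=4, t2=2)
```

Each pass computes distances only from one center to the remaining candidates, which is what makes canopies a cheap pre-grouping step ahead of a more expensive model. Canopies may overlap: here the point at index 2 falls in the first canopy and also seeds its own.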