Efficient and Interpretable Machine Learning Algorithms for Predictive Analyses in Metagenomic Data
General material designation
[Thesis]
First author
Rahman, Mohammad Arifur
Other contributors
Rangwala, Huzefa
Publication, distribution, etc.
Name of publisher, distributor, etc.
George Mason University
Date of publication, distribution, etc.
2020
General note
Text of note
164 p.
Dissertation (thesis) note
Dissertation details and type of degree
Ph.D.
Body granting the degree
George Mason University
Date of degree
2020
Summary or abstract note
Text of note
Advances in DNA sequencing technologies have enabled direct investigation of the microbiome. The microbiome refers to all the microorganisms (e.g., bacteria and viruses) present as a community in a host. Researchers and clinicians have begun studying the role of these microorganisms in human health and disease. Most existing approaches first estimate microbial abundance in a sample using sequence databases of known microorganisms and then use the abundance values as features for predicting diseases such as liver cirrhosis and type 2 diabetes. Taxonomic profiling and abundance quantification are computationally expensive, introduce bias into subsequent predictions and ignore a large amount of the data produced by Next Generation Sequencing (NGS) technologies. Moreover, most microbes have not been laboratory-cultured and thus remain unknown, and existing approaches do not account for novel and unknown microorganisms. The lack of efficient analytical methods that overcome these limitations impedes identification of the presence and functions of microbial organisms within different clinical and environmental samples. Hence, there is a need to develop scalable analytical algorithms for large-scale DNA sequence data (i.e., metagenomic data) to discover the microbiome, perform taxonomic profiling, quantify species abundance and predict diseases. In this thesis, I develop Multiple Instance Learning (MIL) based algorithms to predict diseases from large-scale metagenomic data. MIL is a supervised classification approach that treats a single sample as a group of relevant data instances rather than as one single instance. In addition to predicting diseases, the proposed approaches can identify the individual microbial DNA sequences that are indicative of the diseases.
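The MIL view of a metagenomic sample can be illustrated with a minimal sketch. This is not the thesis's algorithm. It assumes the common max-pooling form of MIL, in which a sample (a bag of reads) is called positive when its highest-scoring read exceeds a threshold. The k-mer features, the `weights` dictionary and the names `kmer_profile`, `instance_score` and `predict_bag` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def kmer_profile(read: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in one read (one MIL instance)."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def instance_score(read: str, weights: Dict[str, float], k: int = 3) -> float:
    """Linear score of one read. `weights` stands in for a trained model (assumption)."""
    profile = kmer_profile(read, k)
    return sum(weights.get(kmer, 0.0) * count for kmer, count in profile.items())


def predict_bag(
    reads: List[str],
    weights: Dict[str, float],
    k: int = 3,
    threshold: float = 0.0,
) -> Tuple[bool, str, float]:
    """Max-pooling MIL: the bag label is decided by its highest-scoring read.

    Returns (is_positive, most_indicative_read, its_score).
    """
    if not reads:
        raise ValueError("a bag must contain at least one read")
    scored = [(instance_score(read, weights, k), read) for read in reads]
    best_score, best_read = max(scored, key=lambda pair: pair[0])
    return best_score > threshold, best_read, best_score


# Toy usage: one sample (bag) with three short reads and made-up weights.
toy_weights = {"ACG": 1.0, "TTT": -0.5}
toy_sample = ["TTTTT", "AACGT", "GGGGG"]
label, read, score = predict_bag(toy_sample, toy_weights)
```

Returning the highest-scoring read alongside the bag label shows how an instance-level model can point to the sequences that drive a sample-level prediction, which is the kind of interpretability the abstract describes.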
We hypothesize that an optimized solution to the MIL formulation of the problem will predict diseases more accurately than existing approaches by utilizing the available DNA sequence data and avoiding the inherent bias from the microbial profiling process. To ensure that the proposed algorithms scale to the large volume of input sequences obtained from a metagenomic sample, we propose efficient canopy-based clustering solutions that can be integrated within the prediction pipeline. We evaluate the proposed algorithms on several clinical benchmarks and show improved prediction performance in identifying clinical phenotypes, interpretable results for clinicians and scalable implementations.
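For readers unfamiliar with canopy clustering, the sketch below shows the generic technique with a loose and a tight distance threshold. It is not the clustering method developed in the thesis. The Jaccard distance on k-mer sets, the threshold values and the function names are assumptions chosen only to keep the example small and self-contained.

```python
from typing import Callable, List


def kmer_set(read: str, k: int = 3) -> set:
    """Set of overlapping k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """Cheap distance between two reads: 1 - Jaccard similarity of their k-mer sets."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def canopy_clusters(
    items: List[str],
    distance: Callable[[str, str], float],
    loose: float,
    tight: float,
) -> List[List[str]]:
    """Generic canopy clustering.

    Repeatedly take the first remaining item as a center, put every remaining
    item within `loose` of it into a canopy, and drop every item within `tight`
    from further consideration. Items between the two thresholds stay available,
    so canopies may overlap.
    """
    if not 0.0 <= tight <= loose:
        raise ValueError("thresholds must satisfy 0 <= tight <= loose")
    remaining = list(range(len(items)))
    canopies: List[List[str]] = []
    while remaining:
        center = items[remaining[0]]
        dists = {i: distance(center, items[i]) for i in remaining}
        canopies.append([items[i] for i in remaining if dists[i] <= loose])
        # The center has distance 0 to itself, so it is always removed here.
        remaining = [i for i in remaining if dists[i] > tight]
    return canopies


# Toy usage with four short reads and illustrative thresholds.
toy_reads = ["ACGTAC", "ACGTAA", "TTTTGG", "TTTTGC"]
toy_canopies = canopy_clusters(toy_reads, jaccard_distance, loose=0.6, tight=0.55)
```

Because each pass compares one center against only the reads still remaining, a cheap distance like this can partition a large read set into coarse groups before any more expensive model is applied.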
Uncontrolled subject terms
Subject term
Artificial intelligence
Subject term
Bioinformatics
Subject term
Computer science
Subject term
Epidemiology
Subject term
Genetics
Subject term
Microbiology
Personal name as main entry (primary intellectual responsibility)