عنوان

Fast Edit Distance Calculation Methods for NGS Sequence Similarity

پدید آورنده

Islam, A. K. M. Tauhidul

موضوع

Computer science

رده

کتابخانه

مرکز و کتابخانه مطالعات اسلامی به زبان‌های اروپایی

محل استقرار

استان: قم ـ شهر: قم

تماس با کتابخانه : 32910706-025

شماره کتابشناسی ملی

شماره

TLpq2404057188

زبان اثر

زبان متن نوشتاري يا گفتاري و مانند آن

انگلیسی

عنوان و نام پديدآور

عنوان اصلي

Fast Edit Distance Calculation Methods for NGS Sequence Similarity

نام عام مواد

[Thesis]

نام نخستين پديدآور

Islam, A. K. M. Tauhidul

نام ساير پديدآوران

Pramanik, Sakti

وضعیت نشر و پخش و غیره

نام ناشر، پخش کننده و غيره

Michigan State University

تاریخ نشرو بخش و غیره

2020

مشخصات ظاهری

نام خاص و کميت اثر

136

یادداشتهای مربوط به پایان نامه ها

جزئيات پايان نامه و نوع درجه آن

Ph.D.

کسي که مدرک را اعطا کرده

Michigan State University

امتياز متن

2020

یادداشتهای مربوط به خلاصه یا چکیده

متن يادداشت

Sequence fragments generated from targeted regions of phylogenetic marker genes provide valuable insight in identifying and classifying organisms and inferring taxonomic hierarchies. In recent years, significant development in targeted gene fragment sequencing through Next Generation Sequencing (NGS) technologies has increased the necessity of efficient sequence similarity computation methods for very large numbers of pairs of NGS sequences. The edit distance has been widely used to determine the dissimilarity between pairs of strings. All the known methods for the edit distance calculation run in near quadratic time with respect to string lengths, and it may take days or weeks to compute distances between such large numbers of pairs of NGS sequences. To solve the performance bottleneck problem, faster edit distance approximation and bounded edit distance calculation methods have been proposed. Despite these efforts, the existing edit distance calculation methods are not fast enough when computing larger numbers of pairs of NGS sequences. In order to further reduce the computation time, many NGS sequence similarity methods have been proposed using matching k-mers. These methods extract all possible k-mers from NGS sequences and compare similarity between pairs of sequences based on the shared k-mers. However, these methods reduce the computation time at the cost accuracy. In this dissertation, our goal is to compute NGS sequence similarity using edit distance based methods while reducing the computation time. We propose a few edit distance prediction methods using dataset independent reference sequences that are distant from each other. These reference sequences convert sequences in datasets into feature vectors by computing edit distances between the sequence and each of the reference sequences. Given sequences A, B and a reference sequence r, the edit distance, ed(A.B) ≥ (ed(A, r) ~ ed(B, r)). Since each reference sequence is significantly different from each other, with sufficiently large number of reference sequences and high similarity threshold, the differences of edit distances of A and B with respect to the reference sequences are close to the ed(A,B). Using this property, we predict edit distances in the vector space based on the Euclidean distances and the Chebyshev distances. Further, we develop a small set of deterministically generated reference sequences with maximum distance between each of them to predict higher edit distances more efficiently. This method predicts edit distances between corresponding sub-sequences separately and then merges the partial distances to predict the edit distances between the entire sequences. The computation complexity of this method is linear with respect to sequence length. The proposed edit distance prediction methods are significantly fast while achieving very good accuracy for high similarity thresholds. We have also shown the effectiveness of these methods on agglomerative hierarchical clustering. We also propose an efficient bounded exact edit distance calculation method using the trace [1]. For a given edit distance threshold d, only letters up to d positions apart can be part of an edit operation. Hence, we generate pairs of sub-sequences up to length difference d so that no edit operation is spilled over to the adjacent pairs of sub-sequences. Then we compute the trace cost in such a way that the number of matching letters between the sub-sequences are maximized. This technique does not guarantee locally optimal edit distance, however, it guarantees globally optimal edit distance between the entire sequences for distance up to d. The bounded exact edit distance calculation method is an order of magnitude faster than that of the dynamic programming edit distance calculation method.

موضوع (اسم عام یاعبارت اسمی عام)

موضوع مستند نشده

Computer science

نام شخص به منزله سر شناسه - (مسئولیت معنوی درجه اول )

مستند نام اشخاص تاييد نشده

Islam, A. K. M. Tauhidul

مستند نام اشخاص تاييد نشده

Pramanik, Sakti

دسترسی و محل الکترونیکی

نام الکترونيکي

وضعیت انتشار

فرمت انتشار

اطلاعات رکورد کتابشناسی

نوع ماده

[Thesis]

کد کاربرگه

276903

اطلاعات دسترسی رکورد

سطح دسترسي

تكميل شده

عنوان Fast Edit Distance Calculation Methods for NGS Sequence Similarity

پدید آورنده Islam, A. K. M. Tauhidul

موضوع Computer science

رده

کتابخانه مرکز و کتابخانه مطالعات اسلامی به زبان‌های اروپایی

محل استقرار استان: قم ـ شهر: قم