عنوان

Fast Edit Distance Calculation Methods for NGS Sequence Similarity

پدید آورنده

Islam, A. K. M. Tauhidul

موضوع

Computer science

رده

کتابخانه

Center and Library of Islamic Studies in European Languages

محل استقرار

استان: Qom ـ شهر: Qom

تماس با کتابخانه : 32910706-025

NATIONAL BIBLIOGRAPHY NUMBER

Number

TL52548

LANGUAGE OF THE ITEM

.Language of Text, Soundtrack etc

انگلیسی

TITLE AND STATEMENT OF RESPONSIBILITY

Title Proper

Fast Edit Distance Calculation Methods for NGS Sequence Similarity

General Material Designation

[Thesis]

First Statement of Responsibility

Islam, A. K. M. Tauhidul

Subsequent Statement of Responsibility

Pramanik, Sakti

.PUBLICATION, DISTRIBUTION, ETC

Name of Publisher, Distributor, etc.

Michigan State University

Date of Publication, Distribution, etc.

2020

GENERAL NOTES

Text of Note

136 p.

DISSERTATION (THESIS) NOTE

Dissertation or thesis details and type of degree

Ph.D.

Body granting the degree

Michigan State University

Text preceding or following the note

2020

SUMMARY OR ABSTRACT

Text of Note

Sequence fragments generated from targeted regions of phylogenetic marker genes provide valuable insight in identifying and classifying organisms and inferring taxonomic hierarchies. In recent years, significant development in targeted gene fragment sequencing through Next Generation Sequencing (NGS) technologies has increased the necessity of efficient sequence similarity computation methods for very large numbers of pairs of NGS sequences. The edit distance has been widely used to determine the dissimilarity between pairs of strings. All the known methods for the edit distance calculation run in near quadratic time with respect to string lengths, and it may take days or weeks to compute distances between such large numbers of pairs of NGS sequences. To solve the performance bottleneck problem, faster edit distance approximation and bounded edit distance calculation methods have been proposed. Despite these efforts, the existing edit distance calculation methods are not fast enough when computing larger numbers of pairs of NGS sequences. In order to further reduce the computation time, many NGS sequence similarity methods have been proposed using matching k-mers. These methods extract all possible k-mers from NGS sequences and compare similarity between pairs of sequences based on the shared k-mers. However, these methods reduce the computation time at the cost accuracy. In this dissertation, our goal is to compute NGS sequence similarity using edit distance based methods while reducing the computation time. We propose a few edit distance prediction methods using dataset independent reference sequences that are distant from each other. These reference sequences convert sequences in datasets into feature vectors by computing edit distances between the sequence and each of the reference sequences. Given sequences A, B and a reference sequence r, the edit distance, ed(A.B) ≥ (ed(A, r) ~ ed(B, r)). Since each reference sequence is significantly different from each other, with sufficiently large number of reference sequences and high similarity threshold, the differences of edit distances of A and B with respect to the reference sequences are close to the ed(A,B). Using this property, we predict edit distances in the vector space based on the Euclidean distances and the Chebyshev distances. Further, we develop a small set of deterministically generated reference sequences with maximum distance between each of them to predict higher edit distances more efficiently. This method predicts edit distances between corresponding sub-sequences separately and then merges the partial distances to predict the edit distances between the entire sequences. The computation complexity of this method is linear with respect to sequence length. The proposed edit distance prediction methods are significantly fast while achieving very good accuracy for high similarity thresholds. We have also shown the effectiveness of these methods on agglomerative hierarchical clustering. We also propose an efficient bounded exact edit distance calculation method using the trace [1]. For a given edit distance threshold d, only letters up to d positions apart can be part of an edit operation. Hence, we generate pairs of sub-sequences up to length difference d so that no edit operation is spilled over to the adjacent pairs of sub-sequences. Then we compute the trace cost in such a way that the number of matching letters between the sub-sequences are maximized. This technique does not guarantee locally optimal edit distance, however, it guarantees globally optimal edit distance between the entire sequences for distance up to d. The bounded exact edit distance calculation method is an order of magnitude faster than that of the dynamic programming edit distance calculation method.

UNCONTROLLED SUBJECT TERMS

Subject Term

Computer science

PERSONAL NAME - PRIMARY RESPONSIBILITY

Islam, A. K. M. Tauhidul

PERSONAL NAME - SECONDARY RESPONSIBILITY

Pramanik, Sakti

CORPORATE BODY NAME - SECONDARY RESPONSIBILITY

Michigan State University

ELECTRONIC LOCATION AND ACCESS

Electronic name

[Thesis]

276903

عنوان Fast Edit Distance Calculation Methods for NGS Sequence Similarity

پدید آورنده Islam, A. K. M. Tauhidul

موضوع Computer science

رده

کتابخانه Center and Library of Islamic Studies in European Languages

محل استقرار استان: Qom ـ شهر: Qom

NATIONAL BIBLIOGRAPHY NUMBER

LANGUAGE OF THE ITEM

TITLE AND STATEMENT OF RESPONSIBILITY

.PUBLICATION, DISTRIBUTION, ETC

GENERAL NOTES

DISSERTATION (THESIS) NOTE

SUMMARY OR ABSTRACT

UNCONTROLLED SUBJECT TERMS

PERSONAL NAME - PRIMARY RESPONSIBILITY

PERSONAL NAME - SECONDARY RESPONSIBILITY

CORPORATE BODY NAME - SECONDARY RESPONSIBILITY

ELECTRONIC LOCATION AND ACCESS

عنوان

Fast Edit Distance Calculation Methods for NGS Sequence Similarity

پدید آورنده

Islam, A. K. M. Tauhidul

موضوع

Computer science

کتابخانه

Center and Library of Islamic Studies in European Languages

محل استقرار

استان: Qom ـ شهر: Qom