• Home
  • Advanced Search
  • Directory of Libraries
  • About lib.ir
  • Contact Us
  • History

عنوان
Fast Edit Distance Calculation Methods for NGS Sequence Similarity

پدید آورنده
Islam, A. K. M. Tauhidul

موضوع
Computer science

رده

کتابخانه
Center and Library of Islamic Studies in European Languages

محل استقرار
استان: Qom ـ شهر: Qom

Center and Library of Islamic Studies in European Languages

تماس با کتابخانه : 32910706-025

NATIONAL BIBLIOGRAPHY NUMBER

Number
TL52548

LANGUAGE OF THE ITEM

.Language of Text, Soundtrack etc
انگلیسی

TITLE AND STATEMENT OF RESPONSIBILITY

Title Proper
Fast Edit Distance Calculation Methods for NGS Sequence Similarity
General Material Designation
[Thesis]
First Statement of Responsibility
Islam, A. K. M. Tauhidul
Subsequent Statement of Responsibility
Pramanik, Sakti

.PUBLICATION, DISTRIBUTION, ETC

Name of Publisher, Distributor, etc.
Michigan State University
Date of Publication, Distribution, etc.
2020

GENERAL NOTES

Text of Note
136 p.

DISSERTATION (THESIS) NOTE

Dissertation or thesis details and type of degree
Ph.D.
Body granting the degree
Michigan State University
Text preceding or following the note
2020

SUMMARY OR ABSTRACT

Text of Note
Sequence fragments generated from targeted regions of phylogenetic marker genes provide valuable insight in identifying and classifying organisms and inferring taxonomic hierarchies. In recent years, significant development in targeted gene fragment sequencing through Next Generation Sequencing (NGS) technologies has increased the necessity of efficient sequence similarity computation methods for very large numbers of pairs of NGS sequences. The edit distance has been widely used to determine the dissimilarity between pairs of strings. All the known methods for the edit distance calculation run in near quadratic time with respect to string lengths, and it may take days or weeks to compute distances between such large numbers of pairs of NGS sequences. To solve the performance bottleneck problem, faster edit distance approximation and bounded edit distance calculation methods have been proposed. Despite these efforts, the existing edit distance calculation methods are not fast enough when computing larger numbers of pairs of NGS sequences. In order to further reduce the computation time, many NGS sequence similarity methods have been proposed using matching k-mers. These methods extract all possible k-mers from NGS sequences and compare similarity between pairs of sequences based on the shared k-mers. However, these methods reduce the computation time at the cost accuracy. In this dissertation, our goal is to compute NGS sequence similarity using edit distance based methods while reducing the computation time. We propose a few edit distance prediction methods using dataset independent reference sequences that are distant from each other. These reference sequences convert sequences in datasets into feature vectors by computing edit distances between the sequence and each of the reference sequences. Given sequences A, B and a reference sequence r, the edit distance, ed(A.B) ≥ (ed(A, r) ~ ed(B, r)). Since each reference sequence is significantly different from each other, with sufficiently large number of reference sequences and high similarity threshold, the differences of edit distances of A and B with respect to the reference sequences are close to the ed(A,B). Using this property, we predict edit distances in the vector space based on the Euclidean distances and the Chebyshev distances. Further, we develop a small set of deterministically generated reference sequences with maximum distance between each of them to predict higher edit distances more efficiently. This method predicts edit distances between corresponding sub-sequences separately and then merges the partial distances to predict the edit distances between the entire sequences. The computation complexity of this method is linear with respect to sequence length. The proposed edit distance prediction methods are significantly fast while achieving very good accuracy for high similarity thresholds. We have also shown the effectiveness of these methods on agglomerative hierarchical clustering. We also propose an efficient bounded exact edit distance calculation method using the trace [1]. For a given edit distance threshold d, only letters up to d positions apart can be part of an edit operation. Hence, we generate pairs of sub-sequences up to length difference d so that no edit operation is spilled over to the adjacent pairs of sub-sequences. Then we compute the trace cost in such a way that the number of matching letters between the sub-sequences are maximized. This technique does not guarantee locally optimal edit distance, however, it guarantees globally optimal edit distance between the entire sequences for distance up to d. The bounded exact edit distance calculation method is an order of magnitude faster than that of the dynamic programming edit distance calculation method.

UNCONTROLLED SUBJECT TERMS

Subject Term
Computer science

PERSONAL NAME - PRIMARY RESPONSIBILITY

Islam, A. K. M. Tauhidul

PERSONAL NAME - SECONDARY RESPONSIBILITY

Pramanik, Sakti

CORPORATE BODY NAME - SECONDARY RESPONSIBILITY

Michigan State University

ELECTRONIC LOCATION AND ACCESS

Electronic name
 مطالعه متن کتاب 

p

[Thesis]
276903

a
Y

Proposal/Bug Report

Warning! Enter The Information Carefully
Send Cancel
This website is managed by Dar Al-Hadith Scientific-Cultural Institute and Computer Research Center of Islamic Sciences (also known as Noor)
Libraries are responsible for the validity of information, and the spiritual rights of information are reserved for them
Best Searcher - The 5th Digital Media Festival