An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures
[Thesis]
Salha Hassan Muhammed Qahl
Fokoue, Ernest
Rochester Institute of Technology
2014
104
Committee members: Chen, Linlin; Parody, Robert
Place of publication: United States, Ann Arbor; ISBN=978-1-321-40085-4
M.S.
Applied Statistics
Rochester Institute of Technology
2014
Is there any similarity between the contexts of the Holy Bible and the Holy Quran, and can this be proven mathematically? Using the Bible and the Quran as our corpus, we explore the performance of various feature extraction and machine learning techniques. The unstructured nature of text data adds an extra layer of complexity to the feature extraction task, and the inherently sparse nature of the corresponding data matrices makes text mining distinctly difficult. Among other things, we assess the difference between domain-based syntactic feature extraction and domain-free feature extraction, and then use a variety of similarity measures, including Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square, and Clark, to identify similarities and differences between sacred texts.
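As an illustration of how several of the similarity measures named in the abstract behave, the following is a minimal sketch that assumes each text is reduced to relative term frequencies over a shared vocabulary. The function names, the whitespace tokenization, and the two toy documents are assumptions made for this example; they are not taken from the thesis and do not reproduce its feature extraction or results.

```python
import math
from collections import Counter


def term_distribution(text, vocabulary):
    """Relative term frequencies of `text` over a fixed vocabulary.

    Assumes at least one vocabulary word occurs in `text`.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total for w in vocabulary]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def hellinger(p, q):
    # Distance between probability vectors, bounded in [0, 1].
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def kullback_leibler(p, q):
    # Terms with p_i == 0 contribute nothing; q_i must be > 0 where p_i > 0.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    # Symmetric and finite even for sparse vectors, because the mixture m
    # is positive wherever p or q is positive.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kullback_leibler(p, m) + 0.5 * kullback_leibler(q, m)


# Hypothetical toy documents, not quotations from any sacred text.
doc_a = "mercy light mercy path"
doc_b = "light path path truth"
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

p = term_distribution(doc_a, vocabulary)
q = term_distribution(doc_b, vocabulary)

for name, measure in [
    ("Euclidean", euclidean),
    ("Manhattan", manhattan),
    ("cosine similarity", cosine_similarity),
    ("Hellinger", hellinger),
    ("Jensen-Shannon", jensen_shannon),
]:
    print(f"{name}: {measure(p, q):.4f}")
```

The plain Kullback-Leibler divergence is undefined when a term appears in one text but not the other, which is common with sparse term vectors; this is one reason a sketch like this one routes it through the Jensen-Shannon mixture rather than applying it directly.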
Mathematics; Statistics; Computer science
Pure sciences; Applied sciences; Data mining; Machine learning; Sacred texts; Similarity measures