The development of a framework for semantic similarity measures for the Arabic language
General Material Designation
[Thesis]
First Statement of Responsibility
Al-Marsoomi, Faaza Abduljabar
.PUBLICATION, DISTRIBUTION, ETC
Name of Publisher, Distributor, etc.
Manchester Metropolitan University
Date of Publication, Distribution, etc.
2015
DISSERTATION (THESIS) NOTE
Dissertation or thesis details and type of degree
Thesis (Ph.D.)
Text preceding or following the note
2015
SUMMARY OR ABSTRACT
Text of Note
This thesis presents a novel framework for developing an Arabic Short Text Semantic Similarity (STSS) measure, namely that of NasTa. STSS measures are developed for short texts of 10 -25 words long. The algorithm calculates the STSS based on Part of Speech (POS), Arabic Word Sense Disambiguation (WSD), semantic nets and corpus statistics. The proposed framework is founded on word similarity measures. Firstly, a novel Arabic noun similarity measure is created using information sources extracted from a lexical database known as Arabic WordNet. Secondly, a novel verb similarity algorithm is created based on the assumption that words sharing a common root usually have a related meaning which is a central characteristic of Arabic language. Two Arabic word benchmark datasets, noun and verb are created to evaluate them. These are the first of their kinds for Arabic. Their creation methodologies use the best available experimental techniques to create materials and collect human ratings from representative samples of the Arabic speaking population. Experimental evaluation indicates that the Arabic noun and the Arabic verb measures performed well and achieved good correlations comparison with the average human performance on the noun and verb benchmark datasets respectively. Specific features of the Arabic language are addressed. A new Arabic WSD algorithm is created to address the challenge of ambiguity caused by missing diacritics in the contemporary Arabic writing system. The algorithm disambiguates all words (nouns and verbs) in the Arabic short texts without requiring any manual training data. Moreover, a novel algorithm is presented to identify the similarity score between two words belonging to different POS, either a pair comprising a noun and verb or a verb and noun. This algorithm is developed to perform Arabic WSD based on the concept of noun semantic similarity. Important benchmark datasets for text similarity are presented: ASTSS-68 and ASTSS-21. Experimental results indicate that the performance of the Arabic STSS algorithm achieved a good correlation comparison with the average human performance on ASTSS-68 which was statistically significant.