Compression-Based Parts-Of-Speech Tagger for the Arabic Language
General material designation
[Thesis]
Name of first author
Alkhazi, Ibrahim Sulaiman B.
Publication, distribution, etc.
Name of publisher, distributor, etc.
Bangor University (United Kingdom)
Date of publication, distribution, etc.
2019
General note
Text of note
160 p.
Dissertation notes
Dissertation details and degree type
Ph.D.
Degree-granting body
Bangor University (United Kingdom)
Date of degree
2019
Summary or abstract notes
Text of note
Arabic is a morphologically complex language, which poses difficulties for NLP tasks such as POS tagging. The aim of this research is to investigate the development and training of a compression-based Arabic POS tagger using the PPM algorithm. Adopting this algorithm for Arabic POS tagging may improve efficiency and reduce the ambiguity problem in Arabic. The best text compression algorithms can often be applied to NLP tasks with state-of-the-art results. This research examines tag-based compression of larger Arabic resources to re-evaluate its performance, which may reveal linguistic aspects of Arabic parts of speech. We also found that tag-based compression of Arabic text can serve as a means of evaluating the performance and quality of Arabic POS taggers. The experimental results show that tag-based compression can effectively be used to assess the performance of an Arabic POS tagger on different types of Arabic text, and to compare the performance of two Arabic POS taggers on the same text.
Despite the rapid growth of Arabic text on the Web, studies that address the classification and segmentation of Arabic remain limited compared to those for other languages, and most of them implement word-based and feature-extraction algorithms. This research adopts a PPM character-based compression scheme to classify and segment Classical Arabic (CA) and Modern Standard Arabic (MSA) texts. An initial experiment applying the PPM classification method, based on the concept of minimum cross-entropy, to samples of text achieved an accuracy of 95.5%, an average precision of 0.958, an average recall of 0.955 and an average F-measure of 0.954. Segmenting the CA and MSA text using the PPM compression algorithm achieved an accuracy of 86%, an average precision of 0.869, an average recall of 0.86 and an average F-measure of 0.859.
This research also describes the creation of the new Bangor Arabic Annotated Corpus (BAAC), a Modern Standard Arabic (MSA) corpus comprising 50K words manually annotated with part-of-speech tags. To evaluate the quality of the corpus, the Kappa coefficient and a direct percent agreement for each tag were calculated; a Kappa value of 0.956 was obtained, with an average observed agreement of 94.25%. The corpus was used to evaluate the widely used Madamira Arabic POS tagger and to further investigate compression models for text compressed using POS tags. In addition, a new annotation tool was developed and employed for the annotation of the BAAC.
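Note: the minimum cross-entropy classification mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a character bigram model with add-one smoothing as a simple stand-in for PPM, and the function names (`train_bigram_model`, `cross_entropy`, `classify`) are hypothetical. The idea is to train one model per class, measure how many bits per character each model needs to encode the unseen text, and pick the class whose model needs the fewest.

```python
import math
from collections import Counter


def train_bigram_model(text):
    """Count character bigrams and their one-character contexts.

    Assumption: a bigram model stands in for PPM, which blends
    several context orders using an escape mechanism.
    """
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    return pair_counts, context_counts


def cross_entropy(model, text, alphabet_size):
    """Average bits per character of `text` under `model` (add-one smoothing)."""
    pair_counts, context_counts = model
    total_bits = 0.0
    n = 0
    for prev, cur in zip(text, text[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + alphabet_size)
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n if n else float("inf")


def classify(text, training_texts):
    """Return the label whose model gives the lowest cross-entropy, plus all scores."""
    alphabet = set(text)
    for sample in training_texts.values():
        alphabet |= set(sample)
    size = len(alphabet)
    scores = {
        label: cross_entropy(train_bigram_model(sample), text, size)
        for label, sample in training_texts.items()
    }
    return min(scores, key=scores.get), scores
```

In use, `training_texts` would map class labels such as "CA" and "MSA" to training text for each class, and `classify(sample, training_texts)` would return the label whose model compresses the sample best.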
Uncontrolled subject terms
Subject term
Computer science
Subject term
Language
Personal name as main entry (primary intellectual responsibility)