Title

Categorisation of Arabic Twitter text

Author

Altamimi, Mohammed Hamed R.

Subject

Natural Language Processing ; Machine Learning ; Prediction by Partial Matching

Classification

Library

Center and Library for Islamic Studies in European Languages

Location

Province: Qom, City: Qom

Library contact: 025-32910706

National bibliography number

Number

TLets801336

Title and statement of responsibility

Title proper

Categorisation of Arabic Twitter text

General material designation

[Thesis]

First statement of responsibility

Altamimi, Mohammed Hamed R.

Other contributors

Teahan, William

Publication, distribution, etc.

Name of publisher, distributor, etc.

Bangor University

Date of publication, distribution, etc.

2020

Thesis notes

Thesis details and degree type

Thesis (Ph.D.)

Text rating

2020

Summary or abstract notes

Text of note

The shortage of Arabic language resources in corpus linguistics, compared with other popular languages such as English, Chinese and Spanish, inspired this work. Research on dialectal Arabic is still limited because resources are relatively unavailable and creating and processing such corpora is time-consuming. This thesis introduces the Bangor Twitter Arabic corpus (BTAC), which was created specifically from Arabic Twitter text. The corpus contains over 122K tweets. The tweets were manually annotated with five main dialects (Egyptian, Gulf, Iraqi, Maghrebi, and Levantine), in addition to Modern Standard Arabic and Classical Arabic. The resource also identifies written code-switching within a single tweet, which occurs between Modern Standard Arabic and Arabic dialects. This thesis evaluates various methods for the categorisation of Arabic Twitter text. Three main categorisation tasks are considered: authorship attribution, gender categorisation, and dialect identification. The experiments are performed using the Prediction by Partial Matching (PPM) character-based text compression approach. For comparison, well-known algorithms using character-based and feature-based approaches were selected: Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and Support Vector Machine (SVM). The results show that PPM outperforms traditional feature-based classifiers in most cases in terms of accuracy, precision, recall and F-measure. When classifying multiple tweets per author, the reported results achieved an accuracy of 88% for gender categorisation, 96% for authorship attribution, and 87% for dialect identification. For single-tweet text categorisation, the results achieved an accuracy of 76% for gender categorisation, 77% for authorship attribution, and 74% for dialect identification.
Further optimisation using concatenated author models as the secondary class type improved the classification accuracy for both the gender and dialect experiments, achieving an accuracy of 97% for gender categorisation and 98% for dialect identification. We also investigated code-switching, which often occurs in text acquired from social media. In this study we investigated code-switching between two variant linguistic systems of one language (Modern Standard Arabic and Arabic dialects). The purpose of the experiment was to detect the switch at the character level. An accuracy of 81.2% for detecting code-switching was obtained using 5-fold cross-validation on the full BTAC dataset.

Subject (topical name or common noun phrase)

Uncontrolled subject term

Natural Language Processing ; Machine Learning ; Prediction by Partial Matching

Personal name as main entry (primary intellectual responsibility)

Personal name authority not verified

Altamimi, Mohammed Hamed R.

Personal name (secondary intellectual responsibility)

Personal name authority not verified

Teahan, William

Added entry (corporate body)

Corporate body name authority not verified

Bangor University

Electronic location and access

Electronic name

Publication status

Publication format

Bibliographic record information

Material type

[Thesis]

Worksheet code

276903

Record access information

Access level

Completed
