Title

Enhanced root extraction and document classification algorithm for Arabic text

Author

Alsaad, Amal

Subject

Data mining ; Machine learning ; Information retrieval

Classification

Library

Center and Library of Islamic Studies in European Languages

Location

Province: Qom, City: Qom

Library contact: 32910706-025

NATIONAL BIBLIOGRAPHY NUMBER

Number

TLets699281

TITLE AND STATEMENT OF RESPONSIBILITY

Title Proper

Enhanced root extraction and document classification algorithm for Arabic text

General Material Designation

[Thesis]

First Statement of Responsibility

Alsaad, Amal

Subsequent Statement of Responsibility

Abbod, M.

PUBLICATION, DISTRIBUTION, ETC.

Name of Publisher, Distributor, etc.

Brunel University London

Date of Publication, Distribution, etc.

2016

DISSERTATION (THESIS) NOTE

Dissertation or thesis details and type of degree

Thesis (Ph.D.)

Text preceding or following the note

2016

SUMMARY OR ABSTRACT

Text of Note

Many text extraction and classification systems have been developed for English and other international languages, most of which are written in the Latin alphabet. Arabic, however, is a Semitic language with special rules and a morphology more complicated than that of English, and few systems have been developed for Arabic text categorization. Because of this complex morphology, pre-processing routines are needed to extract the roots of words and then classify them according to their group of acts or meaning. In this thesis, a system for text classification has been developed and tested. The system has two stages: the first extracts the roots from the text, and the second classifies the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase removes affixes, including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, and the word's morphological pattern is checked after each deduction in order to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as Arabic texts contain a substantial number of irregular words. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quadriliteral roots. A series of experiments has been conducted to improve and test the performance of the proposed algorithm. The results show that the developed algorithm is more accurate than the existing stemming algorithm.
The second stage is document classification. In this stage, two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machine (SVM). The system covers six categories: culture, economy, international, local, religion and sports. It is trained on 80% of the available data, with the 10 most frequent terms from each category selected as features, and the classification algorithms are tested on the remaining 20% of the documents. The results of ANN and SVM are compared with the standard method used for text classification, the term-frequency-based method. ANN and SVM achieve better accuracy (80-90%) than the standard method (60-70%). The proposed method demonstrates the ability to assign Arabic text documents to the appropriate categories with a high precision rate.
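The feature-selection step described in the abstract (the most frequent terms per category, an 80/20 split, and a term-frequency baseline) might be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the function names, the transliterated toy roots, the sequential (non-random) split, and the baseline's scoring rule are all assumptions, since the abstract does not specify them.

```python
from collections import Counter

def split_80_20(docs):
    """Split documents into 80% training and 20% test (sequential split, an assumption)."""
    cut = int(len(docs) * 0.8)
    return docs[:cut], docs[cut:]

def top_terms_per_category(train_docs, k=10):
    """Map each category to its k most frequent terms.

    train_docs is a list of (tokens, category) pairs, where tokens are
    assumed to be roots already produced by the root extraction stage.
    """
    counts = {}
    for tokens, category in train_docs:
        counts.setdefault(category, Counter()).update(tokens)
    return {cat: [term for term, _ in c.most_common(k)] for cat, c in counts.items()}

def build_vocabulary(top_terms):
    """Union of the per-category top terms, in a stable order."""
    vocab = []
    for category in sorted(top_terms):
        for term in top_terms[category]:
            if term not in vocab:
                vocab.append(term)
    return vocab

def to_feature_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def frequency_baseline(tokens, top_terms):
    """Hypothetical baseline: pick the category whose top terms occur most often."""
    counts = Counter(tokens)
    scores = {cat: sum(counts[t] for t in terms) for cat, terms in top_terms.items()}
    # Ties are broken alphabetically by category name.
    return max(sorted(scores), key=scores.get)

# Toy data: transliterated placeholder roots, not taken from the thesis corpus.
docs = [
    (["ktb", "qra", "ktb"], "culture"),
    (["lEb", "fwz", "lEb"], "sports"),
    (["ktb", "qra", "drs"], "culture"),
    (["fwz", "lEb", "hdf"], "sports"),
    (["ktb", "qra"], "culture"),
]
train, test = split_80_20(docs)
top_terms = top_terms_per_category(train, k=2)
vocab = build_vocabulary(top_terms)
for tokens, label in test:
    print(label, frequency_baseline(tokens, top_terms), to_feature_vector(tokens, vocab))
```

The vectors returned by `to_feature_vector` are the kind of input an ANN or SVM classifier would be trained on; the baseline uses the same term counts directly, without a learned model.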

TOPICAL NAME USED AS SUBJECT

Data mining ; Machine learning ; Information retrieval

PERSONAL NAME - PRIMARY RESPONSIBILITY

Alsaad, Amal

PERSONAL NAME - SECONDARY RESPONSIBILITY

Abbod, M.

CORPORATE BODY NAME - SECONDARY RESPONSIBILITY

Brunel University

ELECTRONIC LOCATION AND ACCESS

Electronic name