Morphological Tagging and Disambiguation in Dialectal Arabic Using Deep Learning Architectures
General Material Designation
[Thesis]
First Statement of Responsibility
Zalmout, Nasser
Subsequent Statement of Responsibility
Habash, Nizar
PUBLICATION, DISTRIBUTION, ETC.
Name of Publisher, Distributor, etc.
New York University Tandon School of Engineering
Date of Publication, Distribution, etc.
2020
PHYSICAL DESCRIPTION
Specific Material Designation and Extent of Item
185
DISSERTATION (THESIS) NOTE
Dissertation or thesis details and type of degree
Ph.D.
Body granting the degree
New York University Tandon School of Engineering
Text preceding or following the note
2020
SUMMARY OR ABSTRACT
Text of Note
Morphology is the study of the internal structure of words and how it interacts with orthography, phonology, semantics, and syntax. Morphological tagging and disambiguation are integral Natural Language Processing (NLP) tasks and are often used as enabling technologies for higher-order NLP models. While many recent NLP systems model morphology implicitly as part of an end-to-end system, explicit morphological modeling is particularly important for low-resource languages and dialects, for which morphology is more difficult to learn implicitly. Different languages have different morphological systems, which hinders the direct application of models developed for one language to another. Some languages, like Arabic, are characterized as morphologically rich, having a high number of inflections for a given base word, compared to morphologically poorer languages with fewer inflections, like English. The Arabic language is also diglossic in nature: Modern Standard Arabic (MSA) is the high register of the language, commonly used in news reports and official correspondence, while Dialectal Arabic (DA) is the low register and is often used in daily interactions. Despite being used for official matters, MSA is not the native language of any Arabic speaker. DA is mainly spoken and rarely written; it therefore lacks standardized orthography guidelines. Social media platforms have dramatically increased the volume of written DA content. However, the lack of standardized orthography guidelines renders much of this content highly sparse and inconsistent. Annotated DA resources are also still very limited for most dialects, which further complicates NLP tasks in general. Morphological modeling for DA is particularly challenging, since all of these challenges are further aggravated by DA's morphological richness, severely reducing the accuracy of DA morphological models.
Due to these challenges, morphological modeling in DA is an active area of research that has recently been receiving increasing attention. In this dissertation, we address these challenges with various approaches, making use of recent breakthroughs in deep learning models. We develop several morphological modeling architectures that are noise-robust and capable of handling morphologically rich languages. We also present an architecture that models the different morphological features jointly, which enhances the overall modeling capacity and produces more consistent morphological analyses. This reduces the reliance on expensive external resources like morphological analyzers, on which previous morphological modeling systems for Arabic depend heavily. We also present several cross-dialectal and transfer-learning techniques that capitalize on the high-resource MSA to help lower-resource dialects. To this end, we treat MSA as another dialect of Arabic, one that happens to be high-resource and to have well-defined orthography standards. Our models provide state-of-the-art morphological modeling results for MSA and three other dialects: Egyptian, Emirati, and Levantine Arabic.