Title

The computational analysis of morphosyntactic categories in Urdu

Creator

Hardie, Andrew

Subject

P Philology. Linguistics

Classification

Library

Center and Library of Islamic Studies in European Languages

Location

Province: Qom - City: Qom

Library contact: 32910706-025

NATIONAL BIBLIOGRAPHY NUMBER

Number

TLets420555

TITLE AND STATEMENT OF RESPONSIBILITY

Title Proper

The computational analysis of morphosyntactic categories in Urdu

General Material Designation

[Thesis]

First Statement of Responsibility

Hardie, Andrew

PUBLICATION, DISTRIBUTION, ETC.

Name of Publisher, Distributor, etc.

Lancaster University

Date of Publication, Distribution, etc.

2004

DISSERTATION (THESIS) NOTE

Dissertation or thesis details and type of degree

Thesis (Ph.D.)

Text preceding or following the note

2004

SUMMARY OR ABSTRACT

Text of Note

Urdu is a language of the Indo-Aryan family, widely spoken in India and Pakistan, and an important minority language in Europe, North America, and elsewhere. This thesis describes the development of a computer-based system for part-of-speech tagging of Urdu texts, consisting of a tagset, a set of tagging guidelines for manual tagging or post-editing, and the tagger itself. The tagset is defined in accordance with a set of design principles derived from a survey of good practice in the field of tagset design, including compliance with the EAGLES guidelines on morphosyntactic annotation. These guidelines are shown to be extensible to languages, such as Urdu, that are closely related to the languages for which they were originally devised. The description of Urdu grammar given by Schmidt (1999) is used as a model of the language for the purpose of tagset design. Manual tagging is undertaken using this tagset, a process that yields both a set of tagging guidelines and a set of manually tagged texts to serve as training data. A rule-based methodology is used to perform tagging in Urdu, and the justification for this choice is discussed. A suite of programs that function together within the Unitag architecture is described. As well as a tokeniser, this system includes an analyser (Urdutag) based on lexical look-up and word-form analysis, and a disambiguator (Unirule) which removes contextually inappropriate tags using a set of 274 rules. While the system's final performance is not particularly impressive, this is largely due to a paucity of training data leading to a small lexicon, rather than to any substantial flaw in the system.

TOPICAL NAME USED AS SUBJECT

P Philology. Linguistics

PERSONAL NAME - PRIMARY RESPONSIBILITY

Hardie, Andrew

CORPORATE BODY NAME - SECONDARY RESPONSIBILITY

Lancaster University

ELECTRONIC LOCATION AND ACCESS

Electronic name