THE AUTOMATIC ANALYSIS OF THE GEORGIAN SENTENCE

Oleg Kapanadze; Nunu Kapanadze

Abstract

Until recently, most basic research in Natural Language Technology (NLT) has been carried out on “major” languages: predominantly English, but also German, Japanese, Chinese, French, and Spanish. At the same time, Low-Density Languages (LDLs) compete to take advantage of modern digital technologies implemented in high-quality computing systems. As a result, the long-term viability of languages not specifically supported by NLT is at risk, which can lead to their digital extinction. This paper presents an effort to develop computational applications for Georgian, in order to narrow the gap with technologically well-equipped languages and to reduce the current scarcity of language resources for Georgian text processing. Georgian is well known as a language with rich inflectional morphology and very few fixed structures at the sentence level. Languages of this design are called Morphologically Rich and Less-Configurational (MR&LC). The paper addresses the development of crucial NLT tools for Georgian as an MR&LC language: we discuss a Feature-Based Context-Free Grammar (FCFG) and a feature-grammar parser for the less-resourced Georgian language. Generative lexicalised parsing models, the mainstay of probabilistic parsing, perform less well when applied to languages with free word order or rich morphology. Drawing on the syntactic valency of the verb and on language-specific features such as productive morphology, we designed a prototype FCFG parser for automatic syntactic chunking (shallow parsing) of the Georgian clause, which we present here. As the initial step towards syntactic analysis, we reimplemented the rule-based Finite-State Morphological Transducer for morphological analysis, lemmatization, and POS tagging of Georgian text.
To build an interface between the TIGER XML scheme and the input format of the envisaged syntactic chunker, we had to manually disambiguate and reformat the output of the Georgian morphoparser. As a necessary step in implementing the valency-driven Feature-Based Grammar parser, we studied the Georgian verb stock and clustered it according to syntactic valency features. To date, eight verb clusters with different valency distributions and syntactic frames have been identified. For each cluster, we developed a prototype version of the Feature-Based Grammar for Georgian and started training it. As a testbed for syntactic parsing, we used NLTK, a widely recognized open-source library written in Python. We are currently developing a converter module that automatically ports the output of the morphoparser into a format accepted by the NLTK input engine. This would make it possible to pipe the morphological transducer into the Feature-Based syntactic parser, linking them into an unsupervised shallow syntactic chunker/parser for Georgian text.
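
As an illustration only, the converter step described above might look like the following minimal sketch. It is not the authors' module. The `Analysis` record, the feature names `CASE`, `SUBJ` and `OBJ`, the clause rules, and the simplified ASCII transliterations are all assumptions made for the example. The sketch turns already disambiguated morphological readings into lexical productions written in NLTK's feature-grammar notation. The verb entry carries its valency frame as the cases it requires (here an aorist transitive verb with an ergative subject and a nominative object), and several clause rules stand in for free word order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One disambiguated morphological reading (hypothetical record format)."""
    form: str     # surface token, simplified ASCII transliteration
    pos: str      # category symbol used in the grammar, e.g. "N" or "V"
    feats: tuple  # ordered (name, value) pairs, e.g. (("CASE", "erg"),)


def to_fcfg_rule(a: Analysis) -> str:
    """Render one reading as a lexical production in feature-grammar notation."""
    feats = ", ".join(f"{name}={value}" for name, value in a.feats)
    return f"{a.pos}[{feats}] -> '{a.form}'"


# Toy clause rules (an assumption, not the paper's grammar). The verb's SUBJ
# and OBJ features name the cases its valency frame requires; one rule per
# constituent order models the lack of a fixed sentence structure.
CLAUSE_RULES = """\
% start S
S -> NP[CASE=?s] NP[CASE=?o] V[SUBJ=?s, OBJ=?o]
S -> NP[CASE=?o] NP[CASE=?s] V[SUBJ=?s, OBJ=?o]
S -> NP[CASE=?s] V[SUBJ=?s, OBJ=?o] NP[CASE=?o]
NP[CASE=?c] -> N[CASE=?c]
"""


def build_grammar_text(analyses) -> str:
    """Append one lexical production per reading to the clause rules."""
    lexical = [to_fcfg_rule(a) for a in analyses]
    return CLAUSE_RULES + "\n".join(lexical) + "\n"


# Hypothetical morphoparser output for "the man wrote a letter".
SAMPLE = [
    Analysis("kacma", "N", (("CASE", "erg"),)),   # man, ergative
    Analysis("cerili", "N", (("CASE", "nom"),)),  # letter, nominative
    Analysis("dacera", "V", (("SUBJ", "erg"), ("OBJ", "nom"))),  # wrote
]

if __name__ == "__main__":
    print(build_grammar_text(SAMPLE))
```

If NLTK is installed, the generated text could be loaded with `nltk.grammar.FeatureGrammar.fromstring` and used with `nltk.parse.FeatureChartParser`. The sketch stops at string generation so that it runs with the standard library alone.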
