POS Tagging: A Logistic Analysis


Tag sets, or lexical tags, play an essential role in POS tagging because they carry significant information about a word and its neighbors in a corpus. A standard set of tags is therefore necessary for POS tagging in any language. A POS tag set defines the list of morphosyntactic categories that apply at the word level in a specific language, with one tag for each part of speech. At its core it is a set of coarse syntactic POS categories that exist in a similar form across languages, so the same coarse tag set can, in principle, be used for multiple languages. The tag sets for a language can be divided into two major categories: coarse-grained tag sets and fine-grained tag sets. A coarse-grained …
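The distinction between fine-grained and coarse-grained tag sets can be illustrated with a small mapping. The sketch below collapses a handful of Penn Treebank tags into coarse categories. The coarse labels (NOUN, VERB, ADJ, ADV, X), the `FINE_TO_COARSE` table, and the `coarsen` helper are illustrative assumptions, not a complete or official mapping.

```python
# Illustrative (incomplete) mapping from fine-grained Penn Treebank tags
# to coarse categories. The coarse labels are assumed for this sketch.
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
}


def coarsen(tagged_words, default="X"):
    """Replace each fine-grained tag with its coarse category.

    Tags missing from the table fall back to `default`.
    """
    return [(word, FINE_TO_COARSE.get(tag, default)) for word, tag in tagged_words]


if __name__ == "__main__":
    sentence = [("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
    print(coarsen(sentence))
```

The mapping only goes one way: several fine-grained tags share one coarse category, so the number and tense distinctions carried by tags such as NNS or VBD cannot be recovered from the coarse label alone.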

The Penn Treebank uses a tagset of 45 tags, and the C5 tagset uses 61. The CLAWS2 tagset, however, changed the structure of tagsets, moving from a flat structure with unitary tags to a hierarchical structure with decomposable tags. According to Baskaran, Bali et al. (2008), a POS tag set design should take into consideration all the morphosyntactic categories that can occur in a particular language or group of languages. Research on POS tag set design for European and East Asian languages started with basic listings of the important morphosyntactic features of a single language and evolved in later years towards hierarchical tag sets, decomposable tags, and common frameworks for multiple languages, such as EAGLES. Tagsets for English now follow the Penn Treebank tagset, whereas the EAGLES tagset is used for languages such as Catalan, Spanish, Russian, and Italian. According to them, the publication of the EAGLES guidelines for morphosyntactic annotation of corpora was an early attempt to develop a common tagset guideline for several European …
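The contrast between a flat, unitary tag and a hierarchical, decomposable one can be sketched as a small data structure. Everything in this example is hypothetical: the `HierTag` class, the dot-separated label format, and the category and feature names are invented for illustration and are not taken from CLAWS2 or EAGLES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierTag:
    """A decomposable tag: top-level category, optional subtype, features."""
    category: str
    subtype: str = ""
    features: tuple = ()  # tuple of (name, value) pairs

    def flat(self) -> str:
        """Render the tag as one dot-separated label."""
        parts = [self.category]
        if self.subtype:
            parts.append(self.subtype)
        parts.extend(f"{name}={value}" for name, value in self.features)
        return ".".join(parts)


def parse(label: str) -> HierTag:
    """Decompose a dot-separated label back into its levels."""
    parts = label.split(".")
    subtype = ""
    features = []
    for part in parts[1:]:
        if "=" in part:
            name, value = part.split("=", 1)
            features.append((name, value))
        else:
            subtype = part
    return HierTag(parts[0], subtype, tuple(features))


if __name__ == "__main__":
    tag = HierTag("N", "common", (("num", "pl"),))
    print(tag.flat())                 # full label
    print(parse(tag.flat()).category) # coarse level only
```

Because each level is stored separately, a tool that only needs the coarse category can read `category` and ignore the rest, which a flat unitary tag does not allow without a lookup table.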

Research on tagset design for Indian Languages (ILs), however, presents a contrasting picture. Very little work has been done on designing tagsets for Indian languages. One of the main reasons for this lack of research is that most tagsets for ILs are language specific and cannot be used for tagging data in other languages. This inconsistency hinders the interoperability and reusability of annotated corpora, which in turn affects NLP research in ILs, where the non-availability of tagged data is already a serious issue. Baskaran, Bali et al. (2008) therefore attempted to design a common POS-tagset framework for ILs, providing a detailed analysis of eight languages from two major families, Indo-Aryan and Dravidian. Their framework follows a hierarchical tagset layout similar to the EAGLES guidelines, but with significant changes to fit the requirements of ILs. According to them, the Indo-Aryan and Dravidian languages share noteworthy similarities in morphology and syntax, which makes it desirable to design a common tagset framework that can exploit these similarities to facilitate the mapping of different tagsets to each other. The hierarchy of their IL POSTS framework is set in three levels. The first level is the Obligatory level which consists of the
