TAL23-24: Inflectional Morphology

2.2.2 Inflectional Morphology

On the other hand, in inflectional morphology, the core meaning and POS of the word remain intact and the extensions are always predictable and limited to a set of possible features. Each feature has a finite set of associated values.

For example, the feature-value pairs number:plur and case:gen indicate that a particular analysis of the word wakutubihi is plural in number and genitive in case, respectively.

Inflectional features are all obligatory and must have a specific (non-nil) value for every word. Some features have POS restrictions.

In Arabic, there are eight inflectional features. Aspect, mood, person and voice only apply to verbs, while case and state only apply to nouns/adjectives. Gender and number apply to both verbs and nouns/adjectives.
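The POS restrictions above can be sketched as a small lookup. This is only an illustration: the function names are hypothetical, and the gender and state values in the sample analysis are assumed for the example (only number:plur and case:gen come from the text).

```python
# Which inflectional features apply to which POS, as listed in the text.
VERB_ONLY = {"aspect", "mood", "person", "voice"}
NOMINAL_ONLY = {"case", "state"}
SHARED = {"gender", "number"}


def applicable_features(pos):
    """Return the inflectional features that are obligatory for a POS."""
    if pos == "verb":
        return VERB_ONLY | SHARED
    if pos in ("noun", "adjective"):
        return NOMINAL_ONLY | SHARED
    raise ValueError(f"POS not covered by this sketch: {pos}")


def is_complete(pos, analysis):
    """True if exactly the applicable features are present, all non-nil."""
    required = applicable_features(pos)
    return set(analysis) == required and all(
        value is not None for value in analysis.values()
    )


# number and case come from the text; gender and state are assumed values.
noun_analysis = {
    "number": "plur",
    "case": "gen",
    "gender": "masc",
    "state": "construct",
}
```

Under these assumptions, is_complete("noun", noun_analysis) holds, while an analysis that omits state or gender is rejected because every applicable feature is obligatory.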

2.2.3 Cliticization Morphology

Cliticization is closely related to inflectional morphology. Similar to inflection, cliticization does not change the core meaning of the word. However, unlike inflectional features, which are all obligatory, clitics (i.e., clitic features) are all optional.

Moreover, while inflectional morphology is expressed using both templatic and concatenative morphology (i.e., using patterns, vocalisms and affixes), cliticization is only expressed using concatenative morphology (i.e., using affix-like clitics).

2.2.4 The Lexeme

The core meaning of a word in functional morphology is often referred to using a variety of terms, such as the lexeme, the lemma or the vocable. These terms are not equivalent.

A lexeme is a lexicographic abstraction: it is the set of all word forms that share a core meaning and differ only in inflection and cliticization. For example, the lexeme bayt1 ‘house’ includes bayt ‘house’, lilbayti ‘for the house’ and buyuwt ‘houses’, among others, while the lexeme bayt2 ‘verse’ includes bayt ‘verse’, lilbayti ‘for the verse’ and abyAt ‘verses’, among others.

Note that the singulars in the two lexemes are homonyms¹ but the plurals are not. This is called partial paradigm homonymy. Sometimes two lexemes share the full inflectional paradigm and differ only in meaning (full paradigm homonymy), as do the lexemes qAςidaħ1 ‘rule’ and qAςidaħ2 ‘base’. A lexeme can be referred to uniquely by supplementing the lemma with an index (as above), with additional forms that are necessary to distinguish the lexeme (such as the plural form) and/or with a gloss in another language.

By contrast, the lemma (also called the citation form) is a conventionalized choice of one of the word forms to stand for the set. For instance, in Arabic the lemma of a verb is the third person masculine singular perfective form, while the lemma of a noun is the masculine singular form (or the feminine singular if no masculine is possible). Lemmas typically carry no clitics and no sense/meaning indices.

For the examples above, the lemmas are bayt and qAςidaħ, both of which collapse/ignore semantic differences and morphological differences. Lexemes are commonly represented using sense-indexed lemmas (as we saw above).

The term vocable is a purely morphological characterization of a set of word forms without semantic distinctions. Words with partial paradigm homonymy are represented with two vocables (e.g., bayt1 ‘house’ and bayt2 ‘verse’); however, words with full paradigm homonymy are represented with one vocable (e.g., qAςidaħ ‘rule/base’).
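The difference between lexemes, lemmas and vocables can be made concrete with a toy data structure. The sketch below uses only the word forms mentioned above, with the Ta-Marbuta written as ħ. For the ‘rule’/‘base’ pair, only the lemma form is listed, and the identical form sets are an assumption standing in for the full shared paradigm. The function names are hypothetical.

```python
from collections import defaultdict

# Sense-indexed lemma -> (lemma, gloss, known word forms).
LEXEMES = {
    "bayt1": ("bayt", "house", {"bayt", "lilbayti", "buyuwt"}),
    "bayt2": ("bayt", "verse", {"bayt", "lilbayti", "abyAt"}),
    # Full paradigm homonymy: both form sets are assumed to be identical.
    "qAςidaħ1": ("qAςidaħ", "rule", {"qAςidaħ"}),
    "qAςidaħ2": ("qAςidaħ", "base", {"qAςidaħ"}),
}


def lemmas(lexemes):
    """Lemmas ignore sense indices, so homonymous lexemes collapse."""
    return {lemma for lemma, _gloss, _forms in lexemes.values()}


def vocables(lexemes):
    """Group lexemes that share a lemma and an identical set of forms."""
    groups = defaultdict(list)
    for lexeme_id, (lemma, _gloss, forms) in lexemes.items():
        groups[(lemma, frozenset(forms))].append(lexeme_id)
    return sorted(sorted(ids) for ids in groups.values())
```

With this data there are four lexemes, two lemmas and three vocables: the two bayt lexemes stay apart because their plurals differ, while the ‘rule’ and ‘base’ lexemes fall into a single vocable.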

N.B.: The terms for root and stem are sometimes confused with lemma, lexeme and vocable.

3. Computational Morphology Tasks

3.1 Morphological analysis

Morphological analysis refers to the process by which a word (typically defined orthographically) has all of its possible morphological analyses determined. Each analysis also includes a single choice of core part-of-speech (such as noun or verb; the exact set is a matter of choice).

A morphological analysis can be either form-based, in which case we divide a word into all of its constituent morphemes, or functional, in which case we also interpret these morphemes.

For example, in broken (i.e., irregular) plurals, a form-based analysis may not identify the fact that the word is a plural since it lacks the usual plural morpheme while a functional analysis would.
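A minimal sketch of this contrast, assuming a toy lexicon and a toy list of sound (regular) plural suffixes. The word muslimuwn and the suffix list are additions for illustration, not taken from the text, and all function names are hypothetical.

```python
# Toy list of sound masculine plural suffixes (an assumption for this sketch).
SOUND_PLURAL_SUFFIXES = ("uwn", "iyn")

# Toy functional lexicon: surface form -> all of its analyses.
LEXICON = {
    "bayt": [
        {"lemma": "bayt", "gloss": "house", "pos": "noun", "number": "sing"},
        {"lemma": "bayt", "gloss": "verse", "pos": "noun", "number": "sing"},
    ],
    "buyuwt": [
        {"lemma": "bayt", "gloss": "house", "pos": "noun", "number": "plur"},
    ],
    "muslimuwn": [
        {"lemma": "muslim", "gloss": "Muslim", "pos": "noun", "number": "plur"},
    ],
}


def analyze(word):
    """Return every analysis of the word, not just one."""
    return LEXICON.get(word, [])


def has_plural_morpheme(word):
    """Form-based view: is there an overt plural suffix?"""
    return word.endswith(SOUND_PLURAL_SUFFIXES)


def functional_numbers(word):
    """Functional view: the number values across all analyses."""
    return {analysis["number"] for analysis in analyze(word)}
```

For the broken plural buyuwt, has_plural_morpheme finds no plural suffix, yet functional_numbers still reports plur. For the sound plural muslimuwn the two views agree.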

3.2 Morphological generation

Morphological generation is essentially the reverse of morphological analysis. It is the process in which we map from an underlying representation of a word to a surface form (whether orthographic or phonological).

The big question for generation is what representation to map from. The shallower the representation, the easier the task. Some representations may be less constrained than others and, as such, lead to multiple valid realizations. Functional representations are often thought of as the prototypical starting point for generation.
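The effect of an under-constrained input can be sketched with the bayt forms from section 2.2.4. The table and the generate function below are hypothetical; they only illustrate that a lemma plus features may yield several realizations where a sense-indexed lexeme yields one.

```python
# (lexeme, number) -> surface form, using the bayt example from the text.
FORMS = {
    ("bayt1", "sing"): "bayt",
    ("bayt1", "plur"): "buyuwt",
    ("bayt2", "sing"): "bayt",
    ("bayt2", "plur"): "abyAt",
}
LEMMA_OF = {"bayt1": "bayt", "bayt2": "bayt"}


def generate(number, lexeme=None, lemma=None):
    """Return every surface form compatible with the given representation."""
    return {
        form
        for (lex, num), form in FORMS.items()
        if num == number
        and (lexeme is None or lex == lexeme)
        and (lemma is None or LEMMA_OF[lex] == lemma)
    }
```

Asking for the plural of the lexeme bayt1 gives a single form, whereas asking for the plural of the lemma bayt gives both buyuwt and abyAt, since the lemma does not say which sense is meant.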

3.3 Morphological disambiguation

Morphological disambiguation refers to choosing one morphological analysis for a word in context.

This task for English is referred to as POS tagging since the standard POS tag set, though only comprising 46 tags, completely disambiguates English morphologically.

In Arabic, the corresponding tag set may comprise upwards of hundreds of theoretically possible tags, so the task is much harder.

Reduced tag sets have been proposed for Arabic, in which certain morphological differences are conflated, making the morphological disambiguation task easier. The term POS tagging is usually used for Arabic with respect to some of the smaller tag sets.
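One way to picture a reduced tag set is as a projection of full feature-value tags onto a subset of features. The tags and the reduce_tag helper below are hypothetical and do not correspond to any published Arabic tag set.

```python
def reduce_tag(tag, keep):
    """Project a full feature-value tag onto the features in `keep`."""
    return tuple((feature, value) for feature, value in tag if feature in keep)


# A few illustrative full tags (feature values are assumed for the example).
FULL_TAGS = [
    (("pos", "noun"), ("number", "sing"), ("case", "nom")),
    (("pos", "noun"), ("number", "sing"), ("case", "gen")),
    (("pos", "noun"), ("number", "plur"), ("case", "gen")),
    (("pos", "verb"), ("number", "sing"), ("person", "3")),
]

pos_only = {reduce_tag(tag, keep={"pos"}) for tag in FULL_TAGS}
pos_and_number = {reduce_tag(tag, keep={"pos", "number"}) for tag in FULL_TAGS}
```

Here four distinct full tags conflate to three when only POS and number are kept, and to two when only POS is kept. Fewer tags means fewer distinctions for the tagger to resolve.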

3.4 Tokenization

Tokenization (also sometimes called segmentation) refers to the division of a word into clusters of consecutive morphemes, one of which typically corresponds to the word stem, usually including inflectional morphemes. Tokenization involves two kinds of decisions that define a tokenization scheme.

First, we need to choose which types of morphemes to segment. There is no single correct segmentation.

Second, we need to decide whether, after separating some morphemes, to regularize the orthography of the resulting segments, since the concatenation of morphemes can lead to spelling changes at their boundaries.

For example, the Ta-Marbuta (ħ) appears as a regular Ta (t) when followed by a pronominal enclitic; however, when we segment the enclitic, it may be desirable to return the Ta-Marbuta to its word-final form.

Usually, the term segmentation is only used when no orthography regularization takes place. Orthography regularization is desirable in NLP because it reduces data sparseness, as does tokenization itself.
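The two decisions can be sketched with a toy tokenizer that splits off pronominal enclitics and optionally restores the Ta-Marbuta. The enclitic list, the small lexicon of word-final spellings, the example word mdrsthm ‘their school’ (undiacritized) and the "+" marker on enclitics are all assumptions made for this illustration.

```python
# Toy list of pronominal enclitics, longest first (an assumption).
ENCLITICS = ("hm", "hA", "h")

# Toy lexicon of word-final (unsuffixed) spellings (an assumption).
BASE_FORMS = {"mdrsħ", "byt"}


def tokenize(word, regularize=True):
    """Split off one enclitic; optionally restore a final Ta-Marbuta."""
    for enclitic in ENCLITICS:
        if not word.endswith(enclitic):
            continue
        stem = word[: -len(enclitic)]
        # A stem-final t may be a Ta-Marbuta that changed before the enclitic.
        restored = stem[:-1] + "ħ" if stem.endswith("t") else None
        if stem in BASE_FORMS:
            return [stem, "+" + enclitic]
        if restored in BASE_FORMS:
            return [restored if regularize else stem, "+" + enclitic]
    return [word]
```

With regularization, mdrsthm becomes mdrsħ plus +hm, so the segmented stem matches the free-standing word. Without it, the stem stays as mdrst. The final t of byt is left alone because byt is itself a known base form.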

3.5 Lemmatization

Lemmatization is the mapping of a word form to its corresponding lemma, the canonical representative of its lexeme. Lemmatization is a specific instantiation of the more general task of lexeme identification in which ambiguous lemmas are further resolved.

Lemmatization should not be confused with stemming, which maps the word into its stem. Another related task is root extraction, which focuses on identifying the root of the word.
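The relation between lemmatization and lexeme identification can be illustrated with the bayt forms from section 2.2.4. The table and function names below are hypothetical, and the sketch ignores context, which a real system would need in order to pick one lexeme.

```python
# Word form -> lexemes it can belong to (forms from the bayt example).
FORM_TO_LEXEMES = {
    "bayt": {"bayt1", "bayt2"},
    "lilbayti": {"bayt1", "bayt2"},
    "buyuwt": {"bayt1"},
    "abyAt": {"bayt2"},
}


def lemma_of(lexeme):
    """Drop the trailing sense index: 'bayt1' -> 'bayt'."""
    return lexeme.rstrip("0123456789")


def identify_lexemes(word):
    """Lexeme identification: keep the sense distinctions."""
    return FORM_TO_LEXEMES[word]


def lemmatize(word):
    """Lemmatization: map to the lemma, collapsing sense distinctions."""
    return {lemma_of(lexeme) for lexeme in FORM_TO_LEXEMES[word]}
```

Every form here lemmatizes to bayt, but only the plurals identify a single lexeme out of context. The singular bayt remains ambiguous between ‘house’ and ‘verse’.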

3.6 Diacritization

Diacritization is the process of recovering missing diacritics (short vowels, nunation, the marker of the absence of a short vowel, and the gemination marker). Diacritization is closely related to morphological disambiguation and to lemmatization: for an undiacritized word form, different morphological feature values and different lemmas can both lead to different diacritizations.
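The ambiguity that diacritization has to resolve is easiest to see by running the process in reverse. The sketch below strips the eight diacritics in the Unicode range U+064B to U+0652 and shows two differently diacritized words collapsing to the same undiacritized string. The example words kataba ‘he wrote’ and kutub ‘books’ are additions for illustration.

```python
# Arabic diacritics U+064B..U+0652: three nunation marks, three short
# vowels, the gemination marker (shadda) and the no-vowel marker (sukun).
DIACRITICS = {chr(code_point) for code_point in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Remove all diacritics, keeping the base letters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba 'he wrote'
kutub = "\u0643\u064F\u062A\u064F\u0628"  # kutub 'books'
bare = "\u0643\u062A\u0628"  # the shared undiacritized spelling
```

Both words reduce to the same three letters, so a diacritizer given only the bare form must decide between a verb and a plural noun, which is exactly a choice of lemma and morphological features.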

--------------------------------------------------

¹ Homonymy is the state of two words having identical form (same spelling and same pronunciation) but different meaning, e.g., bayt is both ‘house’ and ‘poetic verse’.

If the two words have the same spelling but not the same pronunciation, they are called homographs, e.g., the French word fils can be pronounced fiss ‘son’ or fil ‘thread’.

If the two words have the same pronunciation but different spellings, they are called homophones, e.g., the French words verre ‘glass’ and vert ‘green’ are both pronounced the same way.

A homonym must be both a homograph and a homophone.
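These three definitions amount to a two-way comparison of spelling and pronunciation. A minimal sketch, assuming each word is given as a (spelling, pronunciation) pair and that the two words differ in meaning; the function name and the pronunciation strings are informal stand-ins:

```python
def classify(word_a, word_b):
    """Classify two different-meaning words given (spelling, pronunciation)."""
    same_spelling = word_a[0] == word_b[0]
    same_sound = word_a[1] == word_b[1]
    if same_spelling and same_sound:
        return "homonyms"
    if same_spelling:
        return "homographs"
    if same_sound:
        return "homophones"
    return "unrelated"


# Examples from the text; pronunciations are informal renderings.
house_verse = classify(("bayt", "bayt"), ("bayt", "bayt"))
son_thread = classify(("fils", "fiss"), ("fils", "fil"))
glass_green = classify(("verre", "vɛʁ"), ("vert", "vɛʁ"))
```

The order of the checks mirrors the last sentence above: a pair is only called homonyms when it passes both the homograph and the homophone test.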

Last modified: Monday, 22 April 2024, 04:47