TAL23-24: Chapter 3: Morphology | USMT Cours en ligne

1. Introduction

Morphology is a branch of linguistics and NLP that involves the study of the grammatical structure of words and how words are formed and varied within the lexicon of any given language.

Morphology studies morphemes, the smallest meaningful units in a word (whether they carry content meaning or functional meaning), and how these units can be arranged to create new words or new forms of the same word.

1.1 Morphological analysis

In natural language processing (NLP), morphological analysis refers to the process of analyzing the structure and formation of words, particularly how words are built from smaller units called morphemes. It involves breaking down words into these morphemes to understand their individual meanings and how they contribute to the overall meaning of the word.

This analysis is crucial for tasks such as stemming, lemmatization, and understanding word forms.

1.2 Word vs morpheme

So, what is a word? And what is a morpheme?

For example,

Is "I'm" in the sentence "I'm a computer scientist" a single word?

Is "فسيكتبونه" a single word?

If the latter is one word, then what is its part of speech, that is, its type?

Is it a verb? Is it a conjunction particle? Is it a pronoun?

But just how do we define a "word"?

In text like this, we can easily spot "words" because they are separated from each other by spaces or by punctuation.
There is no easy answer to this question, however. The situation is more complicated, and it also depends on the language typology.
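
As a rough illustration of why spaces and punctuation do not settle the question, the following Python sketch compares two naive tokenizers on the examples above. The function names and the regular expression are illustrative assumptions, not a standard tokenization scheme.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only: every space-delimited string counts as one word."""
    return text.split()

def punctuation_tokens(text):
    """Additionally split off each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I'm a computer scientist"
print(whitespace_tokens(sentence))   # "I'm" stays a single token
print(punctuation_tokens(sentence))  # "I'm" is split into I, ' and m

arabic = "فسيكتبونه"
# Neither tokenizer splits the Arabic string, although it contains several morphemes.
print(len(whitespace_tokens(arabic)), len(punctuation_tokens(arabic)))
```

The two tokenizers disagree on "I'm", and both treat the Arabic string as one unit, so neither answers what a "word" is.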

1.3 Morphology in languages

Languages differ in how they do morphology and in how much morphology they have. There are four broad types:

Isolating (or analytic) languages like Chinese or English have very little inflectional morphology and are also not rich in derivation. Most words consist of a single morpheme.
Agglutinative languages like Turkish or Telugu have many affixes and can stack them one after another like beads on a string.
Fusional (or flexional) languages like Spanish or German pack many inflectional meanings into single affixes, so that they are morphologically rich without “stacking” prefixes or suffixes.
Templatic languages like Arabic or Amharic are a special kind of fusional languages that perform much of their morphological work by changes internal to the root.

1.4 Definitions

In English, the word is defined as a sequence of morphemes.
For example, consider the word "unhappiness".

It contains three morphemes, each carrying a certain amount of meaning.

un means "not",
while ness means "being in a state or condition".
Happy is a free morpheme because it can appear on its own (as a "word" in its own right). By contrast, un and ness are bound morphemes: they have to be attached to a free morpheme, and so cannot be words in their own right. Thus, you can't have sentences in English such as "Jason feels very un ness today".

The morpheme, which is defined as the "minimal unit of meaning" can also be defined as "the minimal unit of grammatical analysis".
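
The segmentation of "unhappiness" can be sketched in Python as follows. This is a minimal toy analyzer, not a general English morphological analyzer: the tiny lexicons (PREFIXES, SUFFIXES, FREE_MORPHEMES), the function name segment and the single spelling rule (undoing the change of y to i before a suffix) are all assumptions made for this one example.

```python
# Toy lexicons (assumed for illustration; a real analyzer needs far larger ones).
PREFIXES = {"un": "not"}
SUFFIXES = {"ness": "being in a state or condition"}
FREE_MORPHEMES = {"happy"}

def segment(word):
    """Return a list of (morpheme, kind) pairs, where kind is 'bound', 'free' or 'unknown'."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), None)
    if prefix:
        word = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if word.endswith(s)), None)
    if suffix:
        word = word[:-len(suffix)]
    # Undo the spelling change y -> i that happens before a suffix (happy + ness -> happiness).
    if word not in FREE_MORPHEMES and word.endswith("i") and word[:-1] + "y" in FREE_MORPHEMES:
        word = word[:-1] + "y"
    result = []
    if prefix:
        result.append((prefix, "bound"))
    result.append((word, "free" if word in FREE_MORPHEMES else "unknown"))
    if suffix:
        result.append((suffix, "bound"))
    return result

print(segment("unhappiness"))
```

Note that the analyzer needs a spelling rule even for this simple word: the surface string "happi" is not itself in the lexicon.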

The figure below shows the morpheme types.

2. Two approaches

Morphology is the study of internal word structure. We distinguish two types of approaches to morphology: form-based morphology and functional morphology. Form-based morphology is about the form of units making up a word, their interactions with each other and how they relate to the word’s overall form. By contrast, functional morphology is about the function of units inside a word and how they affect its overall behavior syntactically and semantically.

A chart of the various morphological terms discussed in this section is presented in the figure below.

2.1 Form-based morphology

A central concept in form-based morphology is the morpheme, the smallest meaningful unit in a language.

A distinguishing feature of the morphology of Semitic languages (such as Arabic) is the presence of templatic morphemes in addition to concatenative morphemes. Concatenative morphemes participate in forming the word via a sequential concatenation process, whereas templatic morphemes are interleaved (interdigitated, merged).

2.1.1 Concatenative Morphology

In Arabic, there are three types of concatenative morphemes: stems, affixes and clitics. At the core of concatenative morphology is the stem, which is necessary for every word. Affixes attach to the stem.

There are three types of affixes:

prefixes attach before the stem, e.g., +ن n+ ‘first person plural of imperfective verbs’;
suffixes attach after the stem, e.g., ون+ +wn ‘nominative definite masculine sound plural’; and
circumfixes surround the stem, e.g., ت++ين t++yn ‘second person feminine singular of imperfective indicative verbs’. Circumfixes can be considered a coordinated prefix-suffix pair.

Modern Standard Arabic (MSA) has no pure prefixes that act with no coordination with a suffix.

Clitics attach to the stem after affixes. A clitic is a morpheme that has the syntactic characteristics of a word but shows evidence of being phonologically bound to another word. In this respect, a clitic is distinctly different from an affix, which is phonologically and syntactically part of the word.

Proclitics are clitics that precede the word (like a prefix), e.g., the conjunction +و w+ ‘and’ or the definite article +ال Al+ ‘the’.

Enclitics are clitics that follow the word (like a suffix), e.g., the object pronoun هم+ +hm ‘them’.

Multiple affixes and clitics can appear in a word. For example, the word وسيكتبونها wasayaktubuwnahA has two proclitics, one circumfix and one enclitic.
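
The concatenative structure of this word can be sketched as a small Python data structure. The class name WordForm, its field names and the exact vowelled segmentation (wa+ sa+ ya+ ktub +uwna +hA) are assumptions made for illustration; the text itself only gives the surface form and the number of morphemes of each type. The circumfix is stored as a coordinated prefix-suffix pair.

```python
from dataclasses import dataclass, field

@dataclass
class WordForm:
    stem: str
    proclitics: list = field(default_factory=list)  # outermost, before the prefix
    prefix: str = ""                                 # first half of a circumfix
    suffix: str = ""                                 # second half of a circumfix
    enclitics: list = field(default_factory=list)   # outermost, after the suffix

    def surface(self):
        """Concatenate the morphemes in order: proclitics, prefix, stem, suffix, enclitics."""
        return ("".join(self.proclitics) + self.prefix + self.stem
                + self.suffix + "".join(self.enclitics))

# Assumed segmentation of wasayaktubuwnahA (transliterated).
word = WordForm(stem="ktub", proclitics=["wa", "sa"],
                prefix="ya", suffix="uwna", enclitics=["hA"])
print(word.surface())
print(len(word.proclitics), "proclitics,", len(word.enclitics), "enclitic")
```

Affixes sit next to the stem and clitics attach outside them, which is the ordering the surface method encodes.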

The stem can be templatic or non-templatic. Templatic stems are stems that can be formed using templatic morphemes, whereas non-templatic word stems (NTWS) are not derivable from templatic morphemes. NTWSes tend to be foreign names and borrowed nominal terms (but never verbs), e.g., لندن.

NTWS can take nominal affixational and cliticization morphemes, e.g., واللندنيون ‘and the Londoners’.

2.1.2 Templatic Morphology

Templatic morphemes come in three types that are equally needed to create a word templatic stem: roots, patterns and vocalisms.

The root morpheme is a sequence of (mostly) three, (less so) four, or very rarely five consonants (termed radicals). The root signifies some abstract meaning shared by all its derivations. For example, the words katab ‘to write’, kAtib ‘writer’, maktuwb ‘written’ share the root morpheme k-t-b ‘writing-related’. For this reason, roots are traditionally used for organizing dictionaries and thesauri. That said, root semantics is often idiosyncratic. For example, the words laHm ‘meat’, laHam ‘to solder’, laH~Am ‘butcher/solderer’ and malHamaħ ‘epic/fierce battle/massacre’ are all said to have the same root l-H-m whose meaning is left to the reader to imagine.
The pattern morpheme is an abstract template in which roots and vocalisms are inserted. We represent the pattern as a string of letters including special symbols to mark where root radicals and vocalisms are inserted. We use the numbers 1, 2, 3, 4, or 5 to indicate radical position and the symbol V to indicate the position of the vocalism. For example, the pattern 1V22V3 indicates that the second root radical is to be doubled. A pattern can include letters for additional consonants and vowels, e.g., the verbal pattern V1tV2V3.
The vocalism morpheme specifies the short vowels to use with a pattern. Traditional accounts of Arabic morphology collapse the vocalism into the pattern. The separation of vocalisms was introduced with the emergence of more sophisticated models that abstract certain inflectional features that consistently vary across complex patterns, such as voice (passive versus active).

A word stem is constructed by interleaving (aka interdigitating) the three types of templatic morphemes. For example, the word stem katab ‘to write’ is constructed from the root k-t-b, the pattern 1V2V3 and the vocalism aa.
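
The interleaving described above can be sketched in a few lines of Python. The function name interdigitate and the convention of writing the root with hyphens are assumptions for illustration; digits in the pattern select radicals, V consumes the next vowel of the vocalism, and any other letter is copied as is.

```python
def interdigitate(root, pattern, vocalism):
    """Build a stem from a root such as 'k-t-b', a pattern such as '1V2V3' and a vocalism such as 'aa'."""
    radicals = root.split("-")
    vowels = iter(vocalism)
    stem = []
    for symbol in pattern:
        if symbol.isdigit():
            stem.append(radicals[int(symbol) - 1])  # radical position, 1-based
        elif symbol == "V":
            stem.append(next(vowels))               # next short vowel of the vocalism
        else:
            stem.append(symbol)                     # extra consonant or vowel in the pattern
    return "".join(stem)

print(interdigitate("k-t-b", "1V2V3", "aa"))   # the example from the text
print(interdigitate("k-t-b", "1V22V3", "aa"))  # second radical doubled
print(interdigitate("k-t-b", "1V2V3", "ui"))   # same pattern with a different vocalism
```

Keeping the vocalism separate from the pattern means that a change such as active versus passive voice only swaps the vocalism (here aa versus ui) while the pattern stays the same.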

2.1.3 Form adjustments

The process of combining morphemes can involve a number of phonological, morphological and orthographic rules that modify the form of the created word; it is not always a simple interleaving and concatenation of its morphemic components. These rules complicate the process of analyzing and generating Arabic words.

One example is the feminine morpheme ة+ +ħ (Ta-Marbuta [lit. tied T]), which is turned into ت+ +t (also called Ta-Maftuha [lit. open T]) when followed by a possessive clitic: أميرة+هم Âamiyraħu+hum ‘princess+their’ is realized as أميرتهم Âamiyratuhum ‘their princess’. We refer to the ت+ +t form of the morpheme ة+ +ħ as its allomorph. Similarly, by analogy to allophones and phonotactics, we can talk about morphotactics as the contextual conditions that cause a morpheme to be realized as one of its allomorphs.
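
The Ta-Marbuta rule can be sketched as a single orthographic rewrite in Python. The function name attach_possessive is an assumption for illustration, and the sketch works on undiacritized Arabic script only, so it ignores the case vowel that appears in the transliteration.

```python
TA_MARBUTA = "\u0629"  # ة, the form used at the end of a word
TA_MAFTUHA = "\u062A"  # ت, the allomorph used before a possessive clitic

def attach_possessive(word, clitic):
    """Attach a possessive enclitic, switching a final Ta-Marbuta to Ta-Maftuha."""
    if word.endswith(TA_MARBUTA):
        word = word[:-1] + TA_MAFTUHA
    return word + clitic

print(attach_possessive("أميرة", "هم"))  # the 'their princess' example from the text
```

A word that does not end in Ta-Marbuta is left unchanged and the clitic is simply concatenated.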

2.2 Functional morphology

In functional morphology, we study words in terms of their morpho-syntactic and morpho-semantic behavior as opposed to the form of the morphemes they are constructed from. We distinguish three functional operations:

derivation,
inflection and
cliticization.

The distinction between these three operations in Arabic is similar to that in other languages. This is not surprising since functional morphology tends to be a more language-independent way of characterizing words. The next four sections discuss derivational, inflectional and cliticization morphology in addition to the central concept of the lexeme.

2.2.1 Derivational morphology

Derivational morphology is concerned with creating new words from other words, a process in which the core meaning of the word is modified. For example, the Arabic kAtib ‘writer’ can be seen as derived from the verb katab ‘to write’ in the same way that the English writer can be seen as a derivation from write.

Derivational morphology usually involves a change in part-of-speech (POS). The derived variants in Arabic typically come from a set of relatively well-defined lexical relations, e.g., location, time, actor/doer/active participle and actee/object/passive participle among many others.

The derivation of one form from another typically involves a pattern switch. In the example above, the verb katab has the root k-t-b and the pattern 1a2a3; to derive the active participle of the verb, we switch in the pattern 1A2i3 to produce the form kAtib ‘writer’.
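
The pattern switch can be sketched in Python as a lookup followed by a substitution of radicals. The table PATTERNS, its keys and the function names are assumptions for illustration; here the vowels are written directly in the pattern, as in 1a2a3 and 1A2i3 above.

```python
# Hypothetical table mapping a lexical relation to a pattern (vowels included).
PATTERNS = {
    "verb": "1a2a3",
    "active participle": "1A2i3",
}

def apply_pattern(root, pattern):
    """Replace each digit in the pattern with the corresponding radical of the root."""
    radicals = root.split("-")
    return "".join(radicals[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)

def derive(root, relation):
    """Derive a form of the given root by switching in the pattern for the relation."""
    return apply_pattern(root, PATTERNS[relation])

print(derive("k-t-b", "verb"))
print(derive("k-t-b", "active participle"))
```

The root stays fixed and only the pattern changes, which is what makes the derivation a pattern switch rather than the addition of an affix.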

Although compositional aspects of derivations do exist, the derived meaning is often idiosyncratic. For example, the masculine noun maktab ‘office/bureau/agency’ and the feminine noun maktabaħ ‘library/bookstore’ are derived from the root k-t-b ‘writing-related’ with the pattern+vocalism ma12a3, which indicates location.

The exact type of the location is thus idiosyncratic, and it is not clear how the nominal gender difference can account for the semantic difference.

Last modified: Monday, 11 March 2024, 4:44 AM