Tokenization

Tokenization is well-established and well-understood for artificial languages such as programming languages. However, such artificial languages can be strictly defined to eliminate lexical and structural ambiguities; we do not have this luxury with natural languages, in which the same character can serve many different purposes and in which the syntax is not strictly defined. Many factors can affect the difficulty of tokenizing a particular natural language.


One fundamental difference exists between tokenization approaches for space-delimited languages and approaches for unsegmented languages.

In space-delimited languages, such as most European languages, some word boundaries are indicated by the insertion of whitespace. The character sequences delimited are not necessarily the tokens required for further processing, due both to the ambiguous nature of the writing systems and to the range of tokenization conventions required by different applications. 

In unsegmented languages, such as Chinese and Thai, words are written in succession with no indication of word boundaries. The tokenization of unsegmented languages therefore requires additional lexical and morphological information.


In both unsegmented and space-delimited languages, the specific challenges posed by tokenization are largely dependent on both the writing system (logographic, syllabic, or alphabetic; see the note on writing systems at the end of this section) and the typographical structure of the words. There are three main categories into which word structures can be placed, and each category exists in both unsegmented and space-delimited writing systems. 

The morphology of words in a language can be 

  • isolating, where words do not divide into smaller units; 
  • agglutinating (or agglutinative), where words divide into smaller units (morphemes) with clear boundaries between the morphemes; 
  • or inflectional, where the boundaries between morphemes are not clear and where the component morphemes can express more than one grammatical meaning. 
While individual languages show tendencies toward one specific type (e.g., Mandarin Chinese is predominantly isolating, Japanese is strongly agglutinative, and Latin is largely inflectional), most languages exhibit traces of all three.
A fourth typological classification frequently studied by linguists, polysynthetic, can be considered an extreme case of the agglutinative type, in which several morphemes are combined to form complex words that can function as a whole sentence.
Chukchi and Inuktitut are examples of polysynthetic languages, and some research in machine translation has focused on a Nunavut Hansards parallel corpus of Inuktitut and English.
Since the techniques used in tokenizing space-delimited languages are very different from those used in tokenizing unsegmented languages, we discuss the techniques separately in different sections.

Tokenization in Space-Delimited Languages
In many alphabetic writing systems, including those that use the Latin alphabet, words are separated by whitespace. Yet even in a well-formed corpus of sentences, there are many issues to resolve in tokenization.
Most tokenization ambiguity exists among uses of punctuation marks, such as periods, commas, quotation marks, apostrophes, and hyphens, since the same punctuation mark can serve many different functions in a single sentence, let alone a single text.
Consider the following example sentence from the Wall Street Journal (1988).

Clairson International Corp. said it expects to report a net loss for its second quarter ended March 26 and doesn’t expect to meet analysts’ profit estimates of $3.9 to $4 million, or 76 cents a share to 79 cents a share, for its year ending Sept. 24.

This sentence has several items of interest that are common for Latinate, alphabetic, space-delimited languages. 
  1. First, it uses periods in three different ways: 
    1. within numbers as a decimal point ($3.9),
    2. to mark abbreviations (Corp. and Sept.),
    3. and to mark the end of the sentence, in which case the period following the number 24 is not a decimal point.
  2. The sentence uses apostrophes in two ways: 
    1. to mark the genitive case (where the apostrophe denotes possession) in analysts’ 
    2. and to show contractions (places where letters have been left out of words) in doesn’t.
The tokenizer must thus be aware of the uses of punctuation marks and be able to determine when a punctuation mark is part of another token and when it is a separate
token.
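
As a rough illustration, the sketch below shows how a simple regular-expression tokenizer might handle these cases: it keeps decimal numbers and contractions intact, consults a small (hypothetical) abbreviation list, and otherwise splits a trailing period off as a separate token. This is only a minimal sketch; production tokenizers use much larger abbreviation lists and additional disambiguation.

```python
import re

# Toy abbreviation list; a real tokenizer would use a much larger, curated set.
ABBREVIATIONS = {"Corp.", "Sept.", "Mr.", "Dr.", "Inc."}

# Ordered alternatives: numbers and period-final letter sequences are tried
# before generic words so their internal periods are not split off.
TOKEN_PATTERN = re.compile(r"""
      \$?\d+(?:\.\d+)?      # numbers, optionally with $ and a decimal point ($3.9, 24)
    | [A-Za-z]+\.           # letter sequences ending in a period (Corp., Sept., word + final period)
    | \w+(?:'\w+)?          # ordinary words, keeping contractions such as doesn't intact
    | [^\w\s]               # any other single punctuation character
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        tok = match.group()
        # If a token ends in a period but is not a known abbreviation,
        # treat the period as sentence-final and split it off.
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

# Note: a bare genitive apostrophe (analysts') ends up as its own token here;
# a real tokenizer might choose a different convention.
print(tokenize("Clairson International Corp. said it doesn't expect to meet "
               "analysts' profit estimates of $3.9 to $4 million by Sept. 24."))
```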

Tokenization in Unsegmented Languages
The nature of the tokenization task in unsegmented languages like Chinese, Japanese, and Thai is fundamentally different from tokenization in space-delimited languages like English. 
The lack of any spaces between words necessitates a more informed approach than simple lexical analysis. 
The specific approach to word segmentation for a particular unsegmented language is further limited by the writing system and orthography of the language, and a single general approach has not been developed.

Common Approaches

A very common approach to word segmentation is to use a variation of the maximum matching algorithm, frequently referred to as the greedy algorithm.

The greedy algorithm starts at the first character in a text and, using a word list for the language being segmented, attempts to find the longest word in the list starting with that character.

If a word is found, the maximum-matching algorithm marks a boundary at the end of the longest word, then begins the same longest match search starting at the character following the match. If no match is found in the word list, the greedy algorithm simply segments that character as a word and begins the search starting at the next character.
A variation of the greedy algorithm segments a sequence of unmatched characters as a single word; this variant is more likely to be successful in writing systems with longer average word lengths. 

In this manner, an initial segmentation can be obtained that is more informed than a simple character-as-word approach. The success of this algorithm is largely dependent on the word list.

A variant of the maximum matching algorithm is the reverse maximum matching algorithm, in which the matching proceeds from the end of the string of characters, rather than the beginning.
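
The sketch below implements both the forward and the reverse variants on a toy word list. It is for illustration only: in practice the word list would be a large lexicon of the language being segmented, and the example string (an English phrase with its spaces removed) merely stands in for genuinely unsegmented text.

```python
def max_match(text, wordlist):
    """Forward maximum matching (greedy) segmentation."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate starting at i first, down to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in wordlist:
                match = text[i:j]
                break
        if match is None:
            match = text[i]          # unmatched character becomes its own token
        tokens.append(match)
        i += len(match)
    return tokens

def reverse_max_match(text, wordlist):
    """Reverse maximum matching: same idea, scanning from the end of the string."""
    tokens, j = [], len(text)
    while j > 0:
        match = None
        # The smallest i gives the longest word ending at position j.
        for i in range(0, j):
            if text[i:j] in wordlist:
                match = text[i:j]
                break
        if match is None:
            match = text[j - 1]
        tokens.insert(0, match)
        j -= len(match)
    return tokens

wordlist = {"the", "table", "down", "there", "them"}
print(max_match("thetabledownthere", wordlist))          # ['the', 'table', 'down', 'there']
print(reverse_max_match("thetabledownthere", wordlist))  # ['the', 'table', 'down', 'there']
```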

Tokenization of Arabic

The common wisdom in NLP is that tokenization of Arabic words through decliticization and reductive orthographic normalization is helpful for many applications such as language modeling (LM), IR and statistical MT (SMT). Tokenization and normalization reduce sparsity and perplexity and decrease the number of out-of-vocabulary (OOV) words.

We distinguish between tokenization schemes and tokenization techniques.

  • The scheme defines what the target tokenization is; 
  • the technique defines how to implement it.

Tokenization schemes vary along two dimensions: what to split (segmentation) and in what form to represent the split parts (regularization). There is a very large number of possible tokenization schemes.

In the context of IR, the form of tokenization often used is called stemming. In stemming, split clitics and other non-core morphemes are simply deleted.
Tokenization techniques can be as simple as a greedy regular expression or more complex, involving morphological analysis and disambiguation. 

Since morphological ambiguity in Arabic is rampant, the more complex a scheme, the harder it is to tokenize correctly in context. The more complex techniques have been shown to help in that regard; however, it should be noted that in certain contexts less is more: e.g., phrase-based SMT benefits from complex tokenizations only when training data is scarce and sparsity is a big problem. As more training data is introduced, complex tokenizations actually start to hurt compared to simpler tokenizations.
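
As a toy illustration of the "greedy regular expression" end of this spectrum, the sketch below splits off just two proclitics, the conjunction و (wa-) and the definite article ال (al-), and marks them with "+", a common convention in Arabic tokenization schemes. This is an assumption-laden sketch, not a real tokenizer: without morphological analysis and disambiguation it will over-split words whose initial letters merely look like clitics, which is exactly why tools such as MADAMIRA use the more complex techniques described above.

```python
import re

# A deliberately tiny clitic inventory for illustration only:
# the conjunction proclitic "و" (wa-) and the definite article "ال" (al-).
PROCLITIC_PATTERN = re.compile(r"^(و)?(ال)?(.+)$")

def naive_decliticize(word):
    """Greedily split leading proclitics off a single Arabic word."""
    conj, article, rest = PROCLITIC_PATTERN.match(word).groups()
    parts = [p for p in (conj, article) if p]
    # Mark split clitics with "+" so the tokenization remains recoverable.
    return [p + "+" for p in parts] + [rest]

# "والكتاب" (wa-al-kitab, "and the book") -> ['و+', 'ال+', 'كتاب']
print(naive_decliticize("والكتاب"))
```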

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their base or root form (stem). The root form may not necessarily be a valid word on its own, but it represents the core meaning of the word. Stemming is commonly used in text mining, information retrieval, and other NLP tasks to normalize words with similar meanings and reduce the complexity of the vocabulary.

For example, applying stemming to words like "running," "runs," and "runner" would result in the common stem "run." Similarly, stemming "playing," "played," and "plays" would yield the stem "play."

Stemming algorithms operate by applying heuristic rules to strip affixes from words. These rules are based on linguistic principles and patterns in the language. While stemming can help reduce word variations and improve text processing efficiency, it may also result in errors or inaccuracies due to overgeneralization or undergeneralization. For instance, the stem "comput" might be produced for both "compute" and "computer," which could lead to ambiguity in certain contexts.

Stemming algorithms for English

When defining a stemming algorithm, a first approach is to remove only inflectional suffixes. For English, such a procedure conflates singular and plural word forms (“car” and “cars”) and removes the past participle ending “-ed” and the gerund or present participle ending “-ing” (conflating “eating” and “eat”).
Stemming schemes that remove only morphological inflections are termed “light” suffix-stripping algorithms, while more sophisticated approaches have also been proposed to remove derivational suffixes (e.g., “-ment,” “-ably,” “-ship” in the English language). For example, Lovins (1968) is based on a list of over 260 suffixes, while Porter’s algorithm (Porter, 1980) looks for about 60 suffixes. 

In such cases, suffix removal is also controlled through the addition of quantitative restrictions (e.g., “-ing” would be removed only if the resulting stem had more than three letters, as in “running” but not in “king”) or qualitative restrictions (e.g., “-ize” would be removed only if the resulting stem did not end with “e,” so it is not stripped from “seize”). Moreover, certain ad hoc spelling-correction rules are used to improve conflation accuracy (e.g., “running” gives “run” and not “runn”), compensating for spelling conventions, such as consonant doubling, that reflect pronunciation.
Of course, one should not stem proper nouns such as “Collins” or “Hawking,” at least when the system can recognize them.
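
A minimal sketch of such a "light" suffix stripper is given below. The suffix list, the length threshold, and the consonant-undoubling rule are purely illustrative choices, not those of any published stemmer.

```python
# Illustrative thresholds and rules only -- not any published stemmer.
VOWELS = set("aeiou")

def light_stem(word):
    """Strip a few inflectional suffixes under simple quantitative restrictions."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Quantitative restriction: keep the suffix unless the remaining
            # stem is long enough (so "king" is not reduced to "k").
            if len(stem) < 3:
                continue
            # Spelling-correction rule: undouble a final consonant,
            # so "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    return word

for w in ("cars", "eating", "running", "king", "played"):
    print(w, "->", light_stem(w))
```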

Common stemming algorithms include:

  1. Porter Stemmer: Developed by Martin Porter in 1980, the Porter Stemmer is one of the most widely used stemming algorithms. It applies a series of heuristic rules to remove common English suffixes.

  2. Snowball Stemmer (Porter2): Also developed by Martin Porter, the Snowball Stemmer is an improvement over the original Porter Stemmer and supports multiple languages.

  3. Lancaster Stemmer: This stemming algorithm is more aggressive than the Porter Stemmer and may produce shorter stems. It was developed by Chris D. Paice in 1990.
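
All three stemmers are available in NLTK, so they can be compared side by side. The snippet below (which assumes NLTK is installed) simply prints each stemmer's output for a handful of words; the results differ between algorithms and do not always match the idealized stems given earlier, with the Lancaster stemmer typically producing the shortest forms.

```python
# Requires NLTK: pip install nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

words = ["running", "runs", "runner", "compute", "computer", "played", "king"]

# Print each stemmer's output for the same word list to compare behaviour.
for name, stemmer in stemmers.items():
    print(name, [stemmer.stem(w) for w in words])
```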

Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form, known as a lemma. Unlike stemming, which simply removes affixes to produce a root form, lemmatization considers the context and meaning of a word to determine its canonical form.

For example:

  • The lemma of "running," "runs," and "ran" is "run."
  • The lemma of "better" is "good."
  • The lemma of "wolves" is "wolf."

Lemmatization helps to normalize words by reducing them to their dictionary form, which can improve the accuracy of NLP tasks such as text analysis, information retrieval, and machine learning. By converting different inflected forms of a word into a common lemma, lemmatization reduces the vocabulary size and helps to ensure that words with similar meanings are treated identically.

Lemmatization typically involves the use of a dictionary or vocabulary of words along with morphological analysis to determine the lemma of a given word. This process takes into account factors such as part of speech, tense, and context to accurately identify the base form of each word.
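
The short example below illustrates this with NLTK's WordNet-based lemmatizer (described further in the next subsection). It assumes NLTK and its WordNet data are installed, and it shows why the part of speech matters: the same surface form can map to different lemmas depending on how it is tagged.

```python
# Requires NLTK and its WordNet data:
#   pip install nltk
#   python -c "import nltk; nltk.download('wordnet')"
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("wolves"))           # noun by default -> "wolf"
print(lemmatizer.lemmatize("running", pos="v")) # verb -> "run"
print(lemmatizer.lemmatize("better", pos="a"))  # adjective -> "good"
# Without the right part of speech the lemma can differ:
print(lemmatizer.lemmatize("running"))          # treated as a noun -> "running"
```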


Lemmatization resources and tools

There are lemmatization algorithms and tools specifically designed for the English language. These algorithms utilize linguistic resources such as dictionaries, word lists, and rules to map words to their corresponding lemmas.

Here are some commonly used lemmatization algorithms and tools for English:

  1. WordNet: WordNet is a lexical database of the English language that provides a hierarchical structure of words and their semantic relationships. It is often used in lemmatization to map words to their base forms.

  2. NLTK (Natural Language Toolkit): NLTK is a popular Python library for NLP that provides lemmatization functionality. It includes a WordNet-based lemmatizer that utilizes WordNet's lexical database to map words to their lemmas.

  3. spaCy: spaCy is another widely used Python library for NLP that includes a lemmatization component. spaCy's lemmatizer combines lookup tables and language-specific rules with part-of-speech information from its trained pipelines.

  4. Stanford CoreNLP: Stanford CoreNLP is a suite of NLP tools developed by the Stanford NLP Group. It includes a lemmatization module that leverages statistical models and linguistic rules to perform lemmatization.

  5. Pattern: Pattern is a Python library that provides various NLP tools, including a lemmatizer. The lemmatizer in Pattern utilizes language-specific rules and patterns to map words to their lemmas.
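
As a brief usage sketch, the snippet below lemmatizes a whole sentence with spaCy, assuming the small English pipeline en_core_web_sm has been downloaded; the lemmas come from spaCy's lemmatizer guided by the part-of-speech tags it predicts for each token in context.

```python
# Requires spaCy and an English pipeline, e.g.:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The wolves were running faster than the striped dogs.")

# Print each token alongside the lemma assigned in context.
for token in doc:
    print(f"{token.text:10} -> {token.lemma_}")
```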

Arabic lemmatization

Arabic lemmatization presents unique challenges due to the rich morphology and complex word structures of the Arabic language. In Arabic, words can have numerous inflectional and derivational forms, making it essential to have specialized algorithms and tools for lemmatization.

Some specific resources and tools for Arabic lemmatization include:

  • BAMA (Buckwalter Arabic Morphological Analyzer): BAMA is a morphological analyzer for Arabic that provides lemmatization and morphological analysis capabilities. It is widely used in Arabic NLP research and applications.

  • Arabic Corpora and Lexicons: Linguistic resources such as Arabic corpora and lexicons contain annotated text data and lexical information that can be used for lemmatization and language processing tasks.

  • Arabic NLP Libraries: Libraries such as Farasa, MADAMIRA, and Kalimat offer NLP functionalities for Arabic, including lemmatization, tokenization, and morphological analysis.




------------------------------------------------------------------------------------------------------

Three types of writing systems

1- logographic: a large number (often thousands) of individual symbols each represent a word (e.g., Chinese characters and Japanese kanji; historically also used to write Korean and Vietnamese; Egyptian hieroglyphics)
2- syllabic: individual symbols (fewer than 100) represent syllables (e.g., Japanese hiragana and katakana)
3- alphabetic: individual symbols (fewer than 100) represent sounds (e.g., the Latin, Greek, and Cyrillic alphabets)
