1. Introduction

Today, we routinely encounter systems equipped with natural language interfaces. These systems can accept, understand and manipulate data expressed in human language. The interaction in natural language between the human user and the system can be so fluid that the user believes the interlocutor is also human, when in reality it is a computer system.

Computer scientists have long dreamed of such a feat: a system that talks to humans in their native languages. Furthermore, Alan Turing, a founding figure of computer science, made proficiency in understanding and using human language a condition for granting a system the attribute of intelligence.

2. Definitions

2.1. What is natural language processing (NLP)?

According to IBM, natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.

2.2. What is the Turing test?

The Turing Test involves three players: a computer, a human respondent and a human interrogator. All three are placed in separate rooms, or in the same room but physically separated, communicating only through terminals.

The interrogator asks both players a series of questions in natural language and, after a period, tries to determine which player is the human and which is the computer.

If the interrogator fails to determine which player is which, the computer is declared the winner and the machine is described as being able to think.


The Turing test illustrates the importance of natural language in artificial intelligence, since it plays a decisive role in determining whether a system is intelligent or not.

3. Scientific approaches in NLP

In this section, we consider the scientific approaches used to solve natural language problems. The literature of this field (NLP) identifies three major historical approaches:

3.1. Symbolic approach

The symbolic approach in Natural Language Processing (NLP) relies on the representation and manipulation of linguistic entities using formal symbols and rules. Historically, it was the first approach used to solve natural language problems. It began in the early days of computer science and is also known as the "linguistic knowledge-based approach".

Linguistic knowledge-based systems are designed to capture the knowledge of linguistic human experts and implement it as software systems.

It contrasts with statistical or machine learning approaches that learn patterns directly from data. In symbolic NLP, the emphasis is on explicit representation of linguistic knowledge and the use of rules for language understanding and generation. Examples of symbolic systems:

  • rule-based systems
  • logic-based systems, such as expert systems
  • etc.
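To make the idea concrete, here is a minimal sketch of a rule-based (symbolic) system: hand-written patterns, authored by a human expert, map user utterances to intents with no learning from data. The rule set and intent names below are illustrative assumptions, not part of any real system.

```python
import re

# Hand-crafted rules: each pattern encodes a linguist's knowledge directly,
# rather than being learned from a corpus. (Illustrative example only.)
RULES = [
    (re.compile(r"\b(hello|hi|hey)\b", re.IGNORECASE), "greeting"),
    (re.compile(r"\bweather\b", re.IGNORECASE), "weather_query"),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "farewell"),
]

def classify(utterance: str) -> str:
    """Return the first intent whose pattern matches, else 'unknown'."""
    for pattern, intent in RULES:
        if pattern.search(utterance):
            return intent
    return "unknown"
```

The strength of such a system is that every decision is explainable (a specific rule fired); its weakness, as noted later in this section, is the effort needed to cover the variability of real language.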

3.2. Statistical approach

Historically, the statistical approach appeared and prospered in the mid-1980s. It involves using statistical models to automatically learn patterns and relationships from large amounts of language data.

Also known as "Corpus-based approach", the statistical approach depends on finding patterns in large volumes of text (corpora). By recognizing these trends, the system can develop its own understanding of human language.

Examples of statistical methods in NLP:

  • n-gram models
  • Hidden Markov Models (HMM)
  • etc.
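The corpus-based idea can be sketched with a bigram model, the simplest of the n-gram family: instead of hand-written rules, the system counts word pairs in a corpus and estimates P(w2 | w1) as count(w1 w2) / count(w1). The three-sentence corpus below is a toy assumption for illustration.

```python
from collections import Counter

def bigram_probs(corpus):
    """Maximum-likelihood bigram probabilities learned from raw text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        # Count each word that has a successor, and each adjacent pair.
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Toy corpus: "the" is followed by "cat" twice and "dog" once,
# so the model learns P(cat | the) = 2/3 without any explicit rule.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
probs = bigram_probs(corpus)
```

The key contrast with the symbolic approach is that nothing here was told to the system: the "knowledge" is whatever trends the counts reveal in the corpus.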

3.3. Neural approach

The most recent paradigm used to solve NLP problems involves the use of neural networks. Also known in the literature as the "connectionist approach", it is sometimes combined with the statistical paradigm.

The neural approach has significantly advanced the field of NLP, leading to breakthroughs in various language understanding and generation tasks. These models often outperform traditional methods, especially in handling complex linguistic structures and capturing contextual information effectively.

It involves the use of the most recent machine learning and deep learning models, such as:

  • Neural Networks
  • Word Embeddings
  • Recurrent Neural Networks
  • Transformer Models
  • Sequence-to-Sequence Models
  • Transfer Learning and Pre-trained Models
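A word embedding, the building block of most neural NLP models, represents each word as a dense vector so that semantic similarity becomes geometric closeness. The 3-dimensional vectors below are made-up assumptions for illustration; real embeddings (e.g. those learned by word2vec or a Transformer) are trained on large corpora and have hundreds of dimensions.

```python
import math

# Toy embedding table (invented values): semantically related words
# are given nearby vectors, as a trained model would learn to do.
EMBEDDINGS = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# In this space, "king" is closer to "queen" than to "apple",
# which is how neural models capture contextual and semantic relations.
```

This geometric view of meaning is one reason neural models handle the variability of language better than fixed symbolic rules.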

While symbolic approaches were dominant in the early days of NLP, they have faced challenges, especially in handling the variability and complexity of natural language. Modern NLP systems often combine symbolic approaches with statistical and machine learning methods to achieve better performance, leveraging the strengths of both paradigms. This hybrid approach is commonly known as "statistical-symbolic integration" in NLP research.


Modified: Monday, 12 February 2024, 04:12