The latest advances in Automatic Language Processing
4 questions to Mehdi Mirzapour, researcher in Natural Language Processing (NLP) at ContentSide
In the field of natural language, the technologies implemented have evolved considerably over the last two or three years.
The field of Natural Language Processing (NLP) or automatic language processing (ALP) has been broken down into several disciplines such as Natural Language Understanding (NLU) and Natural Language Generation (NLG).
Word/Document Embeddings are also essential techniques for any NLP system.
Each of these variations has distinct objectives and is based on families of algorithms and specific approaches. Mehdi Mirzapour provides an overview of the state of the art.
Natural language analysis is nowadays used in many areas. Experts speak in particular of NLU and NLG. Can you briefly present these two approaches?
It should be remembered that the analysis of natural language using software services calls on several disciplines: artificial intelligence, cognitive science and linguistics.
The linguistics used in this context is called computational linguistics: it involves representing text in numerical form so that processing can be applied to it.
In this context, NLU and NLG are the main components of an overall NLP system.
NLG aims to build software systems capable of producing meaningful texts. For example, the generation of summaries from texts can be automated. This potential can be put to good use in many cases: e-commerce (generating product sheets), publishing, virtual assistants, business analysis, or healthcare, where a patient's hospital report can be generated from medical records.
The ideal NLU system analyses a text and assigns it the best meaning among multiple possible interpretations or meaning representations. These representations can be used for different purposes such as text classification or sequence labelling. From a simplified, high-level perspective, NLU can be seen as the inverse process of NLG.
NLU and NLG have been benefiting from the potential of deep learning for some years. Can you tell us about this technology and its advantages?
Both NLU and NLG benefit from the development of neural networks, which make deep learning possible for a variety of tasks (translation, etc.).
Several architectures of these networks, i.e. the way the different layers of neurons are organised and connected, are used today: MLP (Multi-Layer Perceptron), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) and CNN (Convolutional Neural Network). In practice, the most common approach is to build models that combine several of these architectures, as in the sketch below.
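As a rough illustration of combining architectures (not from the interview), here is a minimal PyTorch sketch of a text classifier that stacks a convolutional layer on top of an embedding layer and feeds it into an LSTM. All names and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    """Toy text classifier combining a CNN layer with an LSTM layer."""
    def __init__(self, vocab_size=10_000, embed_dim=128, conv_channels=64,
                 lstm_hidden=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # 1-D convolution over the token dimension extracts local n-gram features.
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)
        # The LSTM captures longer-range dependencies over the convolved features.
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        x = self.conv(x.transpose(1, 2))       # (batch, conv_channels, seq_len)
        x = torch.relu(x).transpose(1, 2)      # (batch, seq_len, conv_channels)
        _, (h_n, _) = self.lstm(x)             # h_n: (1, batch, lstm_hidden)
        return self.classifier(h_n[-1])        # (batch, num_classes)

model = CnnLstmClassifier()
logits = model(torch.randint(0, 10_000, (4, 20)))  # 4 dummy sequences of 20 tokens
print(logits.shape)  # torch.Size([4, 2])
```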
These approaches provide good results, measured in terms of accuracy, or more precisely the absence of errors. To date, NLU has shown the most remarkable progress, particularly on classification tasks.
Based on neural networks, an approach called "Word Embedding" is now used upstream of NLU and NLG. Can you summarise how it works and its benefits?
Remember that computers do not understand words and only work with numbers.
The first step is therefore to represent them in numerical form. Previously, researchers used a statistical approach based on the number of occurrences of each word in a document, an approach called "Bag of Words".
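To make this concrete, the classic bag-of-words representation can be reproduced in a few lines with scikit-learn's CountVectorizer; this is an illustrative sketch, and the example sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents; each becomes a vector of raw word counts.
docs = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [1 1 1 0 0 0 2]]
```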
More advanced methods such as LSA (Latent Semantic Analysis) or TF-IDF (Term Frequency-Inverse Document Frequency) took into account both the presence of words in a document and their distribution across a corpus (a set of documents).
All these representations allowed the application of different machine learning algorithms such as decision trees.
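As a minimal sketch of that pipeline (with a made-up corpus and labels), TF-IDF features can feed a classical machine-learning model such as a decision tree using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Tiny, made-up training corpus with binary sentiment labels.
docs = ["great product, works well", "terrible quality, broke quickly",
        "excellent value", "very disappointing purchase"]
labels = [1, 0, 1, 0]

# TF-IDF weights each word by its frequency in the document,
# discounted by how common it is across the whole corpus.
clf = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
clf.fit(docs, labels)

print(clf.predict(["works really well"]))  # predicted label for a new snippet
```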
The emergence of more modern techniques, including deep learning, has given rise to new statistical algorithms such as the static Word2Vec, GloVe and FastText algorithms, in which words are represented as vectors based on co-occurrences found in large amounts of textual data.
Dubbed "Word Embedding", these approaches improve disambiguation over previous methods. These approaches are used to power NLUs and NLGs for a variety of use cases.
The most recent advances, in particular "Dynamic Embedding", launched with Google's BERT (Bidirectional Encoder Representations from Transformers) at the end of 2018, further improve the results. What are the specifics?
These transformer models, notably BERT and its variants for English and French (RoBERTa, ALBERT, CamemBERT and FlauBERT), are even more powerful in terms of results because they use "Dynamic Embedding": a polysemous word receives a different vector representation depending on its context. These representations are obtained through an initial pre-training on a huge amount of data, without the need for human annotation.
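A minimal sketch with the Hugging Face transformers library (the model name, sentences and helper function are illustrative assumptions) shows how the same polysemous word receives different vectors depending on its context:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

# The polysemous word "bank" gets a different vector in each sentence.
v1 = word_vector("He deposited cash at the bank.", "bank")
v2 = word_vector("They walked along the river bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # below 1.0: the vectors differ
```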
To achieve good results with BERT, it is not necessary to fine-tune it on a dataset as large as the one it was originally pre-trained on. This saves the time and cost of creating datasets.
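As a rough sketch of that fine-tuning step (toy data, illustrative hyperparameters, deliberately tiny training loop), a pre-trained BERT can be adapted to a small labelled dataset:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# A deliberately tiny labelled dataset: the heavy lifting was done during pre-training.
texts = ["I love this movie", "This film was awful"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                            # a few epochs are often enough
    outputs = model(**batch, labels=labels)   # loss is computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(outputs.loss.item())
```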
The BERT tokenizer does not work on whole words but on parts of words (sub-word units), which solves the out-of-vocabulary problem for unknown words.
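A quick illustration with the Hugging Face tokenizer (the rare word is invented for the example): a word absent from the vocabulary is split into known sub-word pieces rather than mapped to a single "unknown" token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A common word is in the vocabulary and stays whole.
print(tokenizer.tokenize("language"))        # ['language']
# An invented word is split into sub-word pieces (continuations prefixed with '##').
print(tokenizer.tokenize("contentsideology"))
```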