Generative Language Models
Deep learning models for NLP have to cover the syntactic (POS tagging, normalization, lemmatization, etc.), semantic (Word Sense Disambiguation, Named Entity Recognition, concept extraction, etc.), and pragmatic (sarcasm detection, aspect extraction, polarity, etc.) layers of language. Networks that operate on so many layers and cover multiple aspects of language are called Language Models (LMs) and form the backbone of modern NLP. Many LMs are also generative models: they learn the patterns present in their inputs and use them to generate new examples. They can also be used to generate adversarial examples when needed, leading to more balanced training than classic unsupervised methods provide. Some well-known generative models include GANs and Transformers.
Transformers are considered the de facto standard architecture today. A Transformer contains a series of self-attention layers distributed across its components. Self-attention is a mechanism that relates different positions of the same sequence in order to compute a representation of that sequence. The Transformer model itself is conceptually simple and consists of pairs of encoders and decoders. Encoders encapsulate layers of self-attention coupled with feed-forward layers, whereas decoders encapsulate self-attention layers followed by encoder-decoder attention and feed-forward layers. The attention computation is performed in parallel over several heads and the results are then combined. The result is termed multi-head attention, and it gives the model the ability to combine information from different representation subspaces (e.g., multiple weight matrices) at different positions (e.g., different words in a sentence). Its outputs are fed either to further encoders or into decoders, depending on the architecture. There is no fixed number of encoders and decoders that can be included in this architecture, but they are typically paired. In newer architectures, encoders and decoders can also be used separately for different tasks, e.g., an encoder for Question Answering and a decoder for Text Comprehension.
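The following is a minimal sketch of scaled dot-product attention combined into multi-head attention, illustrating how the model dimension is split into subspaces that attend in parallel and are then recombined. The weight matrices and input here are random toy values, and the sketch deliberately omits masking, residual connections, layer normalization, and the feed-forward sublayers of a full Transformer block.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over key positions
    return weights @ V                                        # (heads, seq, d_head)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Split the model dimension into subspaces, attend in parallel, then
    concatenate the heads and project back (the multi-head combination)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def project(W):  # project input and split it into heads
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(W_q), project(W_k), project(W_v)
    heads = scaled_dot_product_attention(Q, K, V)             # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                        # final output projection

# Toy usage: 4 tokens, model dimension 8, 2 attention heads.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) * 0.1 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (4, 8)
```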
BERT is a Transformer model pre-trained on a large corpus of multilingual data in a self-supervised fashion, i.e., trained only on raw text without any kind of labeling. It was pre-trained with two objectives:
- Masked Language Modeling (MLM): 15% of the words in the input sentence are randomly masked; the entire masked sentence is then run through the model, which has to predict the masked words. This differs from Recurrent Neural Networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. The masking allows the model to learn a bidirectional representation of the sentence (see the sketch after this list).
- Next Sentence Prediction (NSP): Given two sentences A and B presented together in the input, the model predicts whether sentence B actually follows sentence A in the original text or not.
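As a small illustration of the MLM objective, the sketch below uses the Hugging Face Transformers library and the bert-base-uncased checkpoint (an English model rather than the multilingual one mentioned above) to fill in a masked token; the NSP objective is not shown.

```python
from transformers import pipeline  # Hugging Face Transformers library

# Load a pre-trained BERT model together with its masked-language-modeling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT replaces the [MASK] token with its highest-probability predictions,
# using both the left and the right context (bidirectional representation).
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>10}  (score: {prediction['score']:.3f})")
```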
Using these mechanisms, the model learns an inner representation of the languages in the training set that can then be used to extract features useful for downstream tasks (classification, named entity recognition, machine translation, etc.).
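One way to use this learned representation as a feature extractor, sketched here with the Hugging Face Transformers and PyTorch libraries and the bert-base-uncased checkpoint (again an assumption, not a prescribed setup), is to take the encoder's hidden states and feed them to a downstream model; the mean pooling shown is just one simple choice of sentence representation.

```python
import torch
from transformers import AutoTokenizer, AutoModel  # Hugging Face Transformers

# Load a pre-trained BERT encoder without any task-specific head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "BERT produces contextual features for downstream tasks."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; these features can feed a classifier,
# a named-entity tagger, or any other downstream model.
token_embeddings = outputs.last_hidden_state        # shape: (1, num_tokens, 768)
sentence_embedding = token_embeddings.mean(dim=1)   # simple pooled sentence vector
print(token_embeddings.shape, sentence_embedding.shape)
```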