Why Are Transformers Considered an Upgrade over RNNs and LSTMs?

Artificial intelligence is a disruptive technology that finds new applications every day. With each innovation in artificial intelligence technologies such as machine learning, deep learning, and neural networks, the possibilities for the field widen further.

In the past few years, one form of neural network has been gaining particular popularity: the Transformer. Transformers employ a simple yet powerful mechanism called attention, which enables artificial intelligence models to selectively focus on certain parts of their input and thus reason more effectively. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important.

The Transformer is designed to solve sequence-to-sequence tasks while handling long-range dependencies with ease. Considered a significant breakthrough in natural language processing (NLP), its architecture differs from that of recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Prior to its introduction in a 2017 research paper, the state-of-the-art NLP methods had all been based on RNNs (e.g., LSTMs). An RNN processes data sequentially, in a loop-like fashion, allowing information to persist from one step to the next. The problem is that when the gap between a piece of relevant information and the point where it is needed becomes very large, the network becomes ineffective. In other words, RNNs struggle with long sequences because of vanishing gradients and long-range dependencies.
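To see why, consider the loop at the heart of a vanilla RNN. The sketch below uses toy dimensions and random weights purely for illustration (it is not code from the article or any specific library): each hidden state depends on the previous one, so the computation cannot be parallelized across time steps, and information from early tokens has to survive many repeated transformations.

```python
import numpy as np

# A minimal sketch of a vanilla RNN loop; dimensions and weights are
# illustrative assumptions, not the article's implementation.
d_in, d_hidden, seq_len = 8, 16, 100
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (d_in, d_hidden))      # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (d_hidden, d_hidden))  # hidden-to-hidden weights

x = rng.normal(size=(seq_len, d_in))  # a sequence of input vectors
h = np.zeros(d_hidden)                # initial hidden state

for t in range(seq_len):
    # Step t depends on the hidden state from step t-1, so the loop is
    # inherently sequential and cannot be parallelized across time steps.
    h = np.tanh(x[t] @ W_xh + h @ W_hh)

# Information from early tokens must survive repeated passes through W_hh
# to reach later steps, which is where vanishing gradients become a problem.
```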

To counter this, we have LSTM and attention mechanisms. Unlike a plain RNN, an LSTM uses a gating mechanism to decide which information in the cell state to forget and which new information from the current step to remember. This enables it to maintain a cell state that runs through the whole sequence, selectively remembering what is important and forgetting what is not.
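A minimal sketch of a single LSTM step, again with illustrative dimensions and random parameters rather than the article's own code, shows how the forget and input gates modulate the cell state that carries information along the sequence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions and random parameters, purely for illustration.
d_in, d_hidden = 8, 16
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = (rng.normal(0, 0.1, (d_in + d_hidden, d_hidden)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(z @ W_f)          # forget gate: what to drop from the cell state
    i = sigmoid(z @ W_i)          # input gate: what new information to admit
    c_tilde = np.tanh(z @ W_c)    # candidate cell contents
    o = sigmoid(z @ W_o)          # output gate: what to expose as the hidden state
    c = f * c_prev + i * c_tilde  # cell state carries information along the sequence
    h = o * np.tanh(c)
    return h, c

# One step on a random input, starting from zero states.
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hidden), np.zeros(d_hidden))
```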

Both RNNs and LSTMs are popular building blocks for sequence-to-sequence models. In simpler words, sequence-to-sequence (seq2seq) models are a class of machine learning models that translate an input sequence into an output sequence. Seq2seq models consist of an encoder and a decoder. The encoder is responsible for forming an encoded representation of the input (a latent or context vector). When this latent vector is passed to the decoder, the decoder generates a target sequence by predicting, at each time step, the most likely next output word. The target sequence can be in another language, a sequence of symbols, a copy of the input, and so on. These models are particularly well suited to translation, where a sequence of words in one language is transformed into a sequence of different words in another language.
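Structurally, a recurrent seq2seq model boils down to "encode everything into one context vector, then decode step by step". The sketch below is only a wiring diagram under that assumption; `encoder_step`, `decoder_step` and `start_token` are hypothetical names standing in for single recurrent steps like the ones sketched above.

```python
import numpy as np

def seq2seq_translate(src_tokens, encoder_step, decoder_step, start_token, max_len=50):
    # Encoder: compress the whole source sequence into one context vector.
    h = np.zeros(16)
    for x_t in src_tokens:
        h = encoder_step(x_t, h)
    context = h  # the latent / context vector handed to the decoder

    # Decoder: emit the target sequence one token at a time, conditioned on
    # the context vector and the previously generated token.
    outputs, y, state = [], start_token, context
    for _ in range(max_len):
        y, state = decoder_step(y, state)
        outputs.append(y)
    return outputs
```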

The same 2017 research paper, titled “Attention Is All You Need” by Vaswani et al. from Google, notes that RNNs and LSTMs are constrained by sequential computation, which inhibits parallelization. Even an LSTM therefore struggles when sentences are too long. A CNN-based seq2seq model can be run in parallel, reducing training time compared with an RNN, but it requires a large amount of memory.

Transformers get around these limitations by perceiving entire sequences simultaneously. They enable parallelization of language processing: all the tokens in a given body of text are analyzed at the same time rather than in sequence. Although the Transformer, like earlier models, transforms one sequence into another using two parts (an encoder and a decoder), it differs from the previously described sequence-to-sequence models because, as mentioned above, it relies on the attention mechanism.

The attention mechanism emerged as an improvement over encoder-decoder based neural machine translation systems in natural language processing. It allows a model to consider the relationships between words regardless of how far apart they are, addressing the long-range dependency problem. It achieves this by enabling the decoder to focus on different parts of the input sequence at every step of output generation, so dependencies can be identified and modeled irrespective of their distance in the sequence.
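As a concrete sketch of the idea (toy dimensions and random weights, an illustration rather than the paper's implementation), scaled dot-product attention computes a weight between every pair of positions in a handful of matrix multiplications. The whole sequence is processed at once, with no time-step loop, and a token can attend to another token a hundred positions away just as easily as to its neighbour.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions and random weights, purely illustrative.
seq_len, d_model = 100, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))              # embeddings for all tokens at once
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # no loop over time steps
scores = Q @ K.T / np.sqrt(d_model)                  # similarity of every position to every other
weights = softmax(scores, axis=-1)                   # attention weights, each row sums to 1
output = weights @ V                                 # each position is a weighted mix of all positions

# weights[0, 99] links position 0 directly to position 99: distance does not matter.
```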

Unlike previous seq2seq models, Transformers do not discard the intermediate encoder states, nor do they rely only on the final state (the context vector) to initialize the decoder when generating predictions for an input sequence. Moreover, by processing sentences as a whole and learning the relationships between their tokens, they avoid recurrence.

Some of the popular Transformers are BERT, GPT-2 and GPT-3. BERT, or Bidirectional Encoder Representations from Transformers, was created and published in 2018 by Jacob Devlin and his colleagues at Google. OpenAI’s GPT-2 has 1.5 billion parameters and was trained on a dataset of 8 million web pages, with the objective of predicting the next word in 40GB of Internet text. In contrast, GPT-3 was trained on roughly 500 billion words and consists of 175 billion parameters. GPT-3 is often described as a major leap for artificial intelligence because of its strikingly human-like text generation. We also have the Detection Transformer (DETR) from Facebook, which was introduced for object detection and panoptic segmentation.
