The transformer model has emerged as one of the main highlights of recent advances in deep learning. It is used primarily for advanced natural language processing applications: Google uses it to improve its search engine results, and OpenAI used transformers to create its well-known GPT-2 and GPT-3 models.
Since its debut in 2017, the transformer architecture has evolved and branched out into many variants, expanding beyond language tasks into other areas. Transformers have been used to forecast time series. They are the driving force behind DeepMind’s protein structure prediction model, AlphaFold, and the foundation of OpenAI’s source code–generation model, Codex. They have also made their way into computer vision, where they are gradually replacing convolutional neural networks (CNNs) in complex tasks.
The Transformer Architecture
The Transformer architecture is based on an encoder-decoder structure, but it does not rely on recurrence or convolutions to generate output.
In a nutshell, the encoder, which is located on the left half of the Transformer architecture, is responsible for mapping an input sequence to a sequence of continuous representations, which is then fed into a decoder.
The decoder, located on the right half of the architecture, receives the encoder output as well as the decoder output from the previous time step to generate an output sequence.
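The encoder–decoder flow described above can be sketched at the level of shapes. The `encode` and `decode_step` functions below are placeholders (random outputs, toy dimensions), not real layers; they only show what goes in and what comes out at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy model dimension

def encode(input_seq):
    # Maps an input sequence to a sequence of continuous representations,
    # one d_model-sized vector per input token (random placeholder here).
    return rng.normal(size=(len(input_seq), d_model))

def decode_step(encoder_out, generated_so_far):
    # Consumes the encoder output plus everything generated so far and
    # returns the next output token (a toy integer id here).
    return int(rng.integers(0, 100))

source = ["the", "cat", "crossed", "the", "road"]
memory = encode(source)        # (5, d_model) continuous representations

outputs = []
for _ in range(3):             # generate three tokens, one per time step
    outputs.append(decode_step(memory, outputs))

print(memory.shape, len(outputs))
```

The key point is the interface: the encoder runs once over the whole input, while the decoder is called repeatedly, each time seeing the encoder output plus its own previous outputs.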
What Makes Transformers Exciting, and How Do They Work?
A traditional feed-forward neural network is not designed to keep track of sequential data: it maps each input to an output independently. This works for tasks like image classification but fails for sequential data like text. A machine learning model that processes text must not only compute each word but also consider how words appear in sequences and relate to one another, because a word’s meaning can change depending on what comes before and after it in a sentence.
The Transformer Model’s Attention Layers
Once the sentence has been transformed into a list of word embeddings, it is fed into the encoder module of the transformer. Unlike RNN and LSTM models, the transformer does not receive one input at a time: it can process an entire sentence’s worth of embedding values in parallel. This makes transformers far better suited to parallel hardware than their predecessors and lets them examine the context of the text in both forward and backward directions.
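As a minimal sketch of this step: one table lookup turns a tokenized sentence into a matrix of embeddings that the encoder can process in parallel. The vocabulary, dimensions, and random initialization below are illustrative assumptions, not values used by any particular transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (real models learn these values).
vocab = {"the": 0, "big": 1, "black": 2, "cat": 3, "crossed": 4, "road": 5}
d_model = 8                                       # embedding dimension
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "big", "black", "cat"]
token_ids = [vocab[t] for t in tokens]

# One lookup produces the whole sentence as a (seq_len, d_model) matrix,
# so every position can be processed at once rather than step by step.
X = embedding_table[token_ids]
print(X.shape)  # (4, 8)
```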
The transformer uses “positional encoding” to preserve the sequential nature of the words in the sentence, which essentially means that it modifies the values of each embedding vector to represent its location in the text.
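The sinusoidal scheme from the original Transformer paper is one common way to do this: each position gets a unique pattern of sine and cosine values that is added to the embedding at that position. A minimal sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]     # (1, d_model) dimension indices
    # Each pair of dimensions uses a different wavelength.
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Even dimensions get sine, odd dimensions get cosine.
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=4, d_model=8)
print(pe.shape)  # (4, 8)

# Adding pe to the embedding matrix injects word order: X = X + pe
```

Because the encoding is simply added to the embeddings, the rest of the model needs no special machinery to be made aware of word order.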
The input is then passed to the first encoder block, which processes it via an “attention layer.” The attention layer attempts to identify the relationships between the words in a sentence. Consider the following sentence: The big black cat crossed the road after it dropped a bottle on its side. The model must associate “it” with “cat” and “its” with “bottle” in this case. Similarly, it should capture associations such as “big” with “cat” or “crossed” with “cat.”
In other words, the attention layer takes a list of word embeddings that represent the values of individual words and generates a list of vectors that represent both individual words and their relationships to one another. The attention layer contains several “attention heads,” each of which can capture different types of word-to-word relationships.
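A single attention head can be sketched with scaled dot-product attention. Real transformers learn the query/key/value projection matrices; here they are random placeholders, used only to show the shapes and the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

X = rng.normal(size=(seq_len, d_model))      # word embeddings
W_q = rng.normal(size=(d_model, d_model))    # learned in a real model
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Word-to-word affinities, scaled to keep the softmax well-behaved.
scores = Q @ K.T / np.sqrt(d_model)

# Softmax over each row: how much each word attends to every other word.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output vector is a weighted mix of the value vectors, so it
# encodes both the word itself and its relationships to the others.
out = weights @ V
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs several copies of this computation with different projection matrices and concatenates the results, letting each head specialize in a different kind of relationship.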
The attention layer’s output is fed into a feed-forward neural network, which converts it to a new vector representation and passes it on to the next block. Transformers stack many attention and feed-forward layers to gradually capture more complex relationships.
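The feed-forward network in each block is applied position-wise: the same two-layer MLP processes every word’s vector independently. A sketch with random placeholder weights and a toy inner dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32   # d_ff is typically larger than d_model

attn_out = rng.normal(size=(seq_len, d_model))  # output of the attention layer

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Expand to d_ff with a ReLU nonlinearity, then project back to d_model,
# applying the same weights to every position in the sequence.
hidden = np.maximum(0, attn_out @ W1 + b1)
ffn_out = hidden @ W2 + b2
print(ffn_out.shape)  # (4, 8) — ready for the next attention layer
```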
The decoder module’s job is to convert the encoder’s attention vector into output data (e.g., the translated version of the input text). During the training phase, the decoder has access to both the encoder’s attention vector and the expected outcome (e.g., the translated string).
To process the expected outcome and generate attention vectors, the decoder employs the same tokenization, word embedding, and attention mechanism as the encoder. These attention vectors, together with the encoder’s output, are then passed to a second attention layer (often called encoder–decoder attention), which establishes relationships between the input and output values.
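Encoder–decoder (“cross”) attention differs from self-attention only in where the queries, keys, and values come from: queries come from the decoder, while keys and values come from the encoder output, so each output position can attend to every source position. A sketch with random placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 5, 3, 8

enc_out = rng.normal(size=(src_len, d_model))  # encoder representations
dec_x = rng.normal(size=(tgt_len, d_model))    # decoder self-attention output

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = dec_x @ W_q                      # queries come from the decoder
K, V = enc_out @ W_k, enc_out @ W_v  # keys/values come from the encoder

# (tgt_len, src_len): how strongly each output word attends to each input word.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

cross = weights @ V
print(cross.shape)  # (3, 8)
```

In a translation model, these attention weights are where source-language and target-language words effectively get mapped to one another.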
This is the section of a translation application where words from the source and target languages are mapped to one another. As in the encoder module, the decoder’s attention vector is then processed by a feed-forward layer. Its output is finally mapped to a very large vector the size of the target vocabulary (in the case of language translation, this can span tens of thousands of words).
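That final step can be sketched as a linear projection followed by a softmax: each decoder vector is mapped to one score per vocabulary word, and the softmax turns the scores into a probability distribution over the next output word. The vocabulary size and weights below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10_000   # real vocabularies: tens of thousands

decoder_vec = rng.normal(size=(d_model,))          # one decoder output vector
W_out = rng.normal(size=(d_model, vocab_size))     # learned output projection

logits = decoder_vec @ W_out                       # one score per vocab word

# Softmax (shifted by the max for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_word_id = int(np.argmax(probs))  # greedy choice of the next target word
print(probs.shape)  # (10000,)
```

At inference time, the model can pick the highest-probability word as above (greedy decoding) or use a search strategy such as beam search over these distributions.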