Researchers at Google Brain have open-sourced the Switch Transformer, a natural-language processing (NLP) AI model. The model scales up to 1.6T parameters and trains up to 7x faster than the T5 NLP model, with comparable accuracy.
The team described the model in a paper published on arXiv. The Switch Transformer uses a mixture-of-experts (MoE) paradigm, replacing the feed-forward layer inside a Transformer block with several expert networks. Because only a subset of the model is used to process a given input, the number of model parameters can be increased while holding computational cost steady. Compared to Google’s state-of-the-art T5 NLP model, baseline versions of the Switch Transformer can achieve target pre-training perplexity metrics in 1/7 the training time. The 1.6T-parameter version outperforms a T5-XXL on the perplexity metric, with comparable or better performance on downstream NLP tasks, despite training on half the data.
The Transformer architecture has become the primary deep-learning model used for NLP research. Recent efforts have focused on increasing the size of these models, measured in number of parameters, with results that can exceed human performance on some benchmarks. A team from OpenAI, creators of the GPT-3 model, found that NLP performance does indeed scale with number of parameters, following a power-law relationship. In developing the Switch Transformer, the Google Brain team sought to maximize parameter count while holding the number of FLOPs per training example constant and training on “relatively small amounts of data.”
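To make the power-law relationship concrete, the following is a minimal sketch of a loss curve of the form L(N) = (N_c / N)^alpha. The function name and both constants are assumptions chosen for illustration; they are not taken from this article and should not be read as the fitted values of any particular model.

```python
def power_law_loss(num_params, critical_params=8.8e13, exponent=0.076):
    """Illustrative power-law loss: L(N) = (N_c / N) ** alpha.

    critical_params (N_c) and exponent (alpha) are placeholder
    assumptions, not values reported in the article.
    """
    return (critical_params / num_params) ** exponent


# Under a power law, each doubling of the parameter count multiplies
# the loss by the same constant factor, 2 ** -alpha.
for num_params in (1e9, 2e9, 4e9, 8e9):
    print(f"{num_params:.0e} parameters -> loss {power_law_loss(num_params):.4f}")
```

The practical consequence is that gains never stop but shrink in absolute terms: every further improvement of the same relative size requires another doubling of the model.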
To achieve this, the model uses a mixture-of-experts (MoE) scheme. MoE was developed in 1991 by a research team that included deep-learning pioneer Geoff Hinton, then at the University of Toronto and now at Google Brain. In 2017, Hinton and Google Brain colleagues, including Switch Transformer co-author Noam Shazeer, used MoE to create an NLP model based on a recurrent neural network (RNN) of 137B parameters, which achieved state-of-the-art results on language modeling and machine translation benchmarks.
The Switch Transformer uses a modified MoE algorithm called Switch Routing: instead of activating multiple experts and combining their output, Switch Routing chooses a single expert to handle a given input. This simplifies the routing computation and reduces communication costs, since individual expert models are hosted on different devices. One drawback to the scheme, however, is an increased chance of training instability, especially when using reduced-precision arithmetic, due to the “hard” switching decisions. The team mitigated this by reducing the scale factor for initializing the model parameters.
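The single-expert routing idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the team's Mesh-TensorFlow implementation: the function names, the random linear "experts", and the choice to scale the expert output by the router's probability are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def switch_route(token, router_weights):
    """Pick exactly one expert for a token (top-1 routing).

    Returns the chosen expert index and its router probability.
    """
    probs = softmax(matvec(router_weights, token))
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def switch_layer(tokens, router_weights, experts):
    """Run each token through its single chosen expert only."""
    outputs = []
    for token in tokens:
        expert, gate = switch_route(token, router_weights)
        expert_out = experts[expert](token)
        # Assumption: scale the output by the gate value so the router
        # still influences the result despite the hard choice.
        outputs.append([gate * y for y in expert_out])
    return outputs


def make_linear_expert(rng, dim):
    """Hypothetical stand-in for an expert network: one random linear map."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return lambda token: matvec(weights, token)


rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 3
router_weights = [[rng.gauss(0.0, 1.0) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
experts = [make_linear_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]

outputs = switch_layer(tokens, router_weights, experts)
chosen = [switch_route(t, router_weights)[0] for t in tokens]
print("expert chosen per token:", chosen)
```

Because `switch_layer` calls exactly one expert per token, adding more experts increases the number of stored weights without adding expert computation for any individual token. The `max` call is the "hard" decision mentioned above: a small change in the router logits can flip a token to a different expert.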
The team used Mesh-TensorFlow (MTF) to train the model, taking advantage of data and model parallelism. To investigate the performance of the architecture at different scales, the team trained models of different sizes, from 223M parameters up to 1.6T parameters, finding that the “most efficient dimension for scaling” was the number of experts. Model performance on pre-training and downstream NLP tasks was compared to T5 models requiring similar FLOPs per sample. Baseline-sized Switch Transformer models outperformed T5 on GLUE, SuperGLUE, and SQuAD benchmarks, while achieving a 7x speedup on pre-training time. The large-scale Switch Transformer, with 1.6T parameters and 2048 experts, outperformed the 11B-parameter T5-XXL model in pre-training perplexity, while finishing in 1/4 the time.
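A rough back-of-the-envelope calculation shows why the number of experts is a cheap dimension to scale. The sketch below assumes each expert is a two-layer feed-forward block with biases ignored, and that the router is a single weight matrix; the layer sizes are hypothetical and are not the configurations the team trained.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-layer feed-forward expert (biases ignored)."""
    return 2 * d_model * d_ff


def switch_ffn_stats(d_model, d_ff, num_experts):
    """Return (stored weights, weights used per token) for one sparse layer."""
    router = d_model * num_experts
    total_params = num_experts * ffn_params(d_model, d_ff) + router
    # One expert runs per token, so the expert weights touched per token
    # stay fixed; only the small router term grows with the expert count.
    active_params = ffn_params(d_model, d_ff) + router
    return total_params, active_params


# Hypothetical layer sizes, chosen only for illustration.
D_MODEL, D_FF = 768, 3072
for num_experts in (1, 8, 64, 2048):
    total, active = switch_ffn_stats(D_MODEL, D_FF, num_experts)
    print(f"{num_experts:>5} experts: {total:>14,} stored, {active:>10,} used per token")
```

Stored weights grow almost linearly with the number of experts, while the weights used for any one token grow only by the router term, which is small next to a single expert.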
In a discussion on Reddit, commenters pointed out that the Google Brain team did not compare their model’s performance to GPT-3, speculating that this was due to a lack of information in OpenAI’s published results. Another commenter noted:
[T]he time to accuracy gains are remarkable, albeit coming at a cost for hardware requirements. All these are non-issues for Google, but I can see why OpenAI isn’t too keen on these models, at least, so far.
Although Google has not released the pre-trained model weights for the Switch Transformer, the implementation code is available on GitHub.