New age of monster AI models

GPT-3, OpenAI’s program that mimics human language, kicked off a new trend in artificial intelligence toward ever larger models. How big will they get, and at what cost?

It was the year of monster AI models.

When OpenAI released GPT-3 in June 2020, the neural network’s apparent grasp of language was mind-boggling. It could generate convincing sentences, converse with people, and even auto-complete code. GPT-3 was also monstrous in scale, bigger than any other neural network ever built, and it kicked off a whole new trend in AI in which bigger is better.

Although GPT-3 tends to mimic the bias and toxicity of the online text it was trained on, and although it takes a tremendous and unsustainable amount of computing power to teach such a large model its tricks, we chose GPT-3 as one of our breakthrough technologies of 2021 – for better and for worse.

But GPT-3’s impact became even clearer in 2021. This year brought a proliferation of large AI models built by multiple tech firms and leading AI labs, many of them surpassing GPT-3 in size and ability. How big can they get, and at what cost?

GPT-3 grabbed the world’s attention not only because of what it could do, but because of how it did it. The striking jump in performance, especially GPT-3’s ability to generalize to language tasks it had not been specifically trained on, did not come from better algorithms (although it relies heavily on a type of neural network invented by Google in 2017, called a transformer), but from sheer size.

“We thought we needed a new idea, but we got it by scale,” said Jared Kaplan, a researcher at OpenAI and one of the designers of GPT-3, during a panel discussion in December at NeurIPS, a leading AI conference.

“We continue to see that hyperscaling of AI models leads to better performance, with seemingly no end in sight,” two Microsoft researchers wrote in an October blog post discussing the company’s massive Megatron-Turing NLG model, which was developed in collaboration with Nvidia.

What does it mean for a model to be large? The size of a model – a trained neural network – is measured by the number of parameters it has. These are the values in the network that get adjusted over and over again during training and are then used to make the model’s predictions. Roughly speaking, the more parameters a model has, the more information it can soak up from its training data, and the more accurate its predictions about new data will be.
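To make the idea concrete, here is a minimal sketch – a toy model with arbitrary layer sizes, not any lab’s actual code – of how parameters are counted using the PyTorch library:

    import torch.nn as nn

    # A toy language model: an embedding table, one transformer encoder layer,
    # and a projection back onto the vocabulary.
    vocab_size, d_model = 1000, 64
    model = nn.Sequential(
        nn.Embedding(vocab_size, d_model),
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
        nn.Linear(d_model, vocab_size),
    )

    # Every weight and bias tensor contributes its element count to the total.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params:,} trainable parameters")  # a few hundred thousand here, versus 175 billion for GPT-3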

GPT-3 has 175 billion parameters – more than 100 times as many as its predecessor, GPT-2. But GPT-3 is dwarfed by the class of 2021. Jurassic-1, a commercially available large language model launched in September by the US startup AI21 Labs, edged out GPT-3 with 178 billion parameters. Gopher, a new model released by DeepMind in December, has 280 billion parameters, while Megatron-Turing NLG has 530 billion. Google’s Switch Transformer and GLaM models have one trillion and 1.2 trillion parameters, respectively.

The trend is not just in the US. This year the Chinese tech giant Huawei built a 200-billion-parameter language model called PanGu, and Inspur, another Chinese firm, built Yuan 1.0, a model with 245 billion parameters. Baidu and Peng Cheng Laboratory, a research institute in Shenzhen, announced PCL-BAIDU Wenxin, a 280-billion-parameter model that Baidu is already using in a variety of applications, including internet search, news feeds, and smart speakers. And the Beijing Academy of AI announced Wu Dao 2.0, which has 1.75 trillion parameters.

Meanwhile, the South Korean internet search company Naver announced a model called HyperCLOVA with 204 billion parameters.

Each of these is a remarkable feat of engineering. For a start, training a model with more than 100 billion parameters is a complex plumbing problem: hundreds of individual GPUs – the hardware of choice for training deep neural networks – must be connected and synchronized, and the training data must be split into chunks and distributed among them in the right order at the right time.
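As a rough illustration of that plumbing – a toy example using PyTorch’s DistributedDataParallel with a placeholder model and random data, not the setup any of these labs actually used – the sketch below shows the basic pattern: each GPU holds a copy of the model, processes its own shard of the data, and synchronizes gradients with the others.

    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

    def main():
        dist.init_process_group("nccl")            # hook up and synchronize the participating GPUs
        rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(rank)

        model = DDP(nn.Linear(128, 128).cuda(rank), device_ids=[rank])
        data = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 128))
        sampler = DistributedSampler(data)         # each process sees its own chunk of the data
        loader = DataLoader(data, batch_size=32, sampler=sampler)
        opt = torch.optim.SGD(model.parameters(), lr=0.01)

        for epoch in range(2):
            sampler.set_epoch(epoch)               # reshuffle the chunks each epoch
            for x, y in loader:
                loss = nn.functional.mse_loss(model(x.cuda(rank)), y.cuda(rank))
                opt.zero_grad()
                loss.backward()                    # gradients are averaged across all GPUs here
                opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":                     # launch with: torchrun --nproc_per_node=<num_gpus> train.py
        main()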

Large language models have become prestige projects that showcase a company’s technical prowess. Yet few of these new models move the research forward beyond repeating the demonstration that scaling up works.

There are a handful of innovations. Once trained, Google’s Switch Transformer and GLaM use only a fraction of their parameters to make predictions, which saves computing power. PCL-BAIDU Wenxin combines a GPT-3-style model with a knowledge graph, a technique used in old-school symbolic AI to store facts. And alongside Gopher, DeepMind released RETRO, a language model with just 7 billion parameters that competes with models 25 times its size by cross-referencing a database of documents when it generates text. This makes RETRO cheaper to train than its giant rivals.
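To illustrate the sparse-routing idea behind Switch Transformer and GLaM, here is a simplified toy sketch in PyTorch (my own illustration, not Google’s implementation): a router picks one small “expert” sub-network per token, so only a fraction of the layer’s parameters is used for any given prediction.

    import torch
    import torch.nn as nn

    class ToySwitchLayer(nn.Module):
        def __init__(self, d_model=64, n_experts=8):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)   # scores each expert for each token
            self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

        def forward(self, x):                             # x: (tokens, d_model)
            scores = self.router(x).softmax(dim=-1)
            best = scores.argmax(dim=-1)                  # one expert chosen per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = best == i
                if mask.any():                            # only the chosen expert runs for these tokens
                    out[mask] = expert(x[mask]) * scores[mask, i:i+1]
            return out

    tokens = torch.randn(10, 64)
    print(ToySwitchLayer()(tokens).shape)   # torch.Size([10, 64])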

Yet despite the impressive results, researchers still do not fully understand why increasing the number of parameters leads to better performance. Nor do they have a fix for the toxic language and misinformation that these models learn and repeat. As the paper describing GPT-3 put it: “Internet-trained models have Internet-scale biases.”

DeepMind argues that RETRO’s database is easier to filter for harmful language than a monolithic black-box model is, but this has not been fully tested. More insight could come from the BigScience initiative, a consortium set up by the AI company Hugging Face and made up of around 500 researchers – many of them from big tech firms – who are volunteering their time to build and study an open-source language model.

In a paper published earlier this year, Timnit Gebru and her colleagues highlighted a series of unresolved issues with GPT-3-style models: “We ask whether enough thought has been put into the potential risks associated with developing them and strategies to mitigate these risks,” they wrote.

Despite all the effort that went into new language models this year, AI is still stuck in GPT-3’s shadow. In 10 to 20 years, large-scale models will be the norm, Kaplan said during the NeurIPS panel. If that’s the case, it is time for researchers to focus not only on how big a model is, but on what they do with it.
