BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is a new language model developed over the last year by more than 1,000 volunteer researchers as part of a project called BigScience, which was coordinated by the AI startup Hugging Face and funded by the French government. Unlike other, more well-known large language models, such as OpenAI’s GPT-3 and Google’s LaMDA, BLOOM is intended to be as transparent as possible: the researchers have disclosed details about the data it was trained on, the challenges encountered while developing it, and the method they used to evaluate its performance.
BLOOM has 176 billion parameters, slightly more than OpenAI’s 175-billion-parameter GPT-3, and BigScience claims that it offers levels of accuracy and toxicity similar to other models of the same size. It can generate coherent text in 46 natural languages and 13 programming languages that is nearly indistinguishable from human-written text. BLOOM is the first large language model of this size for languages such as Spanish and Arabic.
Introducing a new era of open-source LLMs
The BigScience research project began in early 2021 and involved over 1,000 researchers from more than 60 countries and 250 institutions. The BLOOM model was trained on the Jean Zay supercomputer, south of Paris, France.
Hugging Face bootstrapped and led the BigScience collaboration that created BLOOM, with strong support from the following teams:
- GENCI and the IDRIS team at CNRS
- The Megatron team at NVIDIA
- The DeepSpeed team at Microsoft
Is BLOOM non-toxic?
Hugging Face built on Nvidia’s Megatron and Microsoft’s DeepSpeed open-source projects, both based on the open-source PyTorch machine learning framework, as explained by Teven Le Scao, a researcher at Hugging Face. The researchers forked the Megatron and DeepSpeed codebases to train BLOOM so that it could respond in multiple languages. The most important ethical question, however, is whether the model will produce prejudiced and biased output. The team argues that because BLOOM was developed in the open and released under its own open license, modeled on the Responsible AI License, it is far easier for outsiders to scrutinize than its closed-source competitors.
“We’re attempting to define open source in the context of large AI models, because they don’t work like software,” Le Scao explained. Debating ethics in the abstract misses the point when the entire purpose of releasing the model openly is to let researchers understand language models in their entirety; the remaining concern is whether it can be abused by malicious actors. The goal of BLOOM’s license was to make the model as open as possible while still retaining some control over how organizations use it, Le Scao says.
Does this imply that BLOOM has the potential to dominate the LLM domain? Experts appear to be skeptical that it will cause significant changes: OpenAI, Google, and Microsoft are still out in front, Liang says. Even so, its creators are optimistic that BLOOM will find a place in the LLM space by helping people scrutinize the inner workings of language models, which have inherent biases and are currently monopolized by large players. BLOOM is also likely to contain inaccuracies and biased language, says Margaret Mitchell, an AI researcher and ethicist at Hugging Face.