In recent pre-print research, Meta AI demonstrated a revolutionary new “Megabyte” framework for creating generative pre-trained transformer (GPT) systems.
The new architecture, described as “promising” by Andrej Karpathy of OpenAI, a former director of artificial intelligence at Tesla, is designed to process vast amounts of data, including photographs, books, and video files, without the use of tokenization.
Tokenization is loosely analogous to file compression: GPT models translate raw bytes into tokens so they can process massive volumes of data more efficiently. The transformer processes those tokens, produces output tokens, and the output is then decoded back into text.
Tokenization lets an AI system represent long strings of data as sequences of numbers. For instance, OpenAI’s ChatGPT might process the phrase “my favorite colour is red” as the token string “3666, 4004, 3124, 318, 2266, 13”.
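As a rough illustration, here is how a byte-pair-encoding tokenizer maps text to integer IDs and back, using the open-source tiktoken library with its GPT-2 vocabulary; the exact IDs depend on which tokenizer is used, so they will not necessarily match the string quoted above.

```python
# Illustrative only: token IDs vary by tokenizer/model.
import tiktoken

enc = tiktoken.get_encoding("gpt2")              # GPT-2's byte-pair-encoding vocabulary
ids = enc.encode("my favorite colour is red.")   # text -> list of integer token IDs
print(ids)
print(enc.decode(ids))                           # decoding the IDs recovers the original string
```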
Even after tokenization, there is a hard limit on how much data today’s cutting-edge models can process. GPT-3.5 is limited to a little over 4,000 tokens, or about 3,000 words, while GPT-4 tops out at around 32,000 tokens, or roughly 24,000 words.
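In practice, developers work around these ceilings by counting tokens before sending a prompt. The sketch below is a minimal example of that check, assuming the open-source tiktoken package and its cl100k_base encoding (used by GPT-3.5- and GPT-4-era models); the limit constant is illustrative.

```python
import tiktoken

CONTEXT_LIMIT = 4_096                       # roughly GPT-3.5's window, per the figures above
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-3.5/GPT-4

def fits_in_context(prompt: str, limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if the prompt's token count fits within the model's context window."""
    return len(enc.encode(prompt)) <= limit

print(fits_in_context("my favorite colour is red. " * 1000))  # False: far more than 4,096 tokens
```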
In place of tokenization, Meta’s new Megabyte system uses a multiscale prediction architecture that can model sequences of more than 1 million bytes end to end: a byte sequence is split into fixed-size patches, a large “global” model operates over the patch representations, and a smaller “local” model predicts the individual bytes within each patch.
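The sketch below is a heavily simplified, illustrative take on that patch-based idea in PyTorch. It is not Meta’s implementation: the class name, dimensions, and patch size are invented for the example, and the causal masking and byte-offset details described in the paper are omitted.

```python
import torch
import torch.nn as nn

class ToyMegabyte(nn.Module):
    """Toy patch-based byte model: a global model over patches, a local model within patches."""
    def __init__(self, vocab=256, patch=8, d_local=64, d_global=128):
        super().__init__()
        self.patch, self.d_local = patch, d_local
        self.byte_embed = nn.Embedding(vocab, d_local)            # one embedding per raw byte value
        self.to_global = nn.Linear(patch * d_local, d_global)     # one vector per patch
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_global, nhead=4, batch_first=True), num_layers=2)
        self.from_global = nn.Linear(d_global, patch * d_local)   # broadcast patch context back to bytes
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_local, nhead=4, batch_first=True), num_layers=2)
        self.to_logits = nn.Linear(d_local, vocab)                 # per-byte prediction head

    def forward(self, bytes_in):                  # bytes_in: (batch, length), length divisible by patch
        b, n = bytes_in.shape
        p, d = self.patch, self.d_local
        x = self.byte_embed(bytes_in)                              # (b, n, d)
        patches = x.view(b, n // p, p * d)                         # group consecutive bytes into patches
        g = self.global_model(self.to_global(patches))             # long-range model runs over patches only
        local_in = self.from_global(g).view(b, n, d) + x           # give each byte its patch's context
        h = self.local_model(local_in.view(b * (n // p), p, d))    # cheap model predicts bytes inside a patch
        return self.to_logits(h).view(b, n, -1)                    # (b, n, 256) next-byte logits

model = ToyMegabyte()
dummy = torch.randint(0, 256, (2, 64))   # 64 raw bytes = 8 patches of 8 bytes
print(model(dummy).shape)                # torch.Size([2, 64, 256])
```

The point of the split is that the expensive long-range attention only runs over patch vectors rather than over every byte, which is what makes million-byte sequences tractable.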
Most English text is stored using standard 8-bit encodings, in which each character takes one byte of data. As a result, an AI system that can handle 1 million bytes without tokenization can work with roughly 1 million characters of text, somewhere in the region of 150,000 to 170,000 English words, a six- to sevenfold increase over GPT-4’s roughly 24,000-word limit.
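A quick back-of-the-envelope check of those figures, assuming roughly six characters per English word (including the space) and the common rule of thumb of about 0.75 words per token for GPT-4’s 32,000-token window:

```python
# Rough arithmetic only; both averages below are approximations.
context_bytes = 1_000_000          # Megabyte's modeled sequence length, one byte per character
chars_per_word = 6                 # ~5 letters per English word plus a space
megabyte_words = context_bytes / chars_per_word
gpt4_words = 32_000 * 0.75         # ~32k tokens at roughly 0.75 words per token
print(round(megabyte_words), round(megabyte_words / gpt4_words, 1))  # -> 166667 6.9
```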
In practical terms, that is enough to ingest a couple of typical full-length novels in a single pass, although a book as long as Leo Tolstoy’s War and Peace, at roughly 3 million characters, would still overrun the window. GPT-4, by comparison, currently maxes out at roughly ten feature-length articles in a single prompt.
Additionally, Meta’s Megabyte model matched or outperformed existing byte-based transformer models, such as DeepMind’s Perceiver AR, on ImageNet benchmarks and on tests involving audio-file modeling:
Megabyte matched Perceiver AR’s state-of-the-art performance while using only half the computation.
The effects of this research could be far-reaching. Tokenization is seen as a bottleneck in the field because of the hard limits it places on input data and the time and energy required to train models.
Without tokenization, it should be possible to train AI models that better accommodate non-English languages, particularly those that are difficult to represent in standard 8-bit characters.
This, in turn, could further the democratization of these technologies, allowing developers all over the world to build decentralized autonomous organization technologies and bitcoin trading bots in their local languages.
It would also improve the ability of models like ChatGPT to work with image, video, and audio files, potentially producing multimedia clips in roughly the same time, and with roughly the same energy use, as text.