Recently, Bonaventure Dossou noticed something alarming about a widely used AI model: it described Fon, a language spoken by millions of people in Benin and neighboring countries, and by Dossou's mother, as "a fictional language."
This, it turns out, is not an uncommon outcome. Dossou is accustomed to the feeling that his culture is invisible even as technology so readily serves other people. Growing up, he had no Fon-language Wikipedia articles and no translation apps that might have helped him speak more fluent French with his mother. "Our personhood is taken away from us by technology that views something as basic as our name as an error," Dossou told me.
Even basic aspects of digital life—searching on Google, talking to Siri, using autocorrect, or simply typing on a smartphone—have long been closed off to much of the world. Now the generative-AI boom, despite its promise to bridge languages and cultures, may only further entrench English's dominance in life on and off the web. Decades of American hegemony and the rise of the internet made English a common tongue for business, politics, science, and entertainment. More than half of all websites are in English, yet more than 80 percent of people worldwide don't speak the language.
The technology's key ingredient is scale. To produce the humanlike language that has so impressed users of ChatGPT and other programs, today's AI needs orders of magnitude more computing power and training data than earlier generations did, and much of the data that generative AI "learns" from is simply scraped from the web. Because English-language content overwhelmingly dominates the internet, generative AI tends to work best in English, further entrenching a cultural bias in a technology promoted as able to "benefit humanity as a whole." Only a handful of other languages are similarly well positioned for the generative-AI era. More than 90 percent of websites are written in just ten languages: English, Russian, Spanish, German, French, Japanese, Turkish, Portuguese, Italian, and Persian.
Nearly 7,000 languages are spoken around the world, of which Google Translate supports 133. The chatbots from OpenAI, Google, and Anthropic cover even fewer. "There's a sharp cliff in performance," Sara Hooker, a computer scientist and the director of Cohere for AI, the nonprofit research arm of the software company Cohere, told me. Most of the best-performing language models, she said, support eight to ten languages; after that, there's almost a void. As chatbots, translation tools, and voice assistants become ever more common ways to navigate the web, hundreds of Indigenous and low-resource languages such as Fon, which lack enough textual data to train AI models, could be wiped out.
"Many people ignore those languages, both from a linguistic standpoint and from a computational standpoint," Ife Adebara, an AI researcher and computational linguist at the University of British Columbia, told me. Younger generations will have less and less reason to learn their ancestors' languages. And the problem is not simply that generative AI might reproduce the web's existing flaws: If it becomes the primary way people access the internet, billions could actually end up worse off than they are today.
Adebara and Dossou, who is now a computer scientist at McGill University in Canada, both work with Masakhane, a collective of researchers building AI tools for African languages. Masakhane, in turn, is part of a growing global effort racing against the clock to create software for languages underrepresented on the web, and in doing so perhaps to save them. The modeling of low-resource languages has advanced significantly in recent decades, Alexandra Birch, a machine-translation researcher at the University of Edinburgh, told me.
In a promising finding that demonstrates generative AI's capacity to surprise, computer scientists have discovered that some AI programs can pick up on features of communication that transcend any particular language. Perhaps the technology could be used to make the web more accommodating of lesser-known languages. An algorithm trained on a language with a large amount of data—say, English, French, or Russian—will also perform better in a language with fewer resources, such as Fon or Punjabi. "Every language is going to have something like a subject or a verb," said Antonios Anastasopoulos, a computer scientist at George Mason University. "So even if these manifest themselves in very different ways, you can learn something from all of the other languages." Birch compared this to how a child who grows up speaking both English and German can move easily between the two without having studied direct translations between them: rather than going word by word, the child grasps something more fundamental about communication.
Yet this finding alone may not be enough to change the trajectory. Building AI models for low-resource languages takes enormous time and effort. Cohere recently released a large language model with state-of-the-art performance in 101 languages, more than half of them low-resource; that single effort required 3,000 contributors across 119 countries, and there are still more than 6,900 languages to go. To generate training data, researchers often rely on native speakers to annotate existing texts, transcribe recordings, or answer questions, a process that can be slow and costly. Adebara spent years helping compile the largest and most comprehensive training dataset to date for 517 African languages: 42 gigabytes, just 0.4 percent of the size of the largest English training dataset available to the public. The proprietary datasets OpenAI uses to train software such as ChatGPT are likely far larger still.
Much of the scant text readily available in low-resource languages is poorly translated or of limited use. For many of these languages in Africa, Bible translations and websites run by missionary groups such as Jehovah's Witnesses long served as the primary textual sources. Even scarcer are the crucial examples needed to fine-tune AI, which must be deliberately created and curated: the data that makes a chatbot useful, human-sounding, and non-racist. Funding, computing power, and linguistic expertise are often just as elusive. And with too few training examples, language models can struggle to read non-Latin scripts or to correctly separate the words in sentences written in low-resource languages, to say nothing of languages without a writing system at all.
The problem is that even as tools for these languages develop slowly, generative AI is rapidly taking over the web. Synthetic content, churned out in hopes of making a quick buck, is proliferating across social media and search engines like a kind of gray goo.
Most websites rely on clicks and attention to make money from ads and subscriptions. Already, a vast share of the internet consists of content with little literary or informative value at all—an unending sea of garbage that exists merely to be clicked on. And what better way to reach a bigger audience than to translate that content into another language with the first AI program that appears in a Google search?
These translation programs, already prone to occasional inaccuracy, are especially problematic for low-resource languages. Earlier this year, researchers published preliminary findings showing that, compared with websites in English and other higher-resource languages, online content in these languages was more likely to have been (badly) translated from another source, and that the original content was itself more likely to be optimized for clicks. Training on heaps of this flawed material will make the low-resource versions of ChatGPT, Gemini, and Claude even worse, like asking someone to make a fresh salad using only a pound of ground beef. "You are already training the model on incorrect data, and the model itself tends to produce even more incorrect data," Mehak Dhaliwal, a computer scientist at UC Santa Barbara and one of the study's authors, told me, which could expose speakers of low-resource languages to false information. Those outputs will in turn spread across the web and probably be used to train future language models, creating a feedback loop of degrading performance for thousands of languages.
Consider a situation in which "you want to do a task, and you want a machine to do it for you," David Adelani, a DeepMind research fellow at University College London, explained to me. If you try to express that task in your native language and the technology doesn't understand, you won't be able to do it. You will be shut out of many things that come easily to people in wealthier nations. Every linguistic hurdle that already exists on the web will grow taller: AI won't be able to help you tutor your children, draft a work memo, summarize books, do research, manage a calendar, plan a vacation, fill out tax forms, or browse the web. And even when AI models can handle low-resource languages, the programs demand more memory and computing power to do so, and thus become significantly more expensive to run: worse results at higher costs.
However solid their grammar, AI models may also lack cultural context and nuance. Such programs long translated "good morning" in Yoruba to a variant of "someone has died," Adelani noted, because the same phrase can convey either meaning. For languages spoken by hundreds of millions of people in Southeast Asia, including Vietnamese and Indonesian, training data has been produced by translating from English, and the resulting models know far more about Big Ben and hamburgers than about regional cuisines and monuments, according to Holy Lovenia, a researcher at AI Singapore, the country's AI-research program.
For some languages, it may already be too late. As AI and the internet make English and other higher-resource languages ever more accessible to young people, Native American and other lesser-spoken languages may disappear. As technology advances, a growing share of people around the world, including you if you are reading this, will likely live a significant portion of their lives online. And to use the machine, you have to speak its language.
By default, AI, the internet, and the humans who use them may simply ignore less widely spoken languages, and those languages could be abandoned as a result. If nothing is done, Adebara told me, many languages could go extinct within a matter of years. She has already watched the use of languages she studied as an undergraduate decline. When people see that their languages lack technology, books, or even an orthography, they come to believe those languages are worthless.
Her own research is an attempt to change that, via a language model that can read and write in hundreds of African languages. Speakers of those languages compliment Adebara on her program, telling her, "I saw my language in the technology you built; I wasn't expecting to see it there," or "I didn't know that some technology would be able to understand some part of my language." They are thrilled, and, she told me, so is she.
A number of experts told me that the future of AI and low-resource languages lies not only in technical advances but also in exactly these kinds of conversations: Instead of blindly declaring that the world needs ChatGPT, ask native speakers what the technology could do for them. Rather than the all-powerful chatbots sold by tech giants, they might benefit more from better speech recognition in a local dialect, or from a tool that can read and digitize non-Roman script. Dossou told me that, rather than depending on Meta or OpenAI, he wants to build "a platform that is appropriate and proper to African languages and Africans, not trying to generalize as Big Tech does." By giving low-resource languages a presence on the web where there was previously almost none, such efforts could keep those languages alive and useful for future generations.
There is a Fon Wikipedia now, though with only 1,300 entries it remains a small fraction of the English Wikipedia's total. Dossou has developed AI software that can recognize names in African languages, and after personally translating hundreds of proverbs from French into Fon, he created a survey asking people to send him common Fon expressions and sentences. His mother has given him feedback on the translations, which has improved the software, and the French-Fon translator he built has helped him converse with her more effectively. "To have been able to communicate with her, I would have needed a machine-translation tool," he said. Now he is starting to understand her without the machine's help. And Dossou is starting to understand that his native language is Fon, not French, and that a person's original language should be determined by that person and their community, not by the internet or a piece of software.