Future AI Models May Be Affected by AI-Generated Data

Programs that can generate text, computer code, images, and music are now readily available to the average person, thanks to the rise of generative artificial intelligence. And we already use them: AI-generated material is spreading across the Internet, and text produced by “large language models” now appears on thousands of websites, including CNET and Gizmodo.

However, as AI developers scour the web for training material, AI-generated content may soon find its way into the data sets used to teach new models to behave like humans. Some experts believe this will inadvertently introduce errors that build up with each succeeding generation of models.

A growing body of evidence supports this idea. It suggests that a training diet of AI-generated text, even in modest amounts, eventually becomes “poisonous” to the model being trained, and for now there are few obvious remedies. According to Rik Sarkar, a computer scientist at the School of Informatics at the University of Edinburgh in Scotland, it might not be a problem right now or in, say, a few months, but he believes it will come into play in a few years.

The danger of AI models corrupting themselves may be loosely comparable to a particular 20th-century conundrum. After the first atomic bombs were detonated at the close of World War II, decades of nuclear testing added a dash of radioactive fallout to Earth’s atmosphere. Because steelmaking draws in atmospheric air, steel manufactured after those tests carried slightly elevated radioactivity. For especially radiation-sensitive applications, such as Geiger counter consoles, that was a problem: a Geiger counter cannot very well flag itself. The result was a rush on the limited supply of low-radiation metal, and scavengers salvaged scraps of prewar steel from historic shipwrecks. Now some insiders think generative AI is about to go through a similar cycle, with training data in place of steel.

Researchers can watch this poisoning happen. Take a language model trained on human-generated data. Use the model to produce some AI output. Then use that output to train a new instance of the model, use the new model’s output to train a third version, and so on. Errors compound with each iteration. The tenth model, asked to write about old English buildings, babbles about jackrabbits.
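To make the loop concrete, here is a minimal toy sketch in Python, with a one-dimensional Gaussian fit standing in for a language model. The sample size, generation count, and printed quantities are illustrative choices, not the settings of the actual experiments.

```python
# Toy sketch of the recursive training loop described above. A Gaussian
# fit stands in for the language model; nothing here reproduces the real
# experiments, which used neural models such as OPT-125m.
import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: "human" data, drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(1, 51):
    # "Train" on the current data set by estimating its mean and spread.
    mu, sigma = data.mean(), data.std()
    # Generate synthetic output from the trained model and hand it to the
    # next generation as its entire training set.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

# Because each generation fits the previous generation's samples rather
# than the original human data, estimation errors accumulate, and the
# fitted parameters drift away from the original values of 0 and 1.
```

Analyses of simple setups like this report that the fitted spread tends to shrink toward zero over many generations; the point of the toy run is only that the errors compound rather than average out.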

According to Ilia Shumailov, a machine-learning researcher at the University of Oxford, there comes a point at which your model is essentially useless.

Shumailov and his colleagues call this process “model collapse.” They observed it in the language model OPT-125m, in an AI model that generates handwritten-looking digits, and even in a simple model that tries to separate two probability distributions. Even in the most basic models, Shumailov says, it has already occurred, and he guarantees it is already happening in more intricate models as well.

In a recent preprint study, Sarkar and his colleagues in Madrid and Edinburgh ran an analogous experiment with a diffusion model, a type of AI image generator. Their first model in the series could generate recognizable flowers or birds; by the third model, those images had become hazy.

Other experiments showed that the training data set caused harm even when it was only partially AI-generated, Sarkar says: trouble starts once a sizable portion of the data is produced by AI. Exactly how much AI-generated content it takes to disrupt which kinds of models is still being researched.
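One hedged way to explore that question in the toy setting above is to make only a fraction of each generation’s training set synthetic and draw the rest fresh from the original “human” distribution. The `synthetic_fraction` knob below is purely illustrative, and in this simplified setting the fresh human data anchors the fit more strongly than it appears to for real neural models, so the sketch shows the shape of the experiment rather than the published findings.

```python
# Variation on the earlier toy loop: each generation trains on a mixture
# of freshly drawn "human" data and synthetic data from the previous fit.
# The synthetic_fraction values are arbitrary and only illustrate how one
# might probe a data set's tolerance for AI-generated content.
import numpy as np

def run_mixture(synthetic_fraction: float, generations: int = 50,
                n: int = 50, seed: int = 0) -> tuple[float, float]:
    """Return the fitted (mu, sigma) after the final generation."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, size=n)                 # generation 0: human data
    mu, sigma = data.mean(), data.std()
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()             # "train" on current data
        n_synth = int(round(synthetic_fraction * n))
        synthetic = rng.normal(mu, sigma, size=n_synth)        # model output
        human = rng.normal(0.0, 1.0, size=n - n_synth)         # fresh human data
        data = np.concatenate([synthetic, human])
    return mu, sigma

for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    mu, sigma = run_mixture(frac)
    print(f"synthetic fraction {frac:.2f}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```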

Both groups’ experiments used relatively modest models, which are smaller and require fewer training examples than the likes of the language model GPT-4 or the image generator Stable Diffusion. Larger models might prove more resilient to model collapse, but experts say there is not much evidence to support that hope.

The research so far suggests that a model suffers most at the “tails” of its data, the elements that are represented less frequently in the training set. Because these tails include data that lies further from the “norm,” model collapse could cause the AI’s output to lose the diversity that, experts say, distinguishes human data. Shumailov is particularly worried that this will amplify the biases models already hold against marginalized groups. It is clear, he says, that biased modelling will get worse in the future, and reducing it will take deliberate effort.
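In the toy setting sketched earlier, one rough way to watch the tails is to track how many of each generation’s samples fall in the low-probability region of the original distribution. The two-sigma threshold below is an arbitrary choice for illustration; a decline in that share across generations would correspond to the loss of rare, atypical examples described above.

```python
# Rough tail-tracking for the toy Gaussian loop: count how many samples
# in each generation lie beyond two standard deviations of the *original*
# human distribution. Threshold and sizes are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(0.0, 1.0, size=50)             # generation 0: human data

for generation in range(1, 51):
    mu, sigma = data.mean(), data.std()          # "train" on current data
    data = rng.normal(mu, sigma, size=50)        # next generation's training set
    if generation % 10 == 0:
        tail_share = float(np.mean(np.abs(data) > 2.0))
        print(f"generation {generation:2d}: share beyond 2 sigma = {tail_share:.3f}")

# About 4.6 percent of a standard normal distribution lies beyond two
# sigma. If the fitted spread drifts downward, the share printed above
# falls with it, and the rare, far-from-the-norm examples vanish first.
```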

All of this may sound speculative, but AI-generated content is already creeping into the resources that machine-learning specialists use as training material. Consider language models: even traditional news outlets have begun publishing AI-generated content, and some Wikipedia editors want to use language models to generate content for the site.

According to Veniamin Veselovskyy, a Ph.D. student at the Swiss Federal Institute of Technology in Lausanne (EPFL), we seem to be at a turning point at which many of the tools we use to train these models are rapidly becoming saturated with synthetic text.

Warning signs that AI-generated data could seep into model training are coming from other directions as well. Machine-learning engineers have long relied on crowd-work platforms such as Amazon Mechanical Turk to annotate the training data for their models or to evaluate their results. Veselovskyy and his EPFL colleagues asked Mechanical Turk workers to summarize medical research papers and found signs of ChatGPT in almost one third of the summaries.

Granted, summarizing text is a classic language-model task, and the EPFL group’s study, posted on the public preprint server arXiv.org last month, examined only 46 responses from Mechanical Turk workers. Still, the result has put a worry in machine-learning engineers’ heads. ChatGPT makes annotating textual material much easier, and the results are excellent, says Manoel Horta Ribeiro, a graduate student at EPFL. Researchers such as Veselovskyy and Ribeiro have begun considering ways to protect the humanity of crowdsourced data, including tweaks to Mechanical Turk that discourage workers from turning to language models and study designs that encourage more human-generated data.

What is a hapless machine-learning engineer to do in the face of potential model collapse? The answer could be the equivalent of prewar steel in a Geiger counter: data known to be free, or at least as free as possible, of generative AI’s influence. Sarkar, for instance, proposes “standardized” image data sets that would be curated by people who know their contents are exclusively human creations and that would be freely available to developers.

Some engineers might be tempted to dig into the Internet Archive in search of material from before the AI boom, but Shumailov does not believe looking to the past offers a solution. For one thing, he says, there may not be enough historical data to satisfy the demands of growing models. For another, such data is just that: historical, and not necessarily reflective of a changing world.

According to Shumailov, you could not predict today’s news by compiling the news of the previous 100 years: technology has advanced, the language has evolved, and our understanding of the problems has changed.

Perhaps, then, the more tractable problem is to separate human-generated data from synthetic data and filter out the latter. But even if the technology to do so existed, the task would be far from easy. As Sarkar points out, in a world where Adobe Photoshop users can modify photographs with generative AI, is the result an AI-generated image or not?
