A valuable form of AI training data is running out

Google DeepMind researchers have a suggestion for how to solve the AI data drought, and it may include your Social Security number.

The large language models that underpin AI require massive quantities of training data from websites, books, and other sources. When it comes to text, the data on the web deemed fair game for training AI models is being scraped faster than fresh text is being generated.

However, a significant amount of that data goes unused because it is judged harmful or inaccurate, or because it contains personally identifiable information.

In a recently released study, researchers from Google DeepMind say they have discovered a method to clean up this data and make it suitable for training. They say this might be a “powerful tool” for scaling up frontier models.

The concept is known as Generative Data Refinement, or GDR for short. Pretrained generative models rewrite the unusable data, thereby purifying it for safe training. It’s unclear whether Google uses this method for its Gemini models.
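At a conceptual level, the idea is to pass every raw document through a rewriting model instead of filtering documents out wholesale. The paper does not publish an implementation, so the sketch below is purely illustrative: the `refine_corpus` helper and the toy rewriter are assumptions, standing in for what would, in a real system, be a call to a pretrained language model.

```python
from typing import Callable

def refine_corpus(docs: list[str], rewrite: Callable[[str], str]) -> list[str]:
    """Run every raw document through a rewriting model, keeping the
    document in the corpus rather than discarding it."""
    return [rewrite(doc) for doc in docs]

# Toy stand-in for a pretrained generative model: a real GDR system
# would prompt an LLM to rewrite the text; this just masks one number.
def toy_rewriter(doc: str) -> str:
    return doc.replace("555-0100", "[PHONE REMOVED]")

corpus = ["Call Bob at 555-0100 for details.", "Water boils at 100 C."]
print(refine_corpus(corpus, toy_rewriter))
```

The point of the sketch is the shape of the pipeline, not the rewriting itself: both documents survive, and only the problematic span in the first one is changed.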

Many AI labs are leaving useful training data on the table because it is mixed in with problematic data, according to Minqi Jiang, one of the study’s authors, who has since left the company for Meta. For instance, labs frequently throw away entire documents found online because they contain a single piece of problematic information, such as a phone number or an inaccurate fact.

“So, even if it was just one line that contained some personally identifying information, you basically lose all those tokens inside that document,” Jiang explained. Tokens are the units of data an AI model processes; in text, they correspond to words or pieces of words.

One example of raw data the authors give is a person’s Social Security number, or information that will soon become outdated (“the incoming CEO is…”). In such cases, GDR would alter or remove the numbers, drop the soon-to-be-stale information, and keep the rest of the usable data.
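The Social Security number case can be sketched in a few lines. This is a toy stand-in, not the paper’s method: GDR uses a generative model to rewrite text, whereas the regex below only shows the underlying idea, that redacting one span salvages the whole document instead of losing all of its tokens.

```python
import re

# Matches the common SSN layout "123-45-6789"; real PII detection is
# far broader than this single pattern.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def refine_document(text: str) -> str:
    """Replace Social Security numbers so the rest of the text stays
    usable as training data."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)

doc = "Alice Smith, SSN 123-45-6789, filed the report on time."
print(refine_document(doc))
# Alice Smith, SSN [REDACTED-SSN], filed the report on time.
```

Filtering would discard the entire sentence; refinement keeps every token except the nine digits that made it unusable.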

The paper was published only this month, though it was written more than a year ago. A company representative did not answer an inquiry about whether the researchers’ work is being used in Google DeepMind’s AI models.

The authors’ results may prove useful to labs as the well of usable data runs dry. They cite a 2022 study projecting that AI models could absorb all of the text ever produced by humans sometime between 2026 and 2032. That estimate was based on the amount of indexed online data, as measured by Common Crawl, a project that continually scrapes web pages and makes them freely available to AI labs.

For the GDR paper, the researchers ran a proof-of-concept study in which human expert labelers annotated more than one million lines of code, line by line. They then compared those results to the GDR approach.

“It completely crushes the existing industry solutions being used for this kind of stuff,” Jiang stated.

AI labs have been heavily exploring synthetic data, meaning data produced by AI models to train themselves or other models, but the authors argue their approach is superior: training on synthetic data can degrade the quality of a model’s output and, in some cases, lead to “model collapse.”

After comparing GDR-refined data with synthetic data generated by an LLM, the authors found that their method produced a more useful dataset for training AI models.

They also said further testing could be done on other tricky categories of data regarded as unsuitable, such as copyrighted content and personal information that is inferred across numerous documents rather than stated explicitly.

Jiang noted that the paper has not been peer reviewed, which is common in the tech industry, and that all publications undergo internal review at the company.
