At its peak in the early 2000s, Photobucket was the world's leading image-hosting site. With 70 million users and roughly half of the US online photo market, it served as the media backbone for once-popular sites like Myspace and Friendster.
Today, analytics tracker Similarweb counts just 2 million Photobucket users. The generative AI revolution, however, may give the company fresh life.
CEO Ted Leonard, who runs the 40-strong company out of Edwards, Colorado, said he is in talks with several tech companies about licensing Photobucket's 13 billion photos and videos to train generative AI models that produce new material in response to text prompts.
He said he has discussed rates ranging from 5 cents to $1 per photo and more than $1 per video, with prices varying widely depending on the buyer and the kind of imagery sought.
“We’ve spoken to companies that say, ‘we need way more,’” Leonard said — more than his platform holds. One buyer told him they needed over a billion videos.
“You scratch your head and say, where do you get that?”
Photobucket declined to identify its prospective buyers, citing commercial confidentiality. The ongoing talks, which have not been reported before, suggest the company may be sitting on billions of dollars' worth of content, and offer a window into the thriving data market emerging as generative AI technology becomes ever more dominant.
Tech giants including Microsoft, Google, Meta, and OpenAI initially trained generative AI models like ChatGPT, which can mimic human creativity, on vast amounts of data scraped from the internet for free. They maintain the practice is both legal and ethical, though a number of copyright holders have sued them over it.
At the same time, these tech giants are quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long-forgotten personal photos from defunct social media apps.
“At the moment, there is a rush to pursue copyright holders who possess exclusive collections of content that cannot be scraped,” said Edward Klaris of law firm Klaris Law, which says it is advising content owners on deals worth tens of millions of dollars apiece to license archives of images, videos, and books for AI training.
Interviews with more than 30 people with knowledge of AI data deals — including current and former executives at the companies involved, lawyers, and consultants — provide the first in-depth look at this fledgling market, detailing the types of content being purchased, the prices being charged, and emerging concerns about the risk of personal data being incorporated into AI models without people's knowledge or explicit consent.
The AI data market is so opaque — companies often keep their agreements secret — that many large market research firms say they have not even begun to estimate its size. Those that have, such as Business Research Insights, put the market at roughly $2.5 billion today and project it could grow to nearly $30 billion within a decade.
GENERATIVE DATA GOLD RUSH
The data grab comes amid growing pressure on makers of large-scale generative AI “foundation” models to account for the vast volumes of content they feed into their systems — a process known as “training” that requires intense computing power and often takes months to complete.
Tech companies argue that the technology would be unaffordably expensive without access to massive archives of freely scraped web data — such as those made available by the non-profit repository Common Crawl — which they describe as “publicly available.”
Even so, their approach has sparked a wave of copyright lawsuits and regulatory scrutiny, prompting publishers to add technology to their websites to block scraping.
In response, AI model developers have begun hedging their risks and securing their data supply chains, both by striking agreements with content owners and by turning to the growing number of data brokers that have sprung up to meet demand.
For example, in the months after ChatGPT's launch in late 2022, companies including Meta, Google, Amazon, and Apple struck deals with stock image provider Shutterstock to use hundreds of millions of images, videos, and music files in its library for training, according to a person familiar with the arrangements.
Shutterstock Chief Financial Officer Jarrod Yahes said the initial deals with Big Tech companies ranged from $25 million to $50 million apiece, though most were later expanded. He said smaller tech players have followed suit, driving a fresh “flurry of activity” over the past two months.
Yahes declined to discuss specific contracts. The Apple deal and the terms of the other agreements had not previously been disclosed.
Shutterstock rival Freepik said it has struck deals with two large tech companies to license most of its archive of 200 million images at 2 to 4 cents per image. CEO Joaquin Cuenca Abela said five more similar deals are in the works, though he declined to name the buyers.
In addition, at least four news organizations — including The Associated Press and Axel Springer — have signed licensing agreements with OpenAI, an early Shutterstock customer. Thomson Reuters, owner of Reuters News, separately said it had struck deals to license news content for training AI large language models, but gave no specifics.
‘ETHICALLY SOURCED’ CONTENT
Along with securing rights to real-world content such as podcasts, short-form videos, and interactions with digital assistants, a growing number of specialized AI data firms are building networks of short-term contract workers who produce custom visuals and voice samples on demand — a gig economy for data, akin to Uber.
Seattle-based Defined.ai licenses data to a number of companies, including Google, Meta, Apple, Amazon, and Microsoft, according to CEO Daniela Braga.
Braga said that while prices vary by buyer and content type, companies typically pay $1 to $2 per image, $2 to $4 per short-form video, and $100 to $300 per hour for longer films. The market rate for text, she said, is about $0.001 per word.
Nude images, which require the most sensitive handling, fetch $5 to $7, she said.
Defined.ai splits those revenues with the content's suppliers, Braga said. The company markets its datasets as “ethically sourced,” she said, meaning personally identifying information is removed and consent is obtained from the people whose data is included.
One of the company's suppliers, a Brazil-based entrepreneur, said he pays the owners of the images, podcasts, and medical data he sources 20% to 30% of the total deal value.
The most expensive images in his portfolio, said the supplier — who asked that his company not be identified, citing commercial sensitivity — are those used to train AI systems that block content such as the graphic violence barred by tech companies.
To meet that demand, he sources images of crime scenes, conflict violence, and surgical procedures — mainly from law enforcement, freelance photojournalists, and medical students, respectively. The photos often come from South America and Africa, where distributing graphic imagery is more common, he said.
He said he has received photos from freelance photographers in Gaza since the war there began in October, as well as some from Israel at the outbreak of hostilities.
Because the images are disturbing to untrained eyes, he said, his company hires nurses accustomed to seeing violent injuries to anonymize and annotate them.
‘I WOULD FIND IT RISKY’
Many of the industry players interviewed said that while licensing can address some legal and ethical concerns, reviving the archives of long-gone internet brands like Photobucket to feed the newest AI models raises other problems, particularly around user privacy.
AI systems have been caught reproducing exact copies of their training data — spitting out verbatim text from New York Times articles, images of real people, and the Getty Images watermark, among other examples. That means a person's private photos or intimate thoughts from decades ago could end up in generative AI outputs without notice or express consent.
Photobucket CEO Leonard argues he is on firm legal ground, citing an October update to the site's terms of service that grants the company the “unrestricted right” to sell any uploaded content for the purpose of training AI systems. He views licensing data as an alternative to selling ads.
He stated, “We have bills to pay, and this might allow us to keep supporting free accounts.”
Braga of Defined.ai said she prefers to source social media photos from the influencers who took them, since they have a clearer claim to licensing rights than “platform” companies like Photobucket.
“I would find it very risky,” Braga said of platform content. If an AI produces something resembling a photo of a person who never gave consent, she added, that is a problem.
Photobucket is not the only platform exploring licensing. Tumblr parent Automattic said last month it was sharing content with “selected AI companies.” In February, Reddit and Google reached an agreement under which Reddit's content will be made available for training Google's AI models.
Ahead of its March IPO, Reddit disclosed that the Federal Trade Commission is examining its data-licensing business, and warned that the practice could run afoul of evolving intellectual property and privacy laws.
The FTC warned companies in February against retroactively changing their terms of service for AI use. It declined to comment on the Reddit inquiry or to say whether it was examining other deals involving training data.