Stack Overflow Plans to Charge AI Giants for Training Data

April 21, 2023

It costs hundreds of millions of dollars to develop the AI systems that power products like ChatGPT and the DALL-E image generator, and the cost is only going up.

OpenAI, Google, and other companies building large-scale AI projects have often obtained much of their training data for free by scraping the web. But according to CEO Prashanth Chandrasekar, Stack Overflow, the well-known online community that offers help with computer programming, plans to start charging large AI developers for access to the 50 million questions and answers on its service as early as the middle of this year. The site has more than 20 million registered users.

Stack Overflow’s decision to demand payment from businesses that use its data, part of a larger generative AI plan, has not been previously disclosed. It follows Reddit’s announcement this week that, starting in June, it will begin charging some AI developers for access to its own content.

The two community websites are not the only ones that want a piece of the pie. The News/Media Alliance, a US trade association of publishers that includes Condé Nast, the owner of WIRED, released a set of principles today urging generative AI developers to discuss with publishers any use of their data for training and other purposes, and to respect publishers’ right to fair compensation.

According to independent evaluations and their own disclosures, Meta, Google, and OpenAI (the company that created ChatGPT) all built AI systems using data sets that collected information from a variety of online sources, including Stack Overflow and Reddit. Feeding text from online chitchat or informed debates about programming into machine learning algorithms known as large language models, or LLMs, can make AI text generators and chatbots more proficient and better informed. One of the major prospects for the technology is the use of LLMs to write computer code; Microsoft charges as much as $19 per user per month for its code generator, GitHub Copilot.

Stack Overflow’s Chandrasekar says community platforms that support LLMs should unquestionably be compensated for their contributions, so that businesses like his can continue to invest in these communities and help them flourish. He says Stack Overflow strongly supports Reddit’s approach.

Chandrasekar says the opportunity for additional revenue is essential to ensuring that Stack Overflow can continue to attract users and maintain high-quality content. Future chatbots would benefit too, he argues, since they need to be “trained on something that is progressing knowledge forward”; they require new knowledge to be created. Fencing off valuable data, however, could also discourage some AI research and slow the development of LLMs, which themselves pose a threat to any service that people use for communication and information. Proper licensing will only accelerate the creation of high-quality LLMs, according to Chandrasekar.

AI developers all want to reduce the significant expense of building complex AI systems, which require heavy investment in costly computing infrastructure. Having to pay for data they previously obtained for free could delay the point at which these developing technologies become profitable. OpenAI did not respond to a request for comment, and neither Meta nor Google had any immediate comment.

Large language models generate strings of text using word patterns they have learned from books, websites, and other text sources used as training data. In addition to ChatGPT, these programs are the brains behind search chatbots like Microsoft’s Bing Chat and Google’s Bard. They also power a growing number of applications that generate polished, original material in a matter of seconds. Their counterparts that produce AI-composed images and video draw on patterns from image databases, including pictures obtained from Pinterest and Flickr.

Data sets used in AI development are often created through questionable methods, such as deploying software that extracts content from websites. The practice is widely thought to be lawful in the US, although copyright concerns and website terms of use that prohibit it have made it contentious.

Some websites, such as Reddit and Stack Overflow, have been more accommodating. They provide downloadable “data dumps” or real-time data portals, known as APIs, that make it easier for software to access their content. According to Chandrasekar, LLM developers are accessing Stack Overflow’s data through a combination of APIs, scraping, and data dumps, all of which are currently available at no cost.

However, Chandrasekar claims that LLM developers are violating Stack Overflow’s terms of service. Under those terms, users own the content they post, but all of it is covered by a Creative Commons license that requires anyone using the content to cite its source. According to Chandrasekar, AI businesses violate the Creative Commons license when they sell their models to customers without properly crediting the community members whose questions and answers were used to train the model.

Neither Reddit nor Stack Overflow has disclosed its prices. Reddit spokesperson Tim Rathschmidt says, “We’re working on that as we speak, and will share more with partners in the coming weeks.” According to Chandrasekar, Stack Overflow will study Reddit’s business model and talk with some of the prospective customers who have already inquired about data access.

Elon Musk, who raised prices last month for access to Twitter data, may provide a pricing example. Access to 50 million tweets starts at $42,000 a month. Previously, about three times as many tweets were accessible for free. In a tweet this week, Musk accused Microsoft, a significant AI developer and a close ally of OpenAI, of “illegally using Twitter data” to train algorithms. He added, “Lawsuit time,” without giving further details.

Reddit CEO Steve Huffman said this week that he did not want to give the biggest companies in the world a free ride. Crawling Reddit, generating value from it, and not returning any of that value to its users is a problem, he said.

As expectations soar that ChatGPT-style bots and other products based on LLMs will generate enormous revenues, other businesses that hold stockpiles of the content needed to train machine learning algorithms want to be paid as well. Some news publishers are on alert over how Microsoft’s new Bing chatbot handles their content.

So far, though, only a few agreements on access to training data have been publicly disclosed, such as OpenAI’s deal to license content from photo bank Shutterstock. Stability AI, a competitor of OpenAI, is being sued by Shutterstock rival Getty Images for allegedly using more than 12 million images without permission. The AI startup must respond in US federal court next Tuesday.

For now, AI developers are not under severe financial pressure. Some businesses that hold substantial amounts of academic material or everyday conversation say they have no plans to start charging for their APIs or related data portals. According to spokesman David Knutson, PLOS, a publisher of scientific research whose content has been used in AI training, is “not likely” to change its relatively open terms of use. Discord, an online community platform, has no plans to change its free API services, which are offered under terms that prohibit AI training, says spokesperson Swaleha Carlson.

Charging for its API is one element of Stack Overflow’s larger AI strategy, which the company expects to announce in a few months. Stack Overflow has roughly 600 employees, and about 10% of them are working on the effort, which includes building its own generative AI services. An assistant feature, for instance, might guide users as they write questions to post.

So far, the Stack Overflow community’s main response has been to forbid members from posting AI-generated content. According to Chandrasekar, a surge in incorrect responses following the launch of ChatGPT posed a challenge for the site’s several hundred moderators.

Stack Overflow, which was founded in 2008, makes nearly equal amounts of money from selling advertisements and from selling subscription-based Q&A software to more than 1,200 businesses for internal use. The most recent data available shows that the company’s sales rose 33% from the same period a year earlier, to $45 million, in the six months ending September 30, 2022. During that time, an average of 200,000 new members signed up per month.

If Stack Overflow succeeds in licensing to AI developers the questions and answers that those users create for free, the users may reasonably demand payment of their own. As Chandrasekar puts it, “There’s definitely thought going into how we’re going to take care of our community members and the people who make the site what it is today in the context of what’s happening here.”
