Stack Overflow Will Charge AI Giants for Training Data

A potential roadmap to pricing could come from Elon Musk, who this month hiked prices for access to Twitter data. Prices now start at $42,000 per month for access to 50 million tweets. About three times that volume had previously been available for free. In a tweet this week, Musk accused Microsoft, a major AI developer and close partner of OpenAI, of training algorithms “illegally using Twitter data.” Without elaboration, he added, “Lawsuit time.”

Both Stack Overflow and Reddit will continue to license data for free to some people and companies. Chandrasekar says Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes. “When people start charging for products that are built on community-built sites like ours, that’s where it’s not fair use,” he says.

Reddit CEO Steve Huffman told The New York Times this week that he didn’t want to give a freebie to the world’s largest companies. “Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with,” he said.

As expectations surge that ChatGPT-style bots and other products built on LLMs will reap huge profits, other companies with stocks of content needed to train machine learning algorithms also want to be paid. Some news publishers have been wary of how Microsoft’s new Bing chatbot handles their content.

But so far only a few public deals over access to training data have been announced, such as photo bank Shutterstock agreeing to license content to OpenAI. Its rival Getty Images is suing Stability AI, an OpenAI competitor, for not seeking a license before allegedly using over 12 million photos. The AI startup’s response is due in US federal court next week.

AI developers are not under all-out pressure to pay yet. Some companies with large volumes of academic text or casual conversations say they have no plans to start charging for their APIs or similar data portals. PLOS, a publisher of scientific research whose content has been leveraged in AI training, is “not likely” to change its fairly unrestrictive terms of use, spokesperson David Knutson says. Online community platform Discord has no plans to modify its API offerings, which are free and provided under terms that forbid AI training, says spokesperson Swaleha Carlson.

At Stack Overflow, charging for its API is just one part of a broader AI strategy that the company expects to unveil in a few months. About 10 percent of Stack Overflow’s nearly 600 staff are focused on the initiative, which includes developing its own generative AI services. For example, an assistant function could help guide people as they compose questions to post.

To date, the Stack Overflow community’s primary action has been to ban users from posting AI-generated responses. Chandrasekar says a spike in inaccurate answers following the release of ChatGPT created a challenge for the company’s several hundred moderators.

Launched in 2008, Stack Overflow generates about equal parts of its revenue from selling ads and licensing Q&A software as a subscription to more than 1,200 organizations for internal use. The company’s sales grew 33 percent to $45 million during the six months ended September 30, 2022, the most recent data available, compared with the year-earlier period. About 200,000 new users registered on average each month during that span.

Those users could reasonably clamor for their own compensation if Stack Overflow succeeds in licensing to AI makers the questions and answers they write for free. Chandrasekar says, “There’s absolutely thought going into how best to make sure that our community members and the people that make the site what it is today—how we are going to take care of them in the context of what’s happening here.”