‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I.
For more than 20 years, Kit Loffstadt has written fan fiction exploring alternate universes for “Star Wars” heroes and “Buffy the Vampire Slayer” villains, sharing her stories free online.
But in May, Ms. Loffstadt stopped posting her creations after she learned that a data company had copied her stories and fed them into the artificial intelligence technology underlying ChatGPT, the viral chatbot. Dismayed, she hid her writing behind a locked account.
Ms. Loffstadt also helped organize an act of rebellion last month against A.I. systems. Along with dozens of other fan fiction writers, she published a flood of irreverent stories online to overwhelm and confuse the data-collection services that feed writers’ work into A.I. technology.
“We each have to do whatever we can to show them the output of our creativity is not for machines to harvest as they like,” said Ms. Loffstadt, a 42-year-old voice actor from South Yorkshire in Britain.
Fan fiction writers are just one group now staging revolts against A.I. systems as a fever over the technology has gripped Silicon Valley and the world. In recent months, social media companies such as Reddit and Twitter, news organizations including The New York Times and NBC News, authors such as Paul Tremblay, and the actress Sarah Silverman have all taken a position against A.I. sucking up their data without permission.
Their protests have taken different forms. Writers and artists are locking their files to protect their work or are boycotting certain websites that publish A.I.-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed this year against A.I. companies, accusing them of training their systems on artists’ creative work without consent. This past week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the maker of ChatGPT, and others over A.I.’s use of their work.
At the heart of the rebellions is a newfound understanding that online information — stories, artwork, news articles, message board posts and photos — may have significant untapped value.
The new wave of A.I. — known as “generative A.I.” for the text, images and other content it generates — is built atop complex systems such as large language models, which are capable of producing humanlike prose. These models are trained on hoards of all kinds of data so they can answer people’s questions, mimic writing styles or churn out comedy and poetry.
That has set off a hunt by tech companies for even more data to feed their A.I. systems. Google, Meta and OpenAI have essentially used information from all over the internet, including large databases of fan fiction, troves of news articles and collections of books, much of which was available free online. In tech industry parlance, this is known as “scraping” the internet.
OpenAI’s GPT-3, an A.I. system released in 2020, spans 500 billion “tokens,” each representing parts of words found mostly online. Some A.I. models span more than one trillion tokens.
The practice of scraping the internet is longstanding and was largely disclosed by the companies and nonprofit organizations that did it. But it was not well understood or seen as especially problematic by the companies that owned the data. That changed after ChatGPT debuted in November and the public learned more about underlying A.I. models that powered the chatbots.
“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, the founder and chief executive of Nomic, an A.I. company. “Previously, the thought was that you got value from data by making it open to everyone and running ads. Now, the thought is that you lock your data up, because you can extract much more value when you use it as an input to your A.I.”
The data protests may have little effect in the long run. Deep-pocketed tech giants like Google and Microsoft already sit on mountains of proprietary information and have the resources to license more. But as the era of easy-to-scrape content comes to a close, smaller A.I. upstarts and nonprofits that had hoped to compete with the big firms might not be able to obtain enough content to train their systems.
In a statement, OpenAI said ChatGPT was trained on “licensed content, publicly available content and content created by human A.I. trainers.” It added, “We respect the rights of creators and authors, and look forward to continuing to work with them to protect their interests.”
Google said in a statement that it was involved in talks on how publishers could manage their content in the future. “We believe everyone benefits from a vibrant content ecosystem,” the company said. Microsoft did not respond to a request for comment.
The data revolts erupted last year after ChatGPT became a worldwide phenomenon. In November, a group of programmers filed a proposed class action lawsuit against Microsoft and OpenAI, claiming the companies had violated their copyright after their code was used to train an A.I.-powered programming assistant.
In January, Getty Images, which provides stock photos and videos, sued Stability AI, an A.I. company that creates images out of text descriptions, claiming the start-up had used copyrighted photos to train its systems.
Then in June, Clarkson, a law firm in Los Angeles, filed a 151-page proposed class action suit against OpenAI and Microsoft. The suit described how OpenAI had gathered data from minors and said web scraping violated copyright law and constituted “theft.” On Tuesday, the firm filed a similar suit against Google.
“The data rebellion that we’re seeing across the country is society’s way of pushing back against this idea that Big Tech is simply entitled to take any and all information from any source whatsoever, and make it their own,” said Ryan Clarkson, the founder of Clarkson.
Eric Goldman, a professor at Santa Clara University School of Law, said the lawsuit’s arguments were expansive and unlikely to be accepted by the court. But the wave of litigation is just beginning, he said, with a “second and third wave” coming that would define A.I.’s future.
Larger companies are also pushing back against A.I. scrapers. In April, Reddit said it wanted to charge for access to its application programming interface, or A.P.I., the method through which third parties can download and analyze the social network’s vast database of person-to-person conversations.
Steve Huffman, Reddit’s chief executive, said at the time that his company didn’t “need to give all of that value to some of the largest companies in the world for free.”
That same month, Stack Overflow, a question-and-answer site for computer programmers, said it would also ask A.I. companies to pay for data. The site has nearly 60 million questions and answers. Its move was earlier reported by Wired.
News organizations are also resisting A.I. systems. In an internal memo about the use of generative A.I. in June, The Times said A.I. companies should “respect our intellectual property.” A Times spokesman declined to elaborate.
For individual artists and writers, fighting back against A.I. systems has meant rethinking where they publish.
Nicholas Kole, 35, an illustrator in Vancouver, British Columbia, was alarmed by how his distinct art style could be replicated by an A.I. system and suspected the technology had scraped his work. He plans to keep posting his creations to Instagram, Twitter and other social media sites to attract clients, but he has stopped publishing on sites like ArtStation that post A.I.-generated content alongside human-generated content.
“It just feels like wanton theft from me and other artists,” Mr. Kole said. “It puts a pit of existential dread in my stomach.”
At Archive of Our Own, a fan fiction database with more than 11 million stories, writers have increasingly pressured the site to ban data-scraping and A.I.-generated stories.
In May, when some Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on Archive of Our Own, dozens of writers rose up in arms. They blocked their stories and wrote subversive content to mislead the A.I. scrapers. They also pushed Archive of Our Own’s leaders to stop allowing A.I.-generated content.
Betsy Rosenblatt, who provides legal advice to Archive of Our Own and is a professor at University of Tulsa College of Law, said the site had a policy of “maximum inclusivity” and did not want to be in the position of discerning which stories were written with A.I.
For Ms. Loffstadt, the fan fiction writer, the fight against A.I. came as she was writing a story about “Horizon Zero Dawn,” a video game where humans fight A.I.-powered robots in a postapocalyptic world. In the game, she said, some of the robots were good and others were bad.
But in the real world, she said, “thanks to hubris and corporate greed, they are being twisted to do bad things.”