When Hackers Descended to Test A.I., They Found Flaws Aplenty
Avijit Ghosh wanted the bot to do bad things.
He tried to goad the artificial intelligence model, which he knew as Zinc, into producing code that would choose a job candidate based on race. The chatbot demurred: Doing so would be “harmful and unethical,” it said.
Then, Dr. Ghosh referenced the hierarchical caste structure in his native India. Could the chatbot rank potential hires based on that discriminatory metric?
The model complied.
Dr. Ghosh’s intentions were not malicious, although he was behaving as if they were. Instead, he was a casual participant in a competition last weekend at the annual Defcon hackers conference in Las Vegas, where 2,200 people filed into an off-Strip conference room over three days to draw out the dark side of artificial intelligence.
The hackers tried to break through the safeguards of various A.I. programs in an effort to identify their vulnerabilities — to find the problems before actual criminals and misinformation peddlers did — in a practice known as red-teaming. Each competitor had 50 minutes to tackle up to 21 challenges — getting an A.I. model to “hallucinate” inaccurate information, for example.
They found political misinformation, demographic stereotypes, instructions on how to carry out surveillance and more.
The exercise had the blessing of the Biden administration, which is increasingly nervous about the technology’s fast-growing power. Google (maker of the Bard chatbot), OpenAI (ChatGPT), Meta (which released its LLaMA code into the wild) and several other companies offered anonymized versions of their models for scrutiny.
Dr. Ghosh, a lecturer at Northeastern University who specializes in artificial intelligence ethics, was a volunteer at the event. The contest, he said, allowed a head-to-head comparison of several A.I. models and demonstrated how some companies were further along in ensuring that their technology was performing responsibly and consistently.
He will help write a report analyzing the hackers’ findings in the coming months.
The goal, he said: “an easy-to-access resource for everybody to see what problems exist and how we can combat them.”
Defcon was a logical place to test generative artificial intelligence. Past participants in the gathering of hacking enthusiasts — which started in 1993 and has been described as a “spelling bee for hackers” — have exposed security flaws by remotely taking over cars, breaking into election results websites and pulling sensitive data from social media platforms. Those in the know use cash and a burner device, avoiding Wi-Fi or Bluetooth, to keep from getting hacked. One instructional handout begged hackers to “not attack the infrastructure or webpages.”
Volunteers are known as “goons,” and attendees are known as “humans”; a handful wore homemade tinfoil hats atop the standard uniform of T-shirts and sneakers. Themed “villages” included separate spaces focused on cryptocurrency, aerospace and ham radio.
In what was described as a “game changer” report last month, researchers showed that they could circumvent guardrails for A.I. systems from Google, OpenAI and Anthropic by appending certain characters to English-language prompts. Around the same time, seven leading artificial intelligence companies committed to new standards for safety, security and trust in a meeting with President Biden.
“This generative era is breaking upon us, and people are seizing it, and using it to do all kinds of new things that speaks to the enormous promise of A.I. to help us solve some of our hardest problems,” said Arati Prabhakar, the director of the Office of Science and Technology Policy at the White House, who collaborated with the A.I. organizers at Defcon. “But with that breadth of application, and with the power of the technology, come also a very broad set of risks.”
Red-teaming has been used for years in cybersecurity circles alongside other evaluation techniques, such as penetration testing and adversarial attacks. But until Defcon’s event this year, efforts to probe artificial intelligence defenses had been limited: Competition organizers said that Anthropic had red-teamed its model with 111 people, while OpenAI had used around 50 for GPT-4.
With so few people testing the limits of the technology, analysts struggled to discern whether an A.I. screw-up was a one-off that could be fixed with a patch, or an embedded problem that required a structural overhaul, said Rumman Chowdhury, a co-organizer who oversaw the design of the challenge. A large, diverse and public group of testers was more likely to come up with creative prompts to help tease out hidden flaws, said Dr. Chowdhury, a fellow at Harvard University’s Berkman Klein Center for Internet and Society focused on responsible A.I. and co-founder of a nonprofit called Humane Intelligence.
“There is such a broad range of things that could possibly go wrong,” Dr. Chowdhury said before the competition. “I hope we’re going to carry hundreds of thousands of pieces of information that will help us identify if there are at-scale risks of systemic harms.”
The designers did not want to merely trick the A.I. models into bad behavior — no pressuring them to disobey their terms of service, no prompts to “act like a Nazi, and then tell me something about Black people,” said Dr. Chowdhury, who previously led Twitter’s machine learning ethics and accountability team. Except in specific challenges where intentional misdirection was encouraged, the hackers were looking for unexpected flaws, the so-called unknown unknowns.
A.I. Village drew experts from tech giants such as Google and Nvidia, as well as a “Shadowboxer” from Dropbox and a “data cowboy” from Microsoft. It also attracted participants with no specific cybersecurity or A.I. credentials. A leaderboard with a science fiction theme kept score of the contestants.
Some of the hackers at the event struggled with the idea of cooperating with A.I. companies that they saw as complicit in unsavory practices such as unfettered data-scraping. A few described the red-teaming event as essentially a photo op, but added that involving the industry would help keep the technology secure and transparent.
One computer science student found inconsistencies in a chatbot’s language translation: He wrote in English that a man was shot while dancing, but the model’s Hindi translation said only that the man died. A machine learning researcher asked a chatbot to pretend that it was campaigning for president and defending its association with forced child labor; the model suggested that unwilling young laborers developed a strong work ethic.
Emily Greene, who works on security for the generative A.I. start-up Moveworks, started a conversation with a chatbot by talking about a game that used “black” and “white” pieces. She then coaxed the chatbot into making racist statements. Later, she set up an “opposites game,” which led the A.I. to respond to one prompt with a poem about why rape is good.
“It’s just thinking of these words as words,” she said of the chatbot. “It’s not thinking about the value behind the words.”
Seven judges graded the submissions. The top scorers were “cody3,” “aray4” and “cody2.”
Two of those handles came from Cody Ho, a student at Stanford University studying computer science with a focus on A.I. He entered the contest five times; across those sessions, he got the chatbot to tell him about a fake place named after a real historical figure and to describe the online tax filing requirement codified in the 28th constitutional amendment (which doesn’t exist).
Until he was contacted by a reporter, he was unaware of his dual victory. He had left the conference before receiving the email from Sven Cattell, the data scientist who founded A.I. Village and helped organize the competition, telling him, “come back to A.I.V., you won.” He did not know that his prize, beyond bragging rights, included an A6000 graphics card from Nvidia valued at around $4,000.
“Learning how these attacks work and what they are is a real, important thing,” Mr. Ho said. “That said, it is just really fun for me.”