China has a new plan for judging the safety of generative AI—and it’s packed with details

1 year ago admin

This story first appeared in China Report, MIT Technology Review’s newsletter about technology in China. Sign up to receive it in your inbox every Tuesday. Ever since the Chinese government passed a law on generative AI back in July, I’ve been wondering how exactly China’s censorship machine would adapt for the AI era. The content produced by…

Last week we got some clarity about what all this may look like in practice.

On October 11, a Chinese government organization called the National Information Security Standardization Technical Committee released a draft document that proposed detailed rules for how to determine whether a generative AI model is problematic. Often abbreviated as TC260, the committee consults corporate representatives, academics, and regulators to set up tech industry rules on issues ranging from cybersecurity to privacy to IT infrastructure.

Unlike many manifestos you may have seen about how to regulate AI, this standards document is very detailed: it sets clear criteria for when a data source should be banned from training generative AI, and it gives metrics on the exact number of keywords and sample questions that should be prepared to test out a model.

Matt Sheehan, a global technology fellow at the Carnegie Endowment for International Peace who flagged the document for me, said that when he first read it, he “felt like it was the most grounded and specific document related to the generative AI regulation.” He added, “This essentially gives companies a rubric or a playbook for how to comply with the generative AI regulations that have a lot of vague requirements.”

It also clarifies what companies should consider a “safety risk” in AI models, since Beijing is trying to get rid of both universal concerns, like algorithmic biases, and content that’s only sensitive in the Chinese context. “It’s an adaptation to the already very sophisticated censorship infrastructure,” he said.

So what do these specific rules look like?

On training: All AI foundation models are currently trained on many corpora (text and image databases), some of which have biases and unmoderated content. The TC260 standards demand that companies not only diversify the corpora (mixing languages and formats) but also assess the quality of all their training materials.

How? Companies should randomly sample 4,000 “pieces of data” from one source. If over 5% of the data is considered “illegal and negative information,” this corpus should be blacklisted for future training.
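To make the arithmetic concrete, here is a minimal Python sketch of that sampling check. It is an illustration under stated assumptions, not anything taken from the TC260 draft: the function name `should_blacklist`, the `is_flagged` callback (a stand-in for whatever human or automated review a company uses), and the handling of sources with fewer than 4,000 items are all hypothetical.

```python
import random
from typing import Callable, Optional, Sequence

SAMPLE_SIZE = 4000        # "pieces of data" sampled per source, per the draft
MAX_FLAGGED_SHARE = 0.05  # a source is blacklisted if MORE than 5% is flagged


def should_blacklist(
    corpus: Sequence[str],
    is_flagged: Callable[[str], bool],
    sample_size: int = SAMPLE_SIZE,
    threshold: float = MAX_FLAGGED_SHARE,
    seed: Optional[int] = None,
) -> bool:
    """Return True if a random sample of the corpus exceeds the flagged threshold.

    Assumption: if the source holds fewer items than sample_size, every item
    is reviewed. The article does not say how the draft treats that case.
    """
    if not corpus:
        raise ValueError("corpus is empty; nothing to sample")

    rng = random.Random(seed)
    k = min(sample_size, len(corpus))
    sample = rng.sample(corpus, k)  # sampling without replacement

    # is_flagged is a hypothetical reviewer: True means the item counts as
    # "illegal and negative information".
    flagged = sum(1 for item in sample if is_flagged(item))
    return flagged / k > threshold
```

With a full sample of 4,000 items, 200 flagged items (exactly 5%) would pass, while 201 would put the source on the blacklist, because the rule as described applies only when the share is over 5%.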
