GPT-4o’s Chinese token-training data is polluted by spam and porn websites

6 months ago admin

Soon after OpenAI released GPT-4o on Monday, May 13, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases. On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in…

The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
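As a rough illustration of that kind of count, the sketch below groups token strings by writing script using only Python's standard library. It is not Das's method, which the article does not describe in detail: the function names and the sample vocabulary are hypothetical, and script is only a proxy for language, since Vietnamese and English tokens, for instance, both come out as Latin.

```python
import unicodedata
from collections import Counter


def dominant_script(token: str) -> str:
    """Return a rough script label for a token, based on Unicode character names."""
    counts = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        # The first word of a character name is usually its script,
        # e.g. "CYRILLIC SMALL LETTER PE" -> "CYRILLIC".
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    if not counts:
        return "OTHER"
    return counts.most_common(1)[0][0]


def script_counts(vocab: list[str]) -> Counter:
    """Count how many tokens in the vocabulary fall under each script label."""
    return Counter(dominant_script(token) for token in vocab)


# Hypothetical stand-in for a real vocabulary (assumption, not actual GPT-4o tokens).
sample_vocab = ["hello", " world", " the", "привет", "مرحبا", "中文", "123"]

for script, count in script_counts(sample_vocab).most_common():
    print(f"{script}: {count}")
```

Separating languages that share a script would need an actual language-identification step, which this sketch leaves out.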

“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
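The arithmetic behind that claim can be sketched as follows, assuming usage is billed per token at a fixed price. The token counts and the price below are made-up illustrative numbers, not figures from OpenAI or from Das.

```python
def request_cost(num_tokens: int, price_per_million_tokens: float) -> float:
    """Cost of a request when billing is proportional to token count."""
    return num_tokens * price_per_million_tokens / 1_000_000


# Hypothetical numbers: the same non-English text split into tokens
# by an older tokenizer and by a newer one with longer tokens.
old_tokens = 4000
new_tokens = 1000
price = 5.0  # assumed price in dollars per million tokens

old_cost = request_cost(old_tokens, price)
new_cost = request_cost(new_tokens, price)
reduction = old_cost / new_cost

print(f"old: ${old_cost:.4f}, new: ${new_cost:.4f}, reduction: {reduction:.1f}x")
```

At a fixed per-token price, the cost ratio is simply the ratio of token counts, so a text that needs a quarter as many tokens costs a quarter as much.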

Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “university,” and “international” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.

That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

Polluted data and a lack of cleaning

However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.
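An inspection of this kind can be sketched in a few lines: keep the tokens made up entirely of Chinese characters, then sort them by length. This is a minimal sketch, not the researchers' actual script. The helper names are hypothetical, the sample vocabulary is a harmless stand-in rather than real GPT-4o tokens, and the character test covers only the basic CJK Unified Ideographs block.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def is_chinese_token(token: str) -> bool:
    """True if the token, ignoring surrounding whitespace, is all CJK ideographs."""
    text = token.strip()
    return bool(text) and all(is_cjk_ideograph(ch) for ch in text)


def longest_chinese_tokens(vocab: list[str], top_n: int = 10) -> list[str]:
    """Return the top_n longest Chinese tokens, longest first."""
    chinese = [token for token in vocab if is_chinese_token(token)]
    return sorted(chinese, key=len, reverse=True)[:top_n]


# Hypothetical stand-in for a real vocabulary (assumption, not actual GPT-4o tokens).
sample_vocab = ["中国", "人工智能", "hello", "大学", "中华人民共和国", " the"]

for token in longest_chinese_tokens(sample_vocab, top_n=3):
    print(len(token), token)
```

To run this on the real vocabulary, the token strings would have to be loaded first, for example through OpenAI's open-source tiktoken library; that step is left out here.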

“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It is not rare for a language model to crawl spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.

These messages are often advertisements for pornographic videos and gambling websites. They could come from real businesses or be outright scams. The text is inserted into content-farm websites, or sometimes legitimate ones, so that it gets indexed by search engines, slips past spam filters, and comes up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website that lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.
