Large language models aren’t people. Let’s stop testing them as if they were.
Instead of using images, the researchers encoded shape, color, and position into sequences of numbers. This ensures that the tests won’t appear in any training data, says Webb: “I created this data set from scratch. I’ve never heard of anything like it.”
Mitchell is impressed by Webb’s work. “I found this paper quite interesting and provocative,” she says. “It’s a well-done study.” But she has reservations. Mitchell has developed her own analogical reasoning test, called ConceptARC, which uses encoded sequences of shapes taken from the ARC (Abstraction and Reasoning Corpus) data set developed by Google researcher François Chollet. In Mitchell’s experiments, GPT-4 scores worse than people on such tests.
Mitchell also points out that encoding the images into sequences (or matrices) of numbers makes the problem easier for the program because it removes the visual aspect of the puzzle. “Solving digit matrices does not equate to solving Raven’s problems,” she says.
Brittle tests
The performance of large language models is brittle. Among people, it is safe to assume that someone who scores well on a test would also do well on a similar test. That’s not the case with large language models: a small tweak to a test can drop an A grade to an F.
“In general, AI evaluation has not been done in such a way as to allow us to actually understand what capabilities these models have,” says Lucy Cheke, a psychologist at the University of Cambridge, UK. “It’s perfectly reasonable to test how well a system does at a particular task, but it’s not useful to take that task and make claims about general abilities.”
Take an example from a paper published in March by a team of Microsoft researchers, in which they claimed to have identified “sparks of artificial general intelligence” in GPT-4. The team assessed the large language model using a range of tests. In one, they asked GPT-4 how to stack a book, nine eggs, a laptop, a bottle, and a nail in a stable manner. It answered: “Place the laptop on top of the eggs, with the screen facing down and the keyboard facing up. The laptop will fit snugly within the boundaries of the book and the eggs, and its flat and rigid surface will provide a stable platform for the next layer.”
Not bad. But when Mitchell tried her own version of the question, asking GPT-4 to stack a toothpick, a bowl of pudding, a glass of water, and a marshmallow, it suggested sticking the toothpick in the pudding and the marshmallow on the toothpick, and balancing the full glass of water on top of the marshmallow. (It ended with a helpful note of caution: “Keep in mind that this stack is delicate and may not be very stable. Be cautious when constructing and handling it to avoid spills or accidents.”)