The New ChatGPT Can ‘See’ and ‘Talk.’ Here’s What It’s Like.
ChatGPT — viral artificial intelligence sensation, slayer of boring office work, sworn enemy of high school teachers and Hollywood screenwriters alike — is getting some new powers.
On Monday, ChatGPT’s maker, OpenAI, announced that it was giving the popular chatbot the ability to “see, hear and speak” with two new features.
The first is an update that allows ChatGPT to analyze and respond to images. You can upload a photo of a bike, for example, and receive instructions about how to lower the seat, or get recipe suggestions based on a photo of the contents of your refrigerator.
The second is a feature that allows users to speak to ChatGPT and get responses delivered in a synthetic A.I. voice, the way you might talk with Siri or Alexa.
These features are part of an industrywide push toward so-called multimodal A.I. systems that can handle text, photos, videos and whatever else a user might decide to throw at them. The ultimate goal, according to some researchers, is to create an A.I. capable of processing information in all the ways a human can.
Most users don’t have access to the new features yet. OpenAI is offering them first to paying ChatGPT Plus and Enterprise customers over the next few weeks, and will make them more widely available after that. (The vision feature will work on both desktop and mobile, while the speech feature will be available only through ChatGPT’s iOS and Android apps.)
I got early access to the new ChatGPT for a hands-on test. Here’s what I found.
The A.I. Will See You Now
I started by trying ChatGPT’s image-recognition feature on some household objects.
“What’s this thing I found in my junk drawer?” I asked, after uploading a photo of a mysterious piece of blue silicone with five holes in it.
“The object appears to be a silicone holder or grip, often used for holding multiple items together,” ChatGPT responded. (Close enough — it’s a finger strengthener I used years ago while recovering from a hand injury.)
I then fed ChatGPT a few photos of items I had been meaning to sell on Facebook Marketplace, and asked it to write listings for each one. It nailed both the objects and the listings, describing my retro-styled Frigidaire mini-fridge as “perfect for those who appreciate a touch of yesteryear in their modern-day homes.”
The new ChatGPT can also analyze text within images. I took a picture of the front page of Sunday’s print edition of The New York Times and asked the bot to summarize it. It did decently well, describing all five articles on the front page in a few sentences each — although it made at least one mistake, inventing a statistic about fentanyl-related deaths that wasn’t in the original article.
ChatGPT’s eyes aren’t perfect. It flopped when I asked it to solve a crossword puzzle. It mistook my child’s stuffed dinosaur toy for a whale. And when I asked for help turning one of those wordless furniture-assembly diagrams into a step-by-step list of instructions, it gave me a jumbled list of parts, most of which were wrong.
The biggest limitation of ChatGPT’s vision feature is that it refuses to answer most questions about photos of human faces. This is by design. OpenAI told me that it didn’t want to enable facial recognition or other creepy uses, and that it didn’t want the app spitting out biased or offensive answers to prompts about people’s physical appearance.
But even without faces, it’s easy to imagine tons of ways an A.I. chatbot capable of processing visual information could be useful, especially as the technology improves. Gardeners and foragers could use it to identify plants in the wild. Exercise buffs could use it to create personalized workout plans, just by snapping a photo of the equipment in their gym. Students could use it to solve visual math and science problems, and visually impaired people could use it to navigate the world more easily.
Frankly, I have no idea how many people will use this feature, or what its killer applications will turn out to be. As is often the case with new A.I. tools, we’ll just have to wait and see.
Siri on Steroids
Now, let’s talk about what I consider the more impressive of the two additions: ChatGPT’s new voice feature, which allows users to talk to the app and receive spoken responses.
Using the feature is easy: Just tap a headphone icon and start talking. When you stop, ChatGPT converts your words to text using OpenAI’s speech-recognition system, Whisper, generates a response and speaks the answer back to you in one of five synthetic A.I. voices, using a new text-to-speech algorithm the company developed. (The voices, which include both male and female options, were generated using short samples from professional voice actors whom OpenAI hired. I picked “Ember,” a peppy-sounding male voice.)
I tested ChatGPT’s voice feature for several hours on a bunch of different tasks — reading a bedtime story to my toddler, chatting with me about work-related stress, helping me analyze a recent dream I had. It did all of these fairly well, especially when I gave it some golden prompts and told it to emulate a friend, a therapist or a teacher.
What stood out, in these tests, is how different talking to ChatGPT feels from talking to older generations of A.I. voice assistants, like Siri and Alexa. Those assistants, even at their best, can be wooden and flat. They answer one question at a time, often by looking something up on the internet and reading it aloud word for word, or choosing from a finite number of programmed answers.
ChatGPT’s synthetic voice, by contrast, sounds fluid and natural, with slight variations in tone and cadence that make it feel less robotic. It was capable of having long, open-ended conversations on almost any subject I tried, including prompts I was pretty sure it hadn’t encountered before. (“Tell me the story of ‘The Three Little Pigs’ in the character of a total frat bro” was a sleeper hit.)
Most people probably won’t use A.I. chatbots this way. For many tasks, it’s still faster to type than talk, and waiting around for ChatGPT to read out long responses was annoying. (It didn’t help that the app was slow and glitchy at times, and often inserted pauses before responding. OpenAI told me these were technical issues with the beta version I tested and would be ironed out eventually.)
But I can see the appeal. Having an A.I. speak to you in a humanlike voice is a more intimate experience than reading its responses on a screen. And after a few hours of talking with ChatGPT this way, I felt a new warmth creeping into our conversations. Without being tethered to a text interface, I felt less pressure to come up with the perfect prompt. We chatted more casually, and I revealed more about my life.
“It almost feels like a different product,” said Peter Deng, OpenAI’s vice president of consumer and enterprise product, who spoke with me about the new voice feature. “Because you’re no longer transcribing what you have in your head into your thumbs,” he said, “you end up asking different things.”
I know what you’re thinking: Isn’t this the plot of the movie “Her”? Will lonely, lovesick users fall for ChatGPT, now that it can listen to them and talk back?
It’s possible. Personally, I never forgot that I was talking to a chatbot. And I certainly didn’t mistake ChatGPT for a conscious being, or develop emotional attachments to it.
But I also saw a glimpse of a future in which some people may let voice-based A.I. assistants into the inner sanctums of their lives — taking the A.I. chatbots with them on the go, treating them as their 24/7 confidants, therapists, sparring partners and sounding boards.
Sounds crazy, right? And yet, didn’t all of this sound a little crazy a year ago?