Multimodal: AI’s new frontier
A technology that sees the world from different angles
We are not there yet. The furthest advances in this direction have occurred in the fledgling field of multimodal AI. The problem is not a lack of vision. While a technology able to translate between modalities would clearly be valuable, Mirella Lapata, a professor at the University of Edinburgh and director of its Laboratory for Integrated Artificial Intelligence, says “it’s a lot more complicated” to execute than unimodal AI.
In practice, generative AI tools use different strategies for different types of data when building large data models, the complex neural networks that organize vast amounts of information. For example, those that draw on textual sources segment the text into individual tokens, usually words or word fragments. Each token is assigned an "embedding" or "vector": an array of numbers representing how and where that token is used compared with others. Collectively, these numbers create a mathematical representation of the token's meaning. An image model, on the other hand, might use patches of pixels as its tokens for embedding, and an audio model might use snippets of sound frequencies.
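To make the idea concrete, here is a minimal sketch in Python of tokenization and embedding lookup. It is a toy illustration rather than any real model: the vocabulary is tiny, the vectors are random placeholders, and real systems learn their embeddings from data and split words into subword tokens.

```python
import numpy as np

# Toy vocabulary and embedding table: every token gets a small vector.
# Real models learn these vectors from vast text corpora; here they are random.
rng = np.random.default_rng(0)
vocab = ["the", "old", "oak", "tree", "rustling", "leaves"]
embedding_dim = 8
embeddings = {token: rng.normal(size=embedding_dim) for token in vocab}

def tokenize(text):
    """Split text into word tokens (real tokenizers use subword units)."""
    return text.lower().split()

def embed(text):
    """Look up the vector assigned to each known token in the text."""
    return [embeddings[token] for token in tokenize(text) if token in embeddings]

vectors = embed("the old oak tree")
print(f"{len(vectors)} tokens, each represented by {embedding_dim} numbers")
```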
A multimodal AI model typically relies on several unimodal ones. As Henry Ajder, founder of AI consultancy Latent Space, puts it, this means "almost stringing together" the various contributing models. Doing so requires various techniques to align the elements of each unimodal model, in a process called fusion. For example, the word "tree", an image of an oak tree, and audio of rustling leaves might be fused in this way, allowing the model to create a multifaceted description of reality.
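One simple way to picture fusion, sketched below under heavy simplification, is to project each unimodal embedding into a shared space where related concepts land close together. The projection matrices and input vectors here are random stand-ins; in a real system they would be learned from paired examples so that the "tree" text, image, and audio genuinely align.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend outputs of three unimodal encoders for the same concept, "tree".
# Each modality produces a vector of a different size.
text_vec = rng.normal(size=16)    # from a text model, e.g. the word "tree"
image_vec = rng.normal(size=32)   # from an image model, e.g. a photo of an oak
audio_vec = rng.normal(size=24)   # from an audio model, e.g. rustling leaves

# Fusion step: projection matrices map every modality into one shared
# embedding space. Here the matrices are random placeholders.
shared_dim = 12
project_text = rng.normal(size=(shared_dim, 16))
project_image = rng.normal(size=(shared_dim, 32))
project_audio = rng.normal(size=(shared_dim, 24))

def to_shared_space(projection, vector):
    """Map a unimodal embedding into the shared multimodal space."""
    shared = projection @ vector
    return shared / np.linalg.norm(shared)  # normalize for comparison

fused = {
    "text": to_shared_space(project_text, text_vec),
    "image": to_shared_space(project_image, image_vec),
    "audio": to_shared_space(project_audio, audio_vec),
}

# After training, aligned concepts would score high across modalities.
similarity = fused["text"] @ fused["image"]
print(f"text-image similarity in the shared space: {similarity:.2f}")
```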
This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.