Multimodality is set to redefine how businesses leverage AI in 2025. Imagine AI that understands not only text, but also images, audio, and other sensor data. Humans are naturally multimodal, but we are limited in how much input we can process. Take health care. During my time at Google Health, I heard many stories of patients overwhelming doctors with data:
Imagine a patient with atrial fibrillation (AFIB) presenting with five years of detailed sleep data collected from their smartwatch. Or take the case of the cancer patient who arrives with a 20-pound stack of medical records documenting all the treatments he received. These two situations are very real. For doctors, the challenge is the same: separating the signal from the noise.
What is needed is an AI that can summarize and highlight key points. Large language models, like ChatGPT, already do this with text, extracting the most relevant information. But what if we could teach AI to do the same with other types of data, like images, time series, or lab results?
How does multimodal AI work?
To understand how multimodality works, let’s start with the fact that AI needs data both to train and to make predictions. Multimodal AI is designed to handle various data sources simultaneously: text, images, audio, video, and even time series data. By combining these inputs, multimodal AI offers a richer and more complete understanding of the problems it addresses.
Multimodal AI is, above all, a discovery tool. The AI stores the different data modalities in a shared representation. When a new data point is entered, the AI finds similar cases nearby. For example, by entering sleep data from a person’s smartwatch along with information about their episodes of atrial fibrillation (AFIB), a doctor could find indications of sleep apnea.
Note that this is based on “proximity” and not correlation. It is the scaled-up version of what Amazon once popularized: “People who bought this item also bought that item.” In this case, it is rather: “People with this type of sleep pattern have also been diagnosed with AFIB.”
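To make the proximity idea concrete, here is a minimal sketch assuming patient records have already been encoded into vectors: rank stored cases by cosine similarity to a new patient’s embedding. The patient_db entries and the embedding dimension are hypothetical stand-ins, not a real clinical model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-in: in a real system, a multimodal encoder would turn
# sleep traces and clinical notes into these vectors.
rng = np.random.default_rng(0)
patient_db = {
    "patient_A (AFIB, sleep apnea)": rng.normal(size=64),
    "patient_B (healthy)":           rng.normal(size=64),
    "patient_C (AFIB)":              rng.normal(size=64),
}

new_patient = rng.normal(size=64)  # embedding of the new patient's smartwatch data

# "Proximity, not correlation": rank stored cases by closeness in the latent space.
neighbors = sorted(
    patient_db.items(),
    key=lambda item: cosine_similarity(new_patient, item[1]),
    reverse=True,
)
for name, _ in neighbors[:2]:
    print("similar case:", name)
```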
Multimodality explained: encoders, fusion, and decoders
A multimodal AI system consists of three main components: encoders, fusion, and decoders.
Encoders: encoding any modality
Encoders convert raw data (text, images, sounds, log files, etc.) into a representation the AI can work with. These representations are called vectors, and they are stored in a latent space. To simplify, think of this process as storing an item in a warehouse (the latent space), where each item has a specific location (its vector). Encoders can process virtually anything: images, text, audio, video, log files, IoT sensor readings, time series, and more.
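Here is a minimal sketch of the warehouse metaphor, assuming one encoder per modality that maps its input to a fixed-length vector in a shared latent space. The encode_* functions below are hypothetical stubs (deterministic random vectors keyed on the input) standing in for real trained models.

```python
import numpy as np

LATENT_DIM = 128  # every modality lands in the same 128-dimensional "warehouse"

# Hypothetical stubs: real systems would use trained models here
# (e.g., a text transformer, an image encoder, a time-series encoder).
def encode_text(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=LATENT_DIM)

def encode_image(pixels: np.ndarray) -> np.ndarray:
    rng = np.random.default_rng(int(pixels.sum()) % (2**32))
    return rng.normal(size=LATENT_DIM)

def encode_time_series(values: list[float]) -> np.ndarray:
    rng = np.random.default_rng(int(sum(values)) % (2**32))
    return rng.normal(size=LATENT_DIM)

# Each item gets a "location" (vector) in the shared latent space.
note_vec  = encode_text("patient reports daytime fatigue")
photo_vec = encode_image(np.zeros((224, 224, 3)))
sleep_vec = encode_time_series([7.1, 6.4, 5.9, 8.0])

print(note_vec.shape, photo_vec.shape, sleep_vec.shape)  # all (128,)
```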
Fusion mechanism: combining the modalities
When working with a single data type, like images, encoding is enough. But with multiple types of data (images, sounds, text, or time series), we need to fuse the information to find what is most relevant.
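One simple way to illustrate fusion, sketched below, is a weighted average of the per-modality vectors. Production systems typically learn this combination (for example, with cross-attention) rather than hard-coding it, so treat the fuse function and its weights as illustrative assumptions.

```python
import numpy as np

def fuse(vectors: dict[str, np.ndarray], weights: dict[str, float]) -> np.ndarray:
    """Fuse per-modality vectors into one joint representation.

    Here: a weighted average. Real systems usually learn the weighting
    (e.g., via cross-attention) instead of fixing it by hand.
    """
    total = sum(weights.values())
    fused = sum(weights[name] * vec for name, vec in vectors.items()) / total
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(1)
modality_vectors = {
    "text":        rng.normal(size=128),
    "image":       rng.normal(size=128),
    "time_series": rng.normal(size=128),
}

# One vector summarizing all three inputs.
joint = fuse(modality_vectors, {"text": 1.0, "image": 1.0, "time_series": 0.5})
print(joint.shape)  # (128,)
```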
Decoders: generating outputs we understand
Decoders “decode” information from the latent space (the warehouse) and deliver it to us, turning raw, abstract representations into something we can understand, for example, retrieving a picture of a “house.”
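A minimal sketch of the retrieval view of decoding, assuming a warehouse of already-encoded images: return the stored item whose vector is closest to the query. Generative decoders that synthesize new text or images are more involved; the catalog here is made up.

```python
import numpy as np

def decode_nearest(query_vec: np.ndarray, warehouse: dict[str, np.ndarray]) -> str:
    """Return the name of the stored item whose vector is closest to the query."""
    return max(
        warehouse,
        key=lambda name: float(
            np.dot(query_vec, warehouse[name])
            / (np.linalg.norm(query_vec) * np.linalg.norm(warehouse[name]))
        ),
    )

# Made-up catalog of already-encoded images.
rng = np.random.default_rng(2)
warehouse = {name: rng.normal(size=128) for name in ["house.jpg", "car.jpg", "tree.jpg"]}

# A "house"-like query vector: close to the stored house image, plus a little noise.
query = warehouse["house.jpg"] + 0.1 * rng.normal(size=128)
print(decode_nearest(query, warehouse))  # -> house.jpg
```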
If you want to learn more about encoding, decoding, and re-ranking, join my eCornell online certificate course, “Design and Build AI Solutions.” It’s a no-coding program that explores all aspects of AI solutions.
Transforming e-commerce with multimodality
Let’s look at another example: e-commerce. Amazon’s interface hasn’t changed much in 25 years: you type in a keyword, scroll through the results, and hope to find what you need. Multimodality can transform this experience by allowing you to describe a product, upload a photo, or provide context to find your ideal solution.
Getting search right with multimodal AI
At R2Decide, a company started by a few Cornellians and me, we use multimodality to merge search, navigation, and chat into a single seamless flow. Our clients are e-commerce businesses tired of losing revenue because their users couldn’t find what they needed. Multimodal AI is at the heart of our solution.
For example, in an online jewelry store, a user searching for “green” would, in the past, only see green jewelry if the word “green” appeared in the product text. R2Decide’s AI also encodes product images into a shared latent space (the warehouse), so it finds “green” across all modalities. Items are then re-ranked based on the user’s previous searches and clicks to ensure they see the most relevant “green” options.
Users can also search for broader contexts, such as “wedding,” “red dress,” or “gothic.” The AI encodes these inputs into the latent space, matches them to the appropriate products, and displays the most relevant results. This ability even extends to brand names like “Swarovski,” surfacing relevant items even if the store doesn’t officially carry Swarovski products.
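To make this flow concrete, here is a minimal sketch of multimodal search with re-ranking, not R2Decide’s actual implementation: the query is matched against both text and image vectors in a shared space, then boosted toward items similar to the user’s recent clicks. The encode helper, the catalog, and the boost weight are all hypothetical.

```python
import numpy as np

DIM = 64

def encode(text_or_image_id: str) -> np.ndarray:
    """Hypothetical shared encoder: maps queries, product text, and product
    images into one latent space (a stand-in for a real multimodal model)."""
    v = np.random.default_rng(abs(hash(text_or_image_id)) % (2**32)).normal(size=DIM)
    return v / np.linalg.norm(v)

# Each product is represented by a text vector and an image vector in the same space.
catalog = {
    "emerald pendant": {"text": encode("emerald pendant"), "image": encode("img_001")},
    "silver bracelet": {"text": encode("silver bracelet"), "image": encode("img_002")},
    "jade ring":       {"text": encode("jade ring"),       "image": encode("img_003")},
}

def score(query_vec: np.ndarray, product: dict, click_history: list[str]) -> float:
    """Match the query against BOTH modalities, then boost items similar to past clicks."""
    match = max(float(query_vec @ product["text"]), float(query_vec @ product["image"]))
    boost = sum(float(product["text"] @ encode(c)) for c in click_history) * 0.1
    return match + boost

query = encode("green")
clicks = ["jade ring"]  # the user's recent clicks

ranked = sorted(catalog, key=lambda name: score(query, catalog[name], clicks), reverse=True)
print(ranked)
```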
AI-generated nudges to give chat-like advice
Alongside search results, R2Decide also generates AI-powered nudges: contextual recommendations or prompts designed to improve the user experience. These nudges are powered by AI agents, as I described yesterday in my article on agentic AI. Their goal is to effortlessly guide users to the most relevant options, making the search process intuitive, engaging and efficient.
Multimodality in 2025: endless possibilities for businesses
Multimodality is transforming industries from healthcare to e-commerce. And it doesn’t stop there. Startups like TC Laboratories use multimodal AI to streamline engineering workflows, improving efficiency and quality, while Toyota uses it for interactive and personalized customer support.
2025 will be the year when multimodal AI transforms how businesses operate. Follow me here on Forbes or on LinkedIn for more of my AI predictions for 2025.