Multimodal AI
On the convergence of text, image, audio, and whatever else, as intelligence learns to get very general indeed
November 10, 2022 — April 1, 2025
I would like to have a full theory of multimodality in AI. Maybe that is coming soon. The basic idea is that we want a shared semantic space for data of various modalities (such as text, images, and audio) that allows computing across them (e.g. text-to-image generation).
How does this work?
The following is a sequential ordering of models and discoveries, but not really a unifying theory of them; I do not understand enough to have one of those.
1 CLIP
The modern era of multimodal AI began with OpenAI’s CLIP (Contrastive Language-Image Pre-training) (Radford et al. 2021), which solved the problem of how to create a shared semantic space between images and text.
CLIP’s key insight was using contrastive learning to align these modalities. Rather than predicting pixels from text or vice versa, CLIP trained on 400 million image-text pairs by maximising agreement between matched pairs while minimising it for unmatched ones. Mathematically, this meant:
- Encoding images via a vision transformer: \(z_i = E_v(image)\)
- Encoding text via a text transformer: \(z_t = E_t(text)\)
- Optimising a contrastive (InfoNCE) loss, here written for the image-to-text direction (CLIP symmetrises it across both directions): \(L = -\log\frac{\exp(\text{sim}(z_i, z_t)/\tau)}{\sum_j\exp(\text{sim}(z_i, z_{t_j})/\tau)}\)
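For concreteness, here is a minimal PyTorch sketch of that symmetric contrastive loss. The encoders are elided (random tensors stand in for the outputs of \(E_v\) and \(E_t\)), the temperature is fixed rather than learned as in CLIP, and the function name and batch conventions are mine, not OpenAI’s:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) embeddings.

    z_img, z_txt: (batch, dim) encoder outputs. Matched pairs share a row
    index; every other row in the batch acts as a negative.
    """
    # Cosine similarity = dot product of L2-normalised embeddings
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z_img.shape[0])            # diagonal entries are the positives
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "embeddings" standing in for encoder outputs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```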
This approach created a joint embedding space where semantically similar concepts clustered together regardless of modality. A photo of a dog and the word “dog” would map to nearby vectors, enabling zero-shot classification, cross-modal retrieval, and all that other weird vector embedding stuff.
A very interesting exploration of the weaknesses of the method is Kang et al. (2025); it is also IMO a clearer explanation of what is happening than the original paper.
2 Conditioned diffusion etc
While CLIP established alignment, DALL-E (Ramesh et al. 2021) demonstrated generation, training an autoregressive transformer over discrete image tokens (with CLIP relegated to reranking the candidate outputs). The results were… ok. Conditional diffusion models such as Stable Diffusion (Rombach et al. 2022) took this further, introducing a generative process that could be conditioned on CLIP text embeddings. The recipe:
- Start with pure noise
- Gradually denoise through a series of steps: \(x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c)\right) + \sigma_t z\), with fresh noise \(z \sim \mathcal{N}(0, I)\) on every step but the last
- Condition this process on CLIP text embeddings (\(c\))
This allowed generating high-quality images from text by guiding the denoising process through the shared CLIP embedding space. The mathematical elegance lay in framing generation as reversing a Markov diffusion process, with text embeddings steering this reversal.
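A minimal sketch of that sampling loop, assuming a trained noise predictor `eps_model(x_t, t, c)` and a linear \(\beta\) schedule (both are stand-ins; Stable Diffusion itself runs this in a learned latent space and adds classifier-free guidance, which I omit here):

```python
import torch

def sample_conditional_ddpm(eps_model, c, shape, T=1000, device="cpu"):
    """Ancestral DDPM sampling conditioned on a text embedding c.

    eps_model(x_t, t, c) is assumed to predict the noise added at step t.
    """
    betas = torch.linspace(1e-4, 0.02, T, device=device)  # linear noise schedule (an assumption)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)                 # start from pure noise
    for t in reversed(range(T)):
        eps = eps_model(x, torch.full((shape[0],), t, device=device), c)
        # Mean of the reverse step, as in the update rule above
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # inject noise except at the final step
        else:
            x = mean
    return x
```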
Practically useful tips for these bad boys are filed under generative art.
3 Flamingo and In-Context Learning
DeepMind’s Flamingo augmented a frozen LLM with a contrastively pretrained (CLIP-style) vision encoder so that it could process interleaved images and text. Flamingo introduced:
- Perceiver Resampler: Converts variable-sized visual features to fixed-length tokens
- Gated Cross-Attention: Injects visual information into a frozen LLM via \(h' = h + \tanh(g) \cdot \text{Attention}(W_q h, W_k v, W_v v)\), where the gate \(g\) is initialised at zero so the pretrained LLM is undisturbed at the start of training (a code sketch follows below)
This architecture enabled in-context visual learning. For example, showing Flamingo several examples of identifying the odd object in images, then asking it to do the same for a new image — without explicit training on this task.
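A toy version of the gated cross-attention block, to make the gating explicit. The module name and shapes are mine; the point is that the text hidden states query the visual tokens, and the tanh gate starts closed:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention: text queries attend to visual tokens.

    The tanh gate starts at zero, so the frozen LLM's behaviour is initially unchanged.
    """
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at initialisation

    def forward(self, h, v):
        # h: (batch, n_text_tokens, dim) hidden states from the frozen LLM
        # v: (batch, n_visual_tokens, dim) visual tokens from the Perceiver Resampler
        attended, _ = self.attn(query=h, key=v, value=v)
        return h + torch.tanh(self.gate) * attended

block = GatedCrossAttention(dim=512)
out = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```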
4 Multimodal LLMs: GPT-4V and Gemini
The next evolution came with models like GPT-4V and Gemini, which integrated vision capabilities directly into LLMs. These systems:
- Encode images into tokens compatible with the LLM vocabulary
- Interleave these tokens with text in the attention mechanism
- Apply causal masking to maintain the autoregressive property
This allowed for complex reasoning about visual content. For instance, GPT-4V could analyze a refrigerator’s contents and suggest recipes, demonstrating both visual understanding and practical reasoning.
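The exact internals of GPT-4V and Gemini are not public, so the following is only a hypothetical sketch of the general recipe: project image patch features into the LLM’s embedding space, splice them into the token sequence, and let ordinary causal self-attention do the rest. All dimensions and names here are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: a 1024-d vision encoder feeding a 4096-d LLM.
d_vision, d_model = 1024, 4096
vision_proj = nn.Linear(d_vision, d_model)        # learned adapter into the LLM's embedding space

patch_feats = torch.randn(1, 256, d_vision)       # 256 patch features from an image encoder
image_tokens = vision_proj(patch_feats)           # now they "speak" the LLM's embedding language

text_before = torch.randn(1, 10, d_model)         # embeddings of the text preceding the image
text_after = torch.randn(1, 5, d_model)           # ... and of the text following it

# Interleave text and image tokens into one sequence; a standard causal mask then
# lets later text positions attend to the image, but never the reverse.
sequence = torch.cat([text_before, image_tokens, text_after], dim=1)
seq_len = sequence.shape[1]
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
# `sequence` and `causal_mask` would then be fed through the LLM's transformer blocks.
```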
5 Unified Embedding Spaces
Recent advances focus on creating seamlessly unified embedding spaces across more modalities. I need to learn what is happening here, but let me dump some keywords and get an LLM to expand them:
5.1 Hyperbolic Embeddings
Euclidean spaces struggle with hierarchical relationships. Hyperbolic embeddings solve this by representing data in negatively curved Poincaré or Lorentz spaces, where:
\(d(x,y) = \text{acosh}(1 + 2\frac{||x-y||^2}{(1-||x||^2)(1-||y||^2)})\)
This allows embedding hierarchical structures (like “dog” → “mammal” → “animal”) more efficiently, preserving both broad categorical relationships and fine-grained distinctions.
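That distance is easy to compute directly; a small sketch (the clamping is mine, to keep points numerically inside the unit ball):

```python
import torch

def poincare_distance(x, y, eps=1e-5):
    """Geodesic distance between points in the Poincaré ball (||x||, ||y|| < 1)."""
    sq_norm_x = (x * x).sum(dim=-1)
    sq_norm_y = (y * y).sum(dim=-1)
    sq_dist = ((x - y) ** 2).sum(dim=-1)
    denom = (1 - sq_norm_x).clamp_min(eps) * (1 - sq_norm_y).clamp_min(eps)
    return torch.acosh(1 + 2 * sq_dist / denom)

# Points near the boundary are exponentially "far" from one another,
# which is what lets tree-like hierarchies embed with low distortion.
print(poincare_distance(torch.tensor([0.1, 0.0]), torch.tensor([0.0, 0.9])))
```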
5.2 Optimal Transport Alignment
Rather than point-wise alignment, modern approaches use optimal transport theory to align entire distributions between modalities. Given distributions \(P\) and \(Q\) from different modalities:
- Compute the optimal transport plan \(\gamma^*\) minimising \(\int c(x,y)\,d\gamma(x,y)\) over couplings \(\gamma\) whose marginals are \(P\) and \(Q\)
- Learn encoders that minimise this transport cost between modality distributions
This preserves the global structure of each modality’s semantic space while aligning them. For example, in medical applications, this allows aligning MRI images, patient records, and diagnostic text while preserving the unique statistical properties of each.
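A minimal Sinkhorn sketch of that idea, computing an entropically regularised transport plan between two clouds of embeddings. The uniform marginals, squared-Euclidean cost, and cost rescaling are my assumptions; in practice one would reach for a library such as POT:

```python
import torch

def sinkhorn_plan(x, y, epsilon=0.05, n_iters=200):
    """Entropically regularised OT plan between two embedding clouds.

    x: (n, d) embeddings from modality A; y: (m, d) embeddings from modality B.
    Returns an (n, m) coupling with (approximately) uniform marginals.
    """
    cost = torch.cdist(x, y) ** 2
    cost = cost / cost.max()                          # rescale so the exp() below stays well-behaved
    K = torch.exp(-cost / epsilon)                    # Gibbs kernel
    a = torch.full((x.shape[0],), 1.0 / x.shape[0])   # uniform source marginal
    b = torch.full((y.shape[0],), 1.0 / y.shape[0])   # uniform target marginal
    u = torch.ones_like(a)
    for _ in range(n_iters):                          # Sinkhorn-Knopp iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                # the transport plan gamma*

x, y = torch.randn(128, 64), torch.randn(96, 64)      # stand-ins for two modalities' embeddings
plan = sinkhorn_plan(x, y)
alignment_loss = (plan * torch.cdist(x, y) ** 2).sum()  # transport cost, usable as a training objective
```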
5.3 Neural Operators for Continuous Modalities
For continuous signals like video or audio, Fourier Neural Operators (FNOs) have emerged as powerful tools. FNOs:
- Lift input functions to a higher-dimensional space
- Apply convolutions in the Fourier domain: \((\mathcal{K}v)(x) = \mathcal{F}^{-1}(R \cdot \mathcal{F}(v))(x)\)
- Project back to the output space
This allows processing time-series data alongside discrete modalities. For instance, in autonomous vehicles, FNOs can process continuous LiDAR streams alongside discrete command inputs in a unified framework.
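A bare-bones 1-D spectral convolution layer of the kind FNOs stack, just to show where the Fourier-domain multiplication happens; the channel count and mode truncation are arbitrary choices of mine:

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Core FNO layer: convolution applied as multiplication in the Fourier domain."""
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes  # number of low-frequency Fourier modes to keep
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat)
        )

    def forward(self, v):
        # v: (batch, channels, n_points) samples of the input function
        v_hat = torch.fft.rfft(v)                          # F(v)
        out_hat = torch.zeros_like(v_hat)
        out_hat[:, :, :self.modes] = torch.einsum(         # R . F(v) on the retained modes
            "bim,iom->bom", v_hat[:, :, :self.modes], self.weight
        )
        return torch.fft.irfft(out_hat, n=v.shape[-1])     # F^{-1}(R . F(v))

layer = SpectralConv1d(channels=8, modes=16)
out = layer(torch.randn(4, 8, 128))   # e.g. a batch of 1-D signals (audio frames, LiDAR sweeps)
```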
5.4 Platonic representation hypothesis
Are the semantics of embeddings for different modalities represented in a common “Platonic” space that is universal across different architectures (Huh et al. 2024)? If so, should we care?