• During training,
    • The Modality Encoder, LLM Backbone, and Modality Generator are generally kept frozen.
    • Optimization focuses on the Input and Output Projectors.
    • Since the Projectors are lightweight, trainable parameters make up a notably small share of the total parameter count (typically around 2%); the sketch below illustrates the setup.
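
A minimal PyTorch sketch of this setup, with stand-in modules and illustrative dimensions (a real system would load e.g. CLIP, a LLaMA-class LLM, and Stable Diffusion instead):

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradients so the optimizer never touches these weights."""
    for p in module.parameters():
        p.requires_grad = False

# Stand-ins for the pretrained components (toy sizes; with a real
# multi-billion-parameter LLM the trainable share lands around 2%).
encoder   = nn.Linear(768, 1024)    # Modality Encoder   (frozen)
llm       = nn.Linear(4096, 4096)   # LLM Backbone       (frozen)
generator = nn.Linear(768, 768)     # Modality Generator (frozen)

in_proj  = nn.Linear(1024, 4096)    # Input Projector    (trainable)
out_proj = nn.Linear(4096, 768)     # Output Projector   (trainable)

for m in (encoder, llm, generator):
    freeze(m)

modules = [encoder, llm, generator, in_proj, out_proj]
trainable = sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)
total = sum(p.numel() for m in modules for p in m.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```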

General architecture components

Modality Encoder

  • Given an input I_X of modality X, we want to extract features F_X = ME_X(I_X) (see the sketches after this list)

  • Vision: CLIP, SigLIP

  • Audio: Whisper, CLAP

  • ImageBind: joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data.

    • allows for multimodal-conditioned generation
    • aligns each modality’s embeddings to the image embeddings (potentially extracted from CLIP)
    • requires no explicitly aligned, pseudo-labeled dataset covering all six modalities; pairing each modality with images is sufficient
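
A minimal sketch of feature extraction with a frozen CLIP vision tower via Hugging Face transformers (the checkpoint name and image path are illustrative choices):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Frozen CLIP vision tower as the Modality Encoder ME_X for images.
name = "openai/clip-vit-large-patch14"
processor = CLIPImageProcessor.from_pretrained(name)
encoder = CLIPVisionModel.from_pretrained(name).eval()

image = Image.open("example.jpg")              # I_X: raw image input
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# F_X = ME_X(I_X): one embedding per image patch plus [CLS],
# ready to be handed to the Input Projector.
F_X = outputs.last_hidden_state                # (1, 257, 1024) for ViT-L/14
```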
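
ImageBind's alignment objective is essentially a contrastive (InfoNCE) loss between each modality and images. A minimal sketch with dummy embeddings; the symmetric form, temperature value, and dimensions are assumptions about the exact recipe:

```python
import torch
import torch.nn.functional as F

def infonce(q: torch.Tensor, k: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (modality, image) pairs in the batch are
    positives; every other pairing serves as a negative."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / tau                # (B, B) similarity matrix
    targets = torch.arange(q.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Dummy batch: audio clips aligned to the image embeddings of their
# naturally co-occurring video frames.
audio_emb = torch.randn(8, 1024)            # from the audio encoder
image_emb = torch.randn(8, 1024)            # from the (frozen) image encoder
loss = infonce(audio_emb, image_emb)
```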

Input Projector

  • The Input Projector Θ_{X→T} is tasked with aligning the encoded features F_X of other modalities with the text feature space T. The aligned features, as prompts P_X = Θ_{X→T}(F_X), are then fed into the LLM Backbone alongside the textual features F_T.
  • Given an X-text dataset {I_X, t}, the goal is to minimize the X-conditioned text generation loss L_txt-gen = ℓ(LLM(P_X, F_T), t)
    • If we're also generating the modality, i.e. the LLM additionally emits signal tokens S_X, the loss becomes ℓ(LLM(P_X, F_T), [t, S_X]), with the signal tokens included in the generation target (see the sketch below)
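
A minimal sketch of the Input Projector and the X-conditioned text generation loss, with a plain linear projector (real systems also use MLPs, cross-attention, or Q-Formers) and dummy tensors in place of the encoder and LLM:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_enc, d_llm, vocab = 1024, 4096, 32000    # illustrative dimensions
input_projector = nn.Linear(d_enc, d_llm)  # Θ_{X→T}

F_X = torch.randn(1, 257, d_enc)           # encoder features (e.g. from CLIP)
P_X = input_projector(F_X)                 # aligned prompts P_X = Θ_{X→T}(F_X)
F_T = torch.randn(1, 12, d_llm)            # embedded text tokens

llm_inputs = torch.cat([P_X, F_T], dim=1)  # [P_X; F_T] fed to the LLM Backbone
logits = torch.randn(1, llm_inputs.size(1), vocab)  # pretend LLM output

# L_txt-gen: cross-entropy on the text positions only; the prompt
# positions are masked out with the ignore index -100.
labels = torch.full((1, llm_inputs.size(1)), -100)
labels[:, P_X.size(1):] = torch.randint(0, vocab, (1, F_T.size(1)))
loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)
```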

LLM backbone

  • It produces (1) direct textual outputs t, and (2) signal tokens S_X for the other modalities (if any): t, S_X = LLM(P_X, F_T)
    • These signal tokens act as instructions that guide the generator on whether to produce MM content and, if so, what content to produce (a sketch of one common signal-token implementation follows)
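
One common way to realize signal tokens in practice is to append special tokens to the LLM's vocabulary and scan the decoded output for them; the base model and token strings below are purely illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical signal tokens S_X for the image modality.
signal_tokens = [f"<img_{i}>" for i in range(4)]
tok.add_special_tokens({"additional_special_tokens": signal_tokens})
model.resize_token_embeddings(len(tok))    # new embedding rows for new tokens

# At inference time, the presence of signal tokens in the decoded output
# tells the system whether to invoke the Modality Generator.
decoded = "Here is a sunset: <img_0><img_1><img_2><img_3>"
wants_image = any(s in decoded for s in signal_tokens)
```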

Output Projector

  • The Output Projector Θ_{T→X} maps the signal token representations S_X from the LLM Backbone into features H_X understandable to the following Modality Generator MG_X.
  • Given an X-text dataset {I_X, t}, t is first fed into the LLM to generate the corresponding S_X, which is then mapped into H_X = Θ_{T→X}(S_X)
    • To facilitate alignment of the mapped features H_X, the goal is to minimize the distance between H_X and the conditional text representations of MG_X: min L_mse(H_X, τ_X(t)) (see the sketch below)
      • τ_X is the textual condition encoder in MG_X
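
A minimal sketch of the Output Projector alignment objective, with dummy tensors standing in for S_X and τ_X(t) (dimensions are illustrative; many systems use a small transformer rather than a linear map here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_llm, d_cond, n_signal = 4096, 768, 4       # illustrative dimensions
output_projector = nn.Linear(d_llm, d_cond)  # Θ_{T→X}

S_X = torch.randn(1, n_signal, d_llm)        # signal-token states from the LLM
H_X = output_projector(S_X)                  # H_X = Θ_{T→X}(S_X), fed to MG_X

# τ_X(t): the generator's own encoding of the caption t, produced by its
# frozen textual condition encoder (dummy tensor here).
tau_t = torch.randn(1, n_signal, d_cond)

loss = F.mse_loss(H_X, tau_t)                # min L_mse(H_X, τ_X(t))
```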

Modality Decoder/Generator

  • The Modality Generator MG_X is tasked with producing outputs in distinct modalities.
  • The system also needs to be able to parse the LLM response when it is text-only, i.e. when H_X is the identity and no generator is invoked.
  • Common:
    • Stable Diffusion (image)
    • Zeroscope, VideoFusion (video)
    • AudioLDM-2 (audio)
  • Can also compute a text-conditioned noise-matching (LDM) loss to tune the Input and Output Projectors: the ground-truth content is encoded into latent features z_0 by a pretrained VAE and noised into z_t, and a frozen UNet ε_X predicts the added noise, giving L_X-gen = E_{ε~N(0,1), t} ||ε - ε_X(z_t, t, H_X)||_2^2 (see the sketch below)
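
A sketch of that noise-matching loss with a Stable Diffusion-style generator via diffusers; the checkpoint id is one common choice, and the image and H_X tensors are dummies standing in for real data and projector outputs:

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

repo = "runwayml/stable-diffusion-v1-5"    # illustrative checkpoint
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").eval()
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

image = torch.randn(1, 3, 512, 512)        # ground-truth image (dummy)
H_X = torch.randn(1, 77, 768)              # Output Projector features (dummy)

with torch.no_grad():                      # VAE stays frozen
    z0 = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

noise = torch.randn_like(z0)               # ε ~ N(0, 1)
t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
zt = scheduler.add_noise(z0, noise, t)     # diffuse z0 to timestep t

# The frozen UNet predicts the noise conditioned on H_X; in training,
# gradients flow only back to the projectors that produced H_X.
eps_pred = unet(zt, t, encoder_hidden_states=H_X).sample
loss = F.mse_loss(eps_pred, noise)         # L_X-gen
```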