• They demonstrate that joint multimodal training can achieve parity across all modalities (i.e., no modality-specific performance degradation) while markedly enhancing cross-modal capabilities such as video understanding

  • A key ingredient is mixing unimodal and cross-modal data during the early stage of text pretraining
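
As a rough illustration of what mixing unimodal and cross-modal data can look like in a data loader, the sketch below draws a data source for each batch from a fixed set of mixture weights, so cross-modal batches appear from the first training step. The source names, the weights, and the `sample_sources` helper are all hypothetical and chosen only for illustration; the summary above does not state the actual ratios or schedule.

```python
import random

# Hypothetical mixture weights (assumption): the summary gives no actual ratios.
MIXTURE = {
    "text": 0.6,        # unimodal
    "image": 0.1,       # unimodal
    "audio": 0.1,       # unimodal
    "image_text": 0.1,  # cross-modal
    "video_text": 0.1,  # cross-modal
}


def sample_sources(num_batches, mixture=MIXTURE, seed=0):
    """Pick one data source per batch, in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_batches)


# Cross-modal sources are eligible from batch 0 rather than being
# introduced in a later, separate training stage.
schedule = sample_sources(1000)
counts = {name: schedule.count(name) for name in MIXTURE}
print(counts)
```

A fixed mixture is the simplest choice here; a step-dependent schedule that changes the weights over training would be a natural variation, but nothing in the summary indicates which one was used.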