🤖 Harold's Notes

Multi-modal

Dec 16, 2024

Scaling Laws for Generative Mixed-Modal Language Models
ViT improvements in tokenization https://x.com/wenhaoli29/status/1846217454059389410?s=46
DPO for VLMs
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- LLMs are already very close to unified models: instruction tuning can turn an LLM into a unified autoregressive model that generates both text and visual tokens.
Autoregressive image model training dynamics https://x.com/cloneofsimo/status/1868965620819374429
JetFormer, a multimodal LLM without VQ-VAE https://x.com/mtschannen/status/1863622784376586499?s=46
OmniGen is a unified image generation model that can generate a wide range of images from multi-modal prompts

Staples

Sigmoid Loss for Language Image Pre-Training
PaliGemma: A versatile 3B VLM for transfer
Scaling Vision Transformers
