- ViT improvements in tokenization https://x.com/wenhaoli29/status/1846217454059389410?s=46
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
  - LLMs are already very close to unified models: with instruction tuning, an LLM can be finetuned into a unified autoregressive model capable of generating both text and visual tokens.
- Autoregressive image model training dynamics https://x.com/cloneofsimo/status/1868965620819374429
- JetFormer, a multimodal LLM without a VQ-VAE https://x.com/mtschannen/status/1863622784376586499?s=46