Summary

  • We want continuous embeddings for understanding

  • We want discrete image tokens for auto-regressive generation

  • Using both typically forces the model to process two different types of image tokens: one from a high-level semantic space and one from a low-level spatial space

    • This creates significant task conflict and competition between the two objectives, and limits model capacity

Manzano

Architecture

  • Manzano employs a unified shared visual encoder with two lightweight and specialized adapters:

    • a continuous adapter for understanding tasks
    • a discrete adapter for generation
  • Because both adapters originate from the same encoder, the hybrid representations come from a homogeneous source, which significantly mitigates task conflict in the LLM.

  • To obtain the two adapters, they first pre-train the hybrid tokenizer with a small LLM decoder to pre-align the image features with an LLM feature space.

    • During this pre-training, one of the two adapter outputs is randomly chosen and passed to the small LLM decoder for alignment (see the tokenizer sketch after this list).
  • They use a diffusion image decoder, trained from scratch, to render pixels, taking the generated image tokens as conditioning (see the decoder sketch below).
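As a concrete illustration of the dual-adapter design and the pre-alignment step above, here is a minimal PyTorch sketch. All module names, dimensions, and the plain transformer stand-in for the vision encoder are assumptions for illustration, not the paper's actual implementation.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridTokenizer(nn.Module):
    """Shared vision encoder with a continuous and a discrete adapter (illustrative sketch)."""

    def __init__(self, enc_dim=1024, llm_dim=2048, codebook_size=16384):
        super().__init__()
        # Stand-in for the shared vision encoder (the real model uses a vision transformer).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=enc_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Continuous adapter: maps encoder features into the LLM embedding space.
        self.continuous_adapter = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Discrete adapter: projection followed by nearest-neighbor quantization
        # against a learned codebook.
        self.discrete_proj = nn.Linear(enc_dim, llm_dim)
        self.codebook = nn.Embedding(codebook_size, llm_dim)

    def forward(self, patch_features):
        h = self.encoder(patch_features)                   # (B, N, enc_dim)
        cont = self.continuous_adapter(h)                  # continuous embeddings (understanding)
        z = self.discrete_proj(h)                          # (B, N, llm_dim)
        # Squared distances to every codebook entry, then pick the nearest one.
        dists = (z.pow(2).sum(-1, keepdim=True)
                 - 2 * z @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(-1))    # (B, N, K)
        ids = dists.argmin(dim=-1)                         # discrete image token ids (generation)
        quant = self.codebook(ids)
        quant = z + (quant - z).detach()                   # straight-through estimator
        return cont, ids, quant


def pre_alignment_step(tokenizer, small_llm_decoder, patch_features, text_targets):
    """One tokenizer pre-training step: randomly pick one adapter's output and feed it
    to a small LLM decoder so both branches align with an LLM feature space (sketch)."""
    cont, _, quant = tokenizer(patch_features)
    visual = cont if random.random() < 0.5 else quant
    logits = small_llm_decoder(visual)                     # (B, N, vocab); decoder is a stand-in
    return F.cross_entropy(logits.transpose(1, 2), text_targets)
```

In the full pipeline, the aligned adapters then feed the unified LLM: continuous embeddings as inputs for understanding, and discrete token ids as auto-regressive targets for generation.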
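And a toy sketch of how a diffusion image decoder can take the generated image tokens as conditioning. The FiLM-style modulation and all names here are illustrative assumptions; the paper's decoder is a full diffusion model trained from scratch.

```python
import torch.nn as nn

class TokenConditionedDenoiser(nn.Module):
    """Toy denoiser conditioned on generated image tokens (illustrative sketch only)."""

    def __init__(self, codebook_size=16384, cond_dim=512, img_channels=3):
        super().__init__()
        self.token_embed = nn.Embedding(codebook_size, cond_dim)
        self.film = nn.Linear(cond_dim, 2 * img_channels)   # per-channel scale and shift
        self.denoise = nn.Conv2d(img_channels, img_channels, kernel_size=3, padding=1)

    def forward(self, noisy_image, image_token_ids, t):
        # Pool the token embeddings into one conditioning vector; a real decoder would
        # attend over the full token sequence and also embed the timestep t.
        cond = self.token_embed(image_token_ids).mean(dim=1)          # (B, cond_dim)
        scale, shift = self.film(cond).chunk(2, dim=-1)               # (B, C) each
        x = noisy_image * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.denoise(x)                                        # predicted noise
```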

Training

  • The autoregressive multimodal LLMs are jointly trained on a mixture of pure text, image understanding, and image generation data.

  • A single joint recipe trains the model on image understanding and generation simultaneously.

  • This training consists of three stages (a configuration sketch follows this list):

    • a pre-training stage on a large-scale corpus of text-only, interleaved image-text, image-to-text (IT), and text-to-image (TI) data;
    • a continued pre-training stage on higher-quality IT and TI data;
    • a supervised fine-tuning (SFT) stage on curated text, IT, and TI instruction data
  • They freeze the vision encoder and the discrete adapter during training so that the image-token codebook stays fixed (see the freezing sketch below).
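A minimal sketch of the three-stage recipe expressed as configuration. The stage names, data-source labels, and the `train_one_stage` hook are placeholders; no mixture ratios from the paper are reproduced here.

```python
# Hypothetical stage configuration mirroring the three-stage joint recipe.
TRAINING_STAGES = [
    {
        "name": "pretraining",
        "data": ["text_only", "interleaved_image_text", "image_to_text", "text_to_image"],
        "note": "large-scale corpus",
    },
    {
        "name": "continued_pretraining",
        "data": ["image_to_text", "text_to_image"],
        "note": "higher-quality data",
    },
    {
        "name": "sft",
        "data": ["text_instructions", "image_to_text_instructions", "text_to_image_instructions"],
        "note": "curated instruction data",
    },
]

def run_recipe(model, train_one_stage, stages=TRAINING_STAGES):
    """Run the stages in order; `train_one_stage` is a user-supplied training loop."""
    for stage in stages:
        train_one_stage(model, stage)
```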
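And a sketch of the freezing step, assuming the unified model exposes `vision_encoder` and `discrete_adapter` submodules (hypothetical attribute names):

```python
import torch.nn as nn

def freeze_vision_tokenizer_parts(model: nn.Module) -> None:
    """Freeze the shared vision encoder and the discrete adapter so that the image-token
    codebook seen by the LLM stays fixed (attribute names are hypothetical)."""
    for module in (model.vision_encoder, model.discrete_adapter):
        module.eval()                         # also fixes e.g. normalization statistics
        for p in module.parameters():
            p.requires_grad_(False)
```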