Training objective

Captioning and QA objectives

  • caption {lang}
    • Standard captioning objective, as used in the paper's main body (see the prefix/suffix sketch after this list).
  • ocr
    • Concatenation, in raster order, of all text on the image as transcribed by a public OCR system. Random snippets of the OCR output may be skipped so the transcript fits the sequence length without biasing recognition towards the beginning of the raster order (see the sketch after this list).
  • answer en {question}
    • Generated VQA on CC3M-35L, with questions in 35 languages but answers in English.
    • Additionally, English-only object-centric questions on OpenImages, of the following forms:
      • listing: What objects are in the image?
      • presence: Is {thing} in the image?
      • multi-object presence: Which of {thing}, {thing}… are in the image?
      • counting: How many {thing}?
  • question {lang} {English answer}
    • The reverse of the VQA objective: generated VQG on CC3M-35L, producing questions in 35 languages for a given English answer.
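
A minimal sketch (my own, not from the paper) of how these prefixes could be paired with their target texts to form (prefix, suffix) training examples; the task names and keyword fields are hypothetical, only the prefix strings come from the notes above:

```python
def make_example(task: str, **kw) -> tuple[str, str]:
    """Return a (prefix, suffix) pair; the model is trained to predict the suffix."""
    if task == "caption":   # caption {lang}
        return f"caption {kw['lang']}", kw["caption"]
    if task == "ocr":       # ocr
        return "ocr", kw["ocr_text"]
    if task == "vqa":       # answer en {question}
        return f"answer en {kw['question']}", kw["answer_en"]
    if task == "vqg":       # question {lang} {English answer}
        return f"question {kw['lang']} {kw['answer_en']}", kw["question"]
    raise ValueError(f"unknown task: {task}")

# Example:
# make_example("vqa", question="How many cats are there?", answer_en="2")
# -> ("answer en How many cats are there?", "2")
```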
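
And a hedged sketch of the OCR snippet-dropping idea: instead of truncating the transcript at the end, whole snippets are dropped at random so the kept text is not biased towards the beginning of raster order. The `count_tokens` helper is a placeholder for the actual tokenizer:

```python
import random

def fit_ocr(snippets: list[str], max_tokens: int, count_tokens) -> str:
    """Keep a random subset of OCR snippets within a token budget,
    then re-emit the kept snippets in raster order."""
    order = list(range(len(snippets)))
    random.shuffle(order)               # consider snippets in random order...
    kept, budget = [], max_tokens
    for i in order:
        n = count_tokens(snippets[i])
        if n <= budget:                 # ...and greedily keep those that still fit
            kept.append(i)
            budget -= n
    return " ".join(snippets[i] for i in sorted(kept))  # raster order preserved

# Example with a whitespace-token counter:
# fit_ocr(["STOP", "Main St", "Speed limit 25"], max_tokens=3,
#         count_tokens=lambda s: len(s.split()))
```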

Classic CV tasks

  • They add new tokens to Gemma’s vocabulary so that PaliGemma can perform more structured computer vision tasks. They add 1024 location tokens (<loc0000> to <loc1023>), which correspond to binned normalized image coordinates and are used in detection, referring-expression, and grounded-captioning tasks (see the encode/decode sketch after this list).

    • This is a different strategy from Molmo and PixMo, which for pointing expect output points as plain-text coordinates normalized between 0 and 100.
  • They also add 128 mask tokens (<seg000> to <seg127>), the codebook of a VQ-VAE that tokenizes single-object masks, to support referring-expression segmentation (see the parsing sketch after this list).

  • detect {thing} ; {thing} ; ...

    • Multi-object detection, similar to Pix2Seq, on open-world data generated via pseudo-labeling as described in OWL-ViTv2.
  • caption <ymin><xmin><ymax><xmax>

    • Grounded captioning of what is in the box, following LocCa. The box is indicated by the same location tokens as used in detection: normalized image coordinates binned into 1024 tokens.
  • segment {thing} ; {thing} ; ...

    • Multi-object instance segmentation, as in PaLI-3, on open-world data generated similarly to OWL-ViTv2 and SAM.
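
A minimal sketch of the location-token scheme, assuming coordinates normalized to [0, 1] and binned into 1024 values; the exact `detect` output layout (four location tokens followed by the label, objects separated by ';') is my assumption:

```python
import re

NUM_BINS = 1024  # <loc0000> ... <loc1023>

def coord_to_loc(v: float) -> str:
    """Map a coordinate normalized to [0, 1] onto one of 1024 location tokens."""
    b = min(int(v * NUM_BINS), NUM_BINS - 1)
    return f"<loc{b:04d}>"

def loc_to_coord(bin_id: int) -> float:
    """Map a location-token bin back to the center of its coordinate bin."""
    return (bin_id + 0.5) / NUM_BINS

def box_to_tokens(ymin: float, xmin: float, ymax: float, xmax: float) -> str:
    # Boxes are written as <ymin><xmin><ymax><xmax>, as in the grounded-captioning prefix.
    return "".join(coord_to_loc(v) for v in (ymin, xmin, ymax, xmax))

def parse_detect_suffix(text: str):
    """Parse e.g. '<loc0010><loc0020><loc0500><loc0900> cat ; ... dog'."""
    objects = []
    for chunk in text.split(";"):
        bins = [int(b) for b in re.findall(r"<loc(\d{4})>", chunk)]
        label = re.sub(r"<loc\d{4}>", "", chunk).strip()
        if len(bins) == 4:
            objects.append((label, [loc_to_coord(b) for b in bins]))
    return objects
```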
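
Similarly, a hedged sketch of reading a `segment` suffix. The per-object layout assumed here (a box as four location tokens, then a fixed number of <seg> mask tokens that a VQ-VAE decoder would turn back into a binary mask inside the box, here taken to be 16, then the label) is not stated above:

```python
import re

NUM_MASK_TOKENS = 16  # assumed number of <seg> tokens emitted per mask

def parse_segment_suffix(text: str):
    """Parse one object per ';'-separated chunk: 4 <loc> tokens (box),
    then the <seg> codes a VQ-VAE decoder would turn into a binary mask."""
    objects = []
    for chunk in text.split(";"):
        box_bins = [int(b) for b in re.findall(r"<loc(\d{4})>", chunk)]
        mask_codes = [int(s) for s in re.findall(r"<seg(\d{3})>", chunk)]
        label = re.sub(r"<(?:loc\d{4}|seg\d{3})>", "", chunk).strip()
        if len(box_bins) == 4 and len(mask_codes) == NUM_MASK_TOKENS:
            objects.append({
                "label": label,
                "box": [(b + 0.5) / 1024 for b in box_bins],  # ymin, xmin, ymax, xmax
                "mask_codes": mask_codes,  # fed to the VQ-VAE decoder (not shown)
            })
    return objects
```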