Training objective
Captioning and QA objectives
- `caption {lang}`
  - main body
- `ocr`
  - Concatenation (in raster order) of all text on the image transcribed by a public OCR system. Random snippets of the OCR output may be dropped so the transcript fits the sequence length without biasing recognition towards the beginning of the raster order.
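A minimal sketch of that random-snippet dropping, assuming the OCR output arrives as a list of snippets in raster order; the function name and the toy tokenizer are illustrative, not the actual pipeline:

```python
import random

def fit_ocr_to_budget(snippets, tokenizer, max_tokens, rng=None):
    """Drop random OCR snippets (keeping the raster order of the rest) until
    the concatenated transcript fits max_tokens. Dropping random positions,
    rather than truncating at the end, avoids biasing recognition towards the
    beginning of the raster order."""
    rng = rng or random.Random(0)
    kept = list(snippets)
    while kept and len(tokenizer(" ".join(kept))) > max_tokens:
        kept.pop(rng.randrange(len(kept)))  # remove a snippet at a random position
    return " ".join(kept)

# Toy example with a whitespace "tokenizer":
print(fit_ocr_to_budget(["STOP", "Main St", "Open 9-5", "No parking"],
                        tokenizer=str.split, max_tokens=5))
```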
- `answer en {question}`
  - Generated VQA on CC3M-35L, with questions in 35 languages but English answers.
  - Additionally, English-only object-centric questions on OpenImages, in the following styles (a small templating sketch follows this list):
    - listing: What objects are in the image?
    - presence: Is {thing} in the image?
    - multi-object presence: Which of {thing}, {thing}… are in the image?
    - counting: How many {thing}?
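A tiny sketch of how those four question styles could be instantiated from an image's object labels; the helper name is hypothetical:

```python
def object_questions(labels):
    """Instantiate the four object-centric question styles for one image."""
    qs = [("listing", "What objects are in the image?")]
    for thing in labels:
        qs.append(("presence", f"Is {thing} in the image?"))
        qs.append(("counting", f"How many {thing}?"))
    if len(labels) >= 2:
        qs.append(("multi-object presence",
                   f"Which of {', '.join(labels)} are in the image?"))
    return qs

print(object_questions(["dog", "ball"]))
```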
- `question {lang} {English answer}`
  - The reverse of the VQA objective: generated VQG on CC3M-35L, producing questions in 35 languages for a given English answer.
Classic CV tasks
- They add new tokens to Gemma's vocabulary to support PaliGemma's ability to perform more structured computer vision tasks. They add 1024 location tokens (`<loc0000>` to `<loc1023>`), which correspond to binned normalized image coordinates and are used in detection, referring expression, and grounded captioning tasks.
  - This is a different strategy from Molmo and PixMo, which for pointing expect output points as plain-text coordinates normalized between 0 and 100.
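A minimal sketch of the coordinate-to-token binning, assuming coordinates normalized to [0, 1] and 1024 uniform bins; the exact rounding PaliGemma uses may differ:

```python
NUM_LOC_BINS = 1024

def coord_to_loc_token(coord, num_bins=NUM_LOC_BINS):
    """Map a normalized coordinate in [0, 1] to a '<locXXXX>' token.
    Uniform binning; the exact rounding rule is an assumption."""
    idx = max(0, min(num_bins - 1, round(coord * (num_bins - 1))))
    return f"<loc{idx:04d}>"

def loc_token_to_coord(token, num_bins=NUM_LOC_BINS):
    """Map a '<locXXXX>' token back to a normalized coordinate."""
    return int(token[len("<loc"):-1]) / (num_bins - 1)

print(coord_to_loc_token(0.5))           # '<loc0512>' with 1024 bins
print(loc_token_to_coord("<loc1023>"))   # 1.0
```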
- They also add 128 VQVAE-tokenized single-object mask tokens (`<seg000>` to `<seg127>`) to support referring expression segmentation.
- `detect {thing} ; {thing} ; ...`
  - Multi-object detection similar to Pix2Seq, on generated open-world data obtained via pseudo-labeling as described in OWL-ViTv2.
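A hedged sketch of parsing such a detection output, assuming each detection is four `<loc>` tokens in `<ymin><xmin><ymax><xmax>` order followed by the class name, with detections separated by `;`:

```python
import re

# Assumed output grammar: "<locYMIN><locXMIN><locYMAX><locXMAX> label ; ...",
# matching the <ymin><xmin><ymax><xmax> order of the grounded-caption prefix.
DET_RE = re.compile(r"((?:<loc\d{4}>){4})\s*([^;]+)")

def parse_detections(text, num_bins=1024):
    boxes = []
    for locs, label in DET_RE.findall(text):
        ymin, xmin, ymax, xmax = [
            int(t) / (num_bins - 1) for t in re.findall(r"<loc(\d{4})>", locs)
        ]
        boxes.append({"box": (ymin, xmin, ymax, xmax), "label": label.strip()})
    return boxes

out = "<loc0102><loc0205><loc0800><loc0900> cat ; <loc0000><loc0000><loc0511><loc0511> dog"
print(parse_detections(out))
```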
- `caption <ymin><xmin><ymax><xmax>`
  - Grounded captioning of what is in the box, following LocCa. The box is indicated by the same location tokens as used in detection: normalized image coordinates binned to 1024 tokens.
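A small sketch of building this grounded-captioning prefix from a box in normalized coordinates, using the same binning assumption as above:

```python
def box_to_caption_prefix(ymin, xmin, ymax, xmax, num_bins=1024):
    """Build the grounded-captioning prefix from a box in normalized [0, 1]
    coordinates, in <ymin><xmin><ymax><xmax> order (binning as assumed above)."""
    locs = "".join(
        f"<loc{round(v * (num_bins - 1)):04d}>" for v in (ymin, xmin, ymax, xmax)
    )
    return f"caption {locs}"

print(box_to_caption_prefix(0.10, 0.25, 0.60, 0.90))
# caption <loc0102><loc0256><loc0614><loc0921>
```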
- `segment {thing} ; {thing} ; ...`
  - Multi-object instance segmentation as in PaLI-3, on generated open-world data similar to OWL-ViTv2 and SAM.
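A hedged sketch of parsing a segmentation output, assuming each instance is four `<loc>` box tokens plus a fixed run of `<seg>` codebook indices (16 in PaliGemma's released code) and the object name; turning those indices into a pixel mask requires the VQVAE decoder and is out of scope here:

```python
import re

# Assumed per-instance grammar: four <locXXXX> box tokens, then a run of
# <segXXX> VQVAE codebook indices, then the object name; instances separated by ';'.
INST_RE = re.compile(r"((?:<loc\d{4}>){4})((?:<seg\d{3}>)+)\s*([^;]*)")

def parse_segmentation(text, num_bins=1024):
    instances = []
    for locs, segs, label in INST_RE.findall(text):
        box = tuple(int(t) / (num_bins - 1) for t in re.findall(r"<loc(\d{4})>", locs))
        codes = [int(t) for t in re.findall(r"<seg(\d{3})>", segs)]  # VQVAE indices
        instances.append({"box": box, "seg_codes": codes, "label": label.strip()})
    return instances  # seg_codes -> pixel mask needs the mask VQVAE decoder

out = "<loc0010><loc0020><loc0500><loc0600>" + "<seg003>" * 16 + " cat"
print(parse_segmentation(out))
```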