Training objective
Captioning and QA objectives
- `caption {lang}`
  - main body
- `ocr`
  - Concatenation (in raster order) of all text on the image transcribed by a public OCR system. Random snippets of the OCR output may be dropped so the transcript fits the sequence length without biasing recognition towards the beginning of the raster order.
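A minimal sketch of that random-snippet dropping, assuming the OCR output arrives as a list of snippets in raster order; the function name and the toy tokenizer are illustrative, not the actual pipeline:

```python
import random

def fit_ocr_to_budget(snippets, tokenizer, max_tokens, rng=None):
    """Drop random OCR snippets (keeping the raster order of the rest) until
    the concatenated transcript fits max_tokens. Dropping random positions,
    rather than truncating at the end, avoids biasing recognition towards the
    beginning of the raster order."""
    rng = rng or random.Random(0)
    kept = list(snippets)
    while kept and len(tokenizer(" ".join(kept))) > max_tokens:
        kept.pop(rng.randrange(len(kept)))  # remove a snippet at a random position
    return " ".join(kept)

# Toy example with a whitespace "tokenizer":
print(fit_ocr_to_budget(["STOP", "Main St", "Open 9-5", "No parking"],
                        tokenizer=str.split, max_tokens=5))
```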
- `answer en {question}`
  - Generated VQA on CC3M-35L, with questions in 35 languages but English answers.
  - Additionally, English-only object-centric questions on OpenImages, in the following styles (a small templating sketch follows this list):
    - listing: What objects are in the image?
    - presence: Is {thing} in the image?
    - multi-object presence: Which of {thing}, {thing}… are in the image?
    - counting: How many {thing}?
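A tiny sketch of how those four question styles could be instantiated from an image's object labels; the helper name is hypothetical:

```python
def object_questions(labels):
    """Instantiate the four object-centric question styles for one image."""
    qs = [("listing", "What objects are in the image?")]
    for thing in labels:
        qs.append(("presence", f"Is {thing} in the image?"))
        qs.append(("counting", f"How many {thing}?"))
    if len(labels) >= 2:
        qs.append(("multi-object presence",
                   f"Which of {', '.join(labels)} are in the image?"))
    return qs

print(object_questions(["dog", "ball"]))
```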
- `question {lang} {English answer}`
  - The reverse of the VQA objective: generated VQG on CC3M-35L, producing questions in 35 languages for a given English answer.
Classic CV tasks
- They add new tokens to Gemma's vocabulary to support PaliGemma's ability to perform more structured computer vision tasks. They add 1024 location tokens (`<loc0000>` to `<loc1023>`), which correspond to binned normalized image coordinates and are used in detection, referring expression, and grounded captioning tasks.
  - This is a different strategy from Molmo and PixMo, which for pointing expect output points as plain-text coordinates normalized between 0 and 100.
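A minimal sketch of the coordinate-to-token binning, assuming coordinates normalized to [0, 1] and 1024 uniform bins; the exact rounding PaliGemma uses may differ:

```python
NUM_LOC_BINS = 1024

def coord_to_loc_token(coord, num_bins=NUM_LOC_BINS):
    """Map a normalized coordinate in [0, 1] to a '<locXXXX>' token.
    Uniform binning; the exact rounding rule is an assumption."""
    idx = max(0, min(num_bins - 1, round(coord * (num_bins - 1))))
    return f"<loc{idx:04d}>"

def loc_token_to_coord(token, num_bins=NUM_LOC_BINS):
    """Map a '<locXXXX>' token back to a normalized coordinate."""
    return int(token[len("<loc"):-1]) / (num_bins - 1)

print(coord_to_loc_token(0.5))           # '<loc0512>' with 1024 bins
print(loc_token_to_coord("<loc1023>"))   # 1.0
```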
- They also add 128 VQVAE-tokenized single-object mask tokens (`<seg000>` to `<seg127>`) to support referring expression segmentation.
- `detect {thing} ; {thing} ; ...`
  - Multi-object detection similar to Pix2Seq, on generated open-world data obtained via pseudo-labeling as described in OWL-ViTv2.
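A hedged sketch of parsing such a detection output, assuming each detection is four `<loc>` tokens in `<ymin><xmin><ymax><xmax>` order followed by the class name, with detections separated by `;`:

```python
import re

# Assumed output grammar: "<locYMIN><locXMIN><locYMAX><locXMAX> label ; ...",
# matching the <ymin><xmin><ymax><xmax> order of the grounded-caption prefix.
DET_RE = re.compile(r"((?:<loc\d{4}>){4})\s*([^;]+)")

def parse_detections(text, num_bins=1024):
    boxes = []
    for locs, label in DET_RE.findall(text):
        ymin, xmin, ymax, xmax = [
            int(t) / (num_bins - 1) for t in re.findall(r"<loc(\d{4})>", locs)
        ]
        boxes.append({"box": (ymin, xmin, ymax, xmax), "label": label.strip()})
    return boxes

out = "<loc0102><loc0205><loc0800><loc0900> cat ; <loc0000><loc0000><loc0511><loc0511> dog"
print(parse_detections(out))
```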
- `caption <ymin><xmin><ymax><xmax>`
  - Grounded captioning of what is in the box, following LocCa. The box is indicated by the same location tokens as used in detection: normalized image coordinates binned to 1024 tokens.
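A small sketch of building this grounded-captioning prefix from a box in normalized coordinates, using the same binning assumption as above:

```python
def box_to_caption_prefix(ymin, xmin, ymax, xmax, num_bins=1024):
    """Build the grounded-captioning prefix from a box in normalized [0, 1]
    coordinates, in <ymin><xmin><ymax><xmax> order (binning as assumed above)."""
    locs = "".join(
        f"<loc{round(v * (num_bins - 1)):04d}>" for v in (ymin, xmin, ymax, xmax)
    )
    return f"caption {locs}"

print(box_to_caption_prefix(0.10, 0.25, 0.60, 0.90))
# caption <loc0102><loc0256><loc0614><loc0921>
```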
- `segment {thing} ; {thing} ; ...`
  - Multi-object instance segmentation as in PaLI-3, on generated open-world data similar to OWL-ViTv2 and SAM.
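A hedged sketch of parsing a segmentation output, assuming each instance is four `<loc>` box tokens plus a fixed run of `<seg>` codebook indices (16 in PaliGemma's released code) and the object name; turning those indices into a pixel mask requires the VQVAE decoder and is out of scope here:

```python
import re

# Assumed per-instance grammar: four <locXXXX> box tokens, then a run of
# <segXXX> VQVAE codebook indices, then the object name; instances separated by ';'.
INST_RE = re.compile(r"((?:<loc\d{4}>){4})((?:<seg\d{3}>)+)\s*([^;]*)")

def parse_segmentation(text, num_bins=1024):
    instances = []
    for locs, segs, label in INST_RE.findall(text):
        box = tuple(int(t) / (num_bins - 1) for t in re.findall(r"<loc(\d{4})>", locs))
        codes = [int(t) for t in re.findall(r"<seg(\d{3})>", segs)]  # VQVAE indices
        instances.append({"box": box, "seg_codes": codes, "label": label.strip()})
    return instances  # seg_codes -> pixel mask needs the mask VQVAE decoder

out = "<loc0010><loc0020><loc0500><loc0600>" + "<seg003>" * 16 + " cat"
print(parse_segmentation(out))
```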