Bi-directional attention

  • Bi-directional (or bidirectional) attention refers to the ability of a model to attend to tokens in both directions (left-to-right and right-to-left) when computing representations
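
A minimal illustration (assuming PyTorch) of the two mask patterns as boolean matrices, where entry (i, j) being True means query position i may attend to key position j:

```python
import torch

seq_len = 5

# Bidirectional mask: every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

# Causal mask: position i may only attend to positions j <= i (lower-triangular).
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

print(bidirectional_mask.int())
print(causal_mask.int())
```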

Masked Language Modeling

  • Replace a percentage of the tokens in the sequence with a mask token.
    • The mask token carries no information except its position.
  • Compute the softmax over the logits at the masked positions and take the cross-entropy loss.
  • The main difference from the next-token prediction objective is that the logits at a given position predict that position's own token ID rather than the next token's ID (see the sketch below).
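
A minimal sketch of the objective (assuming PyTorch); the vocabulary size, mask token ID, masking rate, and the tiny embedding + linear stand-in for a bidirectional encoder are illustrative assumptions, not a specific model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative setup: vocabulary of 1000 tokens, [MASK] id 999, 15% masking rate.
vocab_size, mask_token_id, mask_prob = 1000, 999, 0.15
batch, seq_len, d_model = 8, 128, 64

tokens = torch.randint(0, vocab_size - 1, (batch, seq_len))   # original token IDs
is_masked = torch.rand(batch, seq_len) < mask_prob            # positions chosen for masking

# Replace the chosen tokens with [MASK]; it carries no content, only its position.
inputs = torch.where(is_masked, torch.full_like(tokens, mask_token_id), tokens)

# Stand-in for a bidirectional encoder producing (batch, seq_len, vocab_size) logits.
encoder = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
logits = encoder(inputs)

# Cross-entropy only over the masked positions; the target at each masked position
# is that position's *own* original token ID, not the next token's.
loss = F.cross_entropy(logits[is_masked], tokens[is_masked])
print(loss.item())
```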

Cross-attention

  • $Q$ comes from the decoder input, with dimension $n_{\text{dec}} \times d_k$
  • $K$ and $V$ come from the encoder output, with dimensions $n_{\text{enc}} \times d_k$ and $n_{\text{enc}} \times d_v$
  • The cross-attention map:
    • $A = \operatorname{softmax}\!\left(QK^\top / \sqrt{d_k}\right)$, with dimension $n_{\text{dec}} \times n_{\text{enc}}$
  • The output from cross-attention is:
    • $AV$, with dimension $n_{\text{dec}} \times d_v$
  • Thus, cross-attention mixes information from the encoder into the decoder input tokens and outputs updated decoder tokens (a minimal sketch follows below)
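
A minimal sketch of single-head cross-attention with the dimensions above (assuming PyTorch; the random inputs and projection matrices are illustrative stand-ins for learned parameters):

```python
import math
import torch
import torch.nn.functional as F

d_model, d_k, d_v = 64, 32, 32
n_dec, n_enc = 10, 20                      # decoder and encoder sequence lengths

decoder_input = torch.randn(n_dec, d_model)
encoder_output = torch.randn(n_enc, d_model)

# Illustrative projection matrices (learned parameters in a real model).
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_v)

Q = decoder_input @ W_q                    # (n_dec, d_k), from the decoder side
K = encoder_output @ W_k                   # (n_enc, d_k), from the encoder side
V = encoder_output @ W_v                   # (n_enc, d_v), from the encoder side

# Cross-attention map: each decoder position attends over all encoder positions.
A = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # (n_dec, n_enc)

# Output: encoder information mixed into each decoder position.
out = A @ V                                # (n_dec, d_v)
print(A.shape, out.shape)
```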

Prefix-LM

  • The Prefix-LM (or PrefixLM) is an architecture that enables bidirectional attention over the input prompt while maintaining causal (left-to-right) attention for the generated output
  • This is achieved by using different attention masks:
    • Bidirectional attention mask for the input prompt tokens
    • Causal attention mask for the output (generated) tokens (see the mask construction sketched below)
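
A minimal sketch of a Prefix-LM attention mask (assuming PyTorch; `prefix_len` and `total_len` are illustrative): rows are query positions, columns are key positions, and True means attention is allowed:

```python
import torch

prefix_len, total_len = 4, 10              # illustrative prompt length and full sequence length

# Start from a causal (lower-triangular) mask over the whole sequence.
mask = torch.tril(torch.ones(total_len, total_len)).bool()

# Allow full bidirectional attention among the prefix (prompt) tokens.
mask[:prefix_len, :prefix_len] = True

print(mask.int())
```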

UL2 training objective