Bi-directional attention
- Bi-directional (or bidirectional) attention refers to the ability of a model to attend to tokens in both directions (left-to-right and right-to-left) when computing representations
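A rough illustration (PyTorch, arbitrary sequence length): in mask terms, bidirectional attention allows every position to attend to every other position, while a causal mask restricts each position to itself and earlier positions.

```python
import torch

T = 5  # sequence length (arbitrary, for illustration)

# Bidirectional: every token may attend to every other token.
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)

# Causal (left-to-right): token i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())
```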
Masked Language Modeling
- Replace a percentage of tokens in the sequence with a mask token.
- The mask token carries no information except its position.
- Compute the softmax over the logits at the masked positions, and compute the cross-entropy loss against the original token IDs.
- The main difference from the next-token prediction objective is that the logits for a given token predict its own token ID rather than the next token ID (see the sketch below).
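A minimal sketch of this objective in PyTorch, assuming a BERT-like setup with illustrative vocabulary size, mask token ID, and a 15% masking rate; the logits are faked with random values just to show how the loss is computed:

```python
import torch
import torch.nn.functional as F

vocab_size, mask_token_id = 30522, 103              # illustrative, BERT-like values
input_ids = torch.randint(0, vocab_size, (2, 16))   # (batch, seq_len)

# Choose ~15% of positions to mask; keep the original IDs as labels there,
# and mark all other positions with -100 so cross_entropy ignores them.
mask = torch.rand(input_ids.shape) < 0.15
labels = torch.where(mask, input_ids, torch.full_like(input_ids, -100))
masked_input = torch.where(mask, torch.full_like(input_ids, mask_token_id), input_ids)

# A real encoder would map masked_input to per-token logits of shape
# (batch, seq_len, vocab_size); random logits stand in for it here.
logits = torch.randn(2, 16, vocab_size)

# Softmax + cross-entropy over the vocabulary, only at masked positions:
# each masked token's logits predict its *own* original ID, not the next token's.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss.item())
```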
Cross-attention
- $Q$ comes from the decoder input, with dimension $n_{\text{dec}} \times d$
- $K$ and $V$ come from the encoder input, with dimension $n_{\text{enc}} \times d$
- The cross-attention map is $A = \operatorname{softmax}\!\left(QK^{\top} / \sqrt{d}\right)$, with dimension $n_{\text{dec}} \times n_{\text{enc}}$
- The output from cross-attention is $O = AV$, with dimension $n_{\text{dec}} \times d$
Thus, cross-attention mixes information from the encoder into the decoder input tokens and outputs updated decoder tokens (see the sketch below)
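A minimal single-head sketch in PyTorch, with illustrative sizes and randomly initialized projection weights:

```python
import torch
import torch.nn.functional as F

d, n_dec, n_enc = 64, 7, 12           # illustrative sizes
dec_x = torch.randn(n_dec, d)         # decoder-side input tokens
enc_x = torch.randn(n_enc, d)         # encoder-side output tokens

W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

Q = dec_x @ W_q                       # (n_dec, d), from the decoder input
K = enc_x @ W_k                       # (n_enc, d), from the encoder input
V = enc_x @ W_v                       # (n_enc, d), from the encoder input

# Cross-attention map A = softmax(Q K^T / sqrt(d)), shape (n_dec, n_enc)
A = F.softmax(Q @ K.T / d**0.5, dim=-1)

# Output O = A V, shape (n_dec, d): one updated vector per decoder token,
# each mixing in information from the encoder tokens.
O = A @ V
print(A.shape, O.shape)
```

In a full encoder-decoder block this typically sits between the decoder's self-attention and its feed-forward layer and is multi-headed; the single-head form above just shows where $Q$, $K$, and $V$ come from.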
Prefix-LM
- The Prefix-LM (or PrefixLM) is an architecture that enables bidirectional attention over the input prompt while maintaining causal (left-to-right) attention for the generated output
- This is achieved by using different attention masks (see the sketch after this list):
- Bidirectional attention mask for the input prompt tokens
- Causal attention mask for the output tokens
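A minimal sketch of building such a combined mask in PyTorch, assuming a hypothetical split point `prefix_len` between the prompt and the generated tokens:

```python
import torch

T, prefix_len = 8, 3  # total length and prompt length (illustrative)

# Start from a causal mask: token i may attend to positions <= i.
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Allow full bidirectional attention among the prefix (prompt) tokens.
mask[:prefix_len, :prefix_len] = True

# Rows = query positions, columns = key positions; True = attention allowed.
# The top-left prefix block is fully visible in both directions,
# while the remaining rows stay strictly causal.
print(mask.int())
```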