Bi-directional attention

  • Bi-directional (or bidirectional) attention refers to the ability of a model to attend to tokens in both directions (left-to-right and right-to-left) when computing representations
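
A minimal illustration (assuming PyTorch) of the two mask patterns as boolean matrices, where entry (i, j) being True means query position i may attend to key position j:

```python
import torch

seq_len = 5

# Bidirectional mask: every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

# Causal mask: position i may only attend to positions j <= i (lower-triangular).
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

print(bidirectional_mask.int())
print(causal_mask.int())
```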

Masked Language Modeling

  • Replace a percentage of the tokens in the sequence with a mask token.
    • The mask token carries no information except its position.
  • Compute the softmax over the logits at the masked positions and take the cross-entropy loss.
  • The main difference from the next-token prediction objective is that the logits at a given position predict that position's own token ID rather than the next token's ID (see the sketch below).
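
A minimal sketch of the objective (assuming PyTorch); the vocabulary size, mask token ID, masking rate, and the tiny embedding + linear stand-in for a bidirectional encoder are illustrative assumptions, not a specific model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative setup: vocabulary of 1000 tokens, [MASK] id 999, 15% masking rate.
vocab_size, mask_token_id, mask_prob = 1000, 999, 0.15
batch, seq_len, d_model = 8, 128, 64

tokens = torch.randint(0, vocab_size - 1, (batch, seq_len))   # original token IDs
is_masked = torch.rand(batch, seq_len) < mask_prob            # positions chosen for masking

# Replace the chosen tokens with [MASK]; it carries no content, only its position.
inputs = torch.where(is_masked, torch.full_like(tokens, mask_token_id), tokens)

# Stand-in for a bidirectional encoder producing (batch, seq_len, vocab_size) logits.
encoder = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
logits = encoder(inputs)

# Cross-entropy only over the masked positions; the target at each masked position
# is that position's *own* original token ID, not the next token's.
loss = F.cross_entropy(logits[is_masked], tokens[is_masked])
print(loss.item())
```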

Cross-attention

  • $Q$ comes from the decoder input, with dimension $n_{\text{dec}} \times d_k$
  • $K$ and $V$ come from the encoder output, with dimensions $n_{\text{enc}} \times d_k$ and $n_{\text{enc}} \times d_v$
  • The cross-attention map:
    • $A = \operatorname{softmax}\!\left(QK^\top / \sqrt{d_k}\right)$, with dimension $n_{\text{dec}} \times n_{\text{enc}}$
  • The output from cross-attention is:
    • $AV$, with dimension $n_{\text{dec}} \times d_v$
  • Thus, cross-attention mixes information from the encoder into the decoder input tokens and outputs updated decoder tokens (a minimal sketch follows below)
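
A minimal sketch of single-head cross-attention with the dimensions above (assuming PyTorch; the random inputs and projection matrices are illustrative stand-ins for learned parameters):

```python
import math
import torch
import torch.nn.functional as F

d_model, d_k, d_v = 64, 32, 32
n_dec, n_enc = 10, 20                      # decoder and encoder sequence lengths

decoder_input = torch.randn(n_dec, d_model)
encoder_output = torch.randn(n_enc, d_model)

# Illustrative projection matrices (learned parameters in a real model).
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_v)

Q = decoder_input @ W_q                    # (n_dec, d_k), from the decoder side
K = encoder_output @ W_k                   # (n_enc, d_k), from the encoder side
V = encoder_output @ W_v                   # (n_enc, d_v), from the encoder side

# Cross-attention map: each decoder position attends over all encoder positions.
A = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # (n_dec, n_enc)

# Output: encoder information mixed into each decoder position.
out = A @ V                                # (n_dec, d_v)
print(A.shape, out.shape)
```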

Prefix-LM

  • The Prefix-LM (or PrefixLM) is an architecture that enables bidirectional attention over the input prompt while maintaining causal (left-to-right) attention for the generated output
  • This is achieved by using different attention masks:
    • Bidirectional attention mask for the input prompt tokens
    • Causal attention mask for the output (generated) tokens (see the mask construction sketched below)
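
A minimal sketch of a Prefix-LM attention mask (assuming PyTorch; `prefix_len` and `total_len` are illustrative): rows are query positions, columns are key positions, and True means attention is allowed:

```python
import torch

prefix_len, total_len = 4, 10              # illustrative prompt length and full sequence length

# Start from a causal (lower-triangular) mask over the whole sequence.
mask = torch.tril(torch.ones(total_len, total_len)).bool()

# Allow full bidirectional attention among the prefix (prompt) tokens.
mask[:prefix_len, :prefix_len] = True

print(mask.int())
```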

UL2 training objective