Vanilla Softmax

Linear Attention

  • While removing softmax alone doesn’t immediately reduce computational complexity, it enables a crucial mathematical property: linearity.
  • Linearity gives us associativity ⇒ chunkwise parallel form for prefill
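
To make the associativity concrete, here is a minimal NumPy sketch (assuming an identity feature map and no normalization; function names and the chunk size C are illustrative): the quadratic form masks the full L x L score matrix, while the chunkwise form carries a running K^T V state between chunks and only materializes scores within each chunk.

import numpy as np

def linear_attn_quadratic(Q, K, V):
    """Causal linear attention (no softmax): mask the full L x L score matrix."""
    L = Q.shape[0]
    mask = np.tril(np.ones((L, L)))              # causal mask
    return (Q @ K.T * mask) @ V

def linear_attn_chunkwise(Q, K, V, C=16):
    """Chunkwise parallel form: attend within each chunk, and to previous
    chunks through a running state S = K^T V (d_k x d_v)."""
    L = Q.shape[0]
    O = np.zeros_like(V)
    S = np.zeros((K.shape[1], V.shape[1]))
    for s in range(0, L, C):
        q, k, v = Q[s:s+C], K[s:s+C], V[s:s+C]
        O[s:s+C] = q @ S                          # inter-chunk: history via the state
        mask = np.tril(np.ones((len(q), len(q))))
        O[s:s+C] += (q @ k.T * mask) @ v          # intra-chunk: small quadratic part
        S += k.T @ v                              # fold this chunk into the state
    return O

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 64, 8))
assert np.allclose(linear_attn_quadratic(Q, K, V), linear_attn_chunkwise(Q, K, V))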

Linear Attention as a linear RNN

  • For inference, we can rearrange the computation into a recurrent form

  • Let’s define a state matrix $\mathbf{S}_t = \sum_{j=1}^{t} \mathbf{v}_j \mathbf{k}_j^\top \in \mathbb{R}^{d_v \times d_k}$

    • Then we have a clear recurrent relationship: $\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{v}_t \mathbf{k}_t^\top$, with output $\mathbf{o}_t = \mathbf{S}_t \mathbf{q}_t$
  • Linear attention is essentially a linear RNN with a matrix-valued state that accumulates key-value outer products, and keeps track of a compressed, fixed-size ($d_v \times d_k$) view of the history.

  • This is called a state size expansion from $d$ to $d \times d$ (each of the $d$ value channels gets its own $d$-dimensional key state)

  • By casting linear attention as an RNN,

    • we reduce the per-token inference cost from $O(Ld)$ (attending over all previous tokens) to $O(d^2)$
    • we reduce the space complexity from $O(Ld)$ (the KV cache) to $O(d^2)$ (the fixed-size state)
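
A minimal NumPy sketch of this recurrent view (illustrative names, single head and single sequence): the per-step cost and memory depend only on the head dimension, not on how many tokens have already been seen.

import numpy as np

def linear_attn_recurrent(Q, K, V):
    """Linear attention decoded as a linear RNN with a matrix-valued state."""
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))                     # fixed-size compressed history
    out = np.empty((L, d_v))
    for t in range(L):
        S = S + np.outer(V[t], K[t])             # S_t = S_{t-1} + v_t k_t^T
        out[t] = S @ Q[t]                        # o_t = S_t q_t
    return out

# Agrees with the quadratic causal form, but needs only O(d^2) state
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 32, 8))
ref = (np.tril(np.ones((32, 32))) * (Q @ K.T)) @ V
assert np.allclose(linear_attn_recurrent(Q, K, V), ref)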

Limitations of Linear Attention - defining retrieval error

  • The fixed-size state matrix in linear attention means it cannot perfectly preserve all historical information, making exact retrieval particularly challenging.

Retrieval error

  • Linear attention implements a key-value associative memory, which is the sum of outer products between keys and values: $\mathbf{S}_t = \sum_{j=1}^{t} \mathbf{v}_j \mathbf{k}_j^\top$

  • Assuming all keys are normalized to unit length, when we try to retrieve the value associated with a specific key $\mathbf{k}_i$ (with $i \le t$), the product between the state matrix and the key should ideally give us back $\mathbf{v}_i$ (because $\mathbf{k}_i^\top \mathbf{k}_i = 1$): $\mathbf{S}_t \mathbf{k}_i = \sum_{j=1}^{t} \mathbf{v}_j (\mathbf{k}_j^\top \mathbf{k}_i) = \mathbf{v}_i + \underbrace{\sum_{j \neq i} \mathbf{v}_j (\mathbf{k}_j^\top \mathbf{k}_i)}_{\text{retrieval error}}$

  • To minimize the retrieval error term, we need all the key vectors to be mutually orthogonal ($\mathbf{k}_j^\top \mathbf{k}_i = 0$ for $j \neq i$)

    • However, in a space of dimension $d$, we cannot define more than $d$ vectors that are all orthogonal to each other.
    • $d$ is the head dimension, and it has been shown that increasing the head dimension improves performance, as it gives more space for storing distinct key-value pairs
    • There is a tradeoff between increasing the head dimension and keeping chunked linear attention hardware friendly (we want tiles that are small enough to keep in registers)
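
A small numerical illustration of this retrieval error (NumPy only; the dimensions and random keys are made up for illustration): with at most $d$ orthonormal keys retrieval is exact, while packing many more unit-norm keys into the same $d$ dimensions makes the cross-terms leak into the retrieved value.

import numpy as np

rng = np.random.default_rng(0)
d = 16

def store(keys, values):
    """Key-value associative memory: S = sum_i v_i k_i^T."""
    return sum(np.outer(v, k) for k, v in zip(keys, values))

# d mutually orthogonal unit keys -> exact retrieval
keys, values = np.eye(d), rng.normal(size=(d, d))
S = store(keys, values)
print(np.abs(S @ keys[0] - values[0]).max())     # ~0

# 4d random unit keys in the same d dimensions -> large retrieval error
keys = rng.normal(size=(4 * d, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(4 * d, d))
S = store(keys, values)
print(np.abs(S @ keys[0] - values[0]).max())     # dominated by sum_{j!=0} (k_j . k_0) v_j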

Gating or forgetting as a mechanism to improve retrieval

  • In this key-value associative memory system, we can only add new key-value associations without the ability to erase existing information. As sequences grow longer, this leads to accumulating β€œretrieval errors” that degrade performance.
  • We can narrow the performance gap with standard attention in language modeling tasks by incorporating a forgetting mechanism
    • $\mathbf{S}_t = \mathbf{G}_t \odot \mathbf{S}_{t-1} + \mathbf{v}_t \mathbf{k}_t^\top$, where $\mathbf{G}_t \in (0,1)^{d_v \times d_k}$ is a (typically data-dependent) forget gate applied elementwise
    • There are multiple different structured parameterizations of $\mathbf{G}_t$ (for parameter efficiency), often with an outer-product structure (a minimal sketch follows this list).
      • a per-step decay applied uniformly to the fast-weight state (decaying fast weight)
      • $\mathbf{G}_t = \mathbf{1}\boldsymbol{\alpha}_t^\top$, a per-key-dimension decay, i.e. $\mathbf{S}_t = \mathbf{S}_{t-1}\,\mathrm{Diag}(\boldsymbol{\alpha}_t) + \mathbf{v}_t \mathbf{k}_t^\top$ (GLA)
      • a decay derived from the discretized SSM transition $\exp(\boldsymbol{\Delta}_t \mathbf{A})$ (Mamba)
      • $\mathbf{G}_t = \gamma_t \mathbf{1}\mathbf{1}^\top$, a single data-dependent scalar decay per head (Mamba 2)
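
Here is a minimal NumPy sketch of the gated recurrence, using the simplest of the parameterizations above (a single data-dependent scalar decay per step); names and gate values are illustrative.

import numpy as np

def gated_linear_attn_recurrent(Q, K, V, gamma):
    """Gated linear attention RNN: decay the old state, then write the new association."""
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.empty((L, d_v))
    for t in range(L):
        S = gamma[t] * S + np.outer(V[t], K[t])  # S_t = gamma_t * S_{t-1} + v_t k_t^T
        out[t] = S @ Q[t]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 32, 8))
gamma = 1.0 / (1.0 + np.exp(-rng.normal(size=32)))   # data-dependent gates in (0, 1)
out = gated_linear_attn_recurrent(Q, K, V, gamma)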

DeltaNet: Linear Attention with Delta Rule

What is the Delta Rule

  • Very simple error-correction learning principle
    • principle: adjust the model’s parameters based on the difference (delta) between what we want (target) and what we actually get (prediction).
  • Imagine teaching a child to aim at a target. If they shoot too far to the left, you’d tell them to adjust right; too far right, adjust left.
  • The size of the adjustment depends on
    • the delta size
    • the magnitude of the input itself (in the linear regression case)

Pseudocode

import numpy as np
 
def delta_rule(x, y, epochs=100, lr=0.1):
    """
    Simple delta rule implementation
    x: input features (N samples by D features)
    y: target values (N samples)
    """
    # Initialize weights
    w = np.zeros(x.shape[1])
    
    # Train
    for _ in range(epochs):
        for i in range(len(x)):
            # Forward pass
            pred = np.dot(x[i], w)
            
            # Compute error
            error = y[i] - pred
            
            # Update weights
            w += lr * error * x[i]
            
    return w
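
A quick usage check of the routine above, on an assumed noiseless linear target (the weight vector and data here are made up for illustration):

# Recover a known weight vector from noiseless data
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = x @ true_w
print(delta_rule(x, y))   # converges to approximately [2.0, -1.0, 0.5]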

DeltaNet

DeltaNet applies this error-correction principle to linear attention. Instead of simply accumulating key-value outer products, it updates its state based on prediction errors:

\begin{align*} \mathbf{S}_{t} &= \mathbf{S}_{t-1} - \beta_t(\mathbf{S}_{t-1} \mathbf{k}_t - \mathbf{v}_t)\mathbf{k}_t^\top \\ &= \mathbf{S}_{t-1} - \beta_t \mathbf{S}_{t-1} \mathbf{k}_t \mathbf{k}_t^\top + \beta_t \mathbf{v}_t \mathbf{k}_t^\top \end{align*}

  • The parallel to the Delta Rule becomes clear when we break down the components:

    • $\beta_t$ acts as the learning rate
    • $\mathbf{k}_t$ is the input data
    • $\mathbf{v}_t$ is the target
    • $\mathbf{S}_{t-1} \mathbf{k}_t$ is our current prediction (trying to retrieve the value for $\mathbf{k}_t$ from the state matrix)
  • Think of $\mathbf{v}_t^{\text{old}} = \mathbf{S}_{t-1} \mathbf{k}_t$ as retrieving the “old value” associated with the current key from memory. When we encounter a newly associated value for the same key, rather than blindly overwriting, we make a careful update: \begin{align*} \mathbf{v}_t^{\text{new}} &= (1-\beta_t) \mathbf{v}_t^{\text{old}} + \beta_t \mathbf{v}_t, \\ \mathbf{S}_t &= \mathbf{S}_{t-1} - \underbrace{\mathbf{v}_t^{\text{old}} \mathbf{k}_t^\top}_{\text{erase}} + \underbrace{\mathbf{v}_t^{\text{new}} \mathbf{k}_t^\top}_{\text{write}} \end{align*}

  • where $\mathbf{v}_t^{\text{new}}$ is a learned combination of the old and current values, controlled by a dynamic $\beta_t \in [0, 1]$: when $\beta_t = 0$, the memory content remains intact, and when $\beta_t = 1$, we completely replace the old associated value with the new one.
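
Putting this erase-and-write view into code, here is a minimal NumPy sketch of the DeltaNet recurrence (illustrative names; single head, single sequence):

import numpy as np

def deltanet_recurrent(Q, K, V, beta):
    """DeltaNet: delta-rule update of the matrix-valued state.

    S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T
    """
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.empty((L, d_v))
    for t in range(L):
        k, v = K[t], V[t]
        v_old = S @ k                                 # value currently stored under k_t
        v_new = beta[t] * v + (1 - beta[t]) * v_old   # interpolate old and new values
        S = S - np.outer(v_old, k) + np.outer(v_new, k)   # erase, then write
        out[t] = S @ Q[t]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 32, 8))
beta = 1.0 / (1.0 + np.exp(-rng.normal(size=32)))     # beta_t in (0, 1)
out = deltanet_recurrent(Q, K, V, beta)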

DeltaNet as the gradient update for MSE

  • DeltaNet’s update rule can be derived by sequentially minimizing the mean squared error (MSE) between the desired output and the predicted output at each time step using gradient descent: $\mathcal{L}_t(\mathbf{S}) = \frac{1}{2} \lVert \mathbf{S} \mathbf{k}_t - \mathbf{v}_t \rVert^2$
  • Applying gradient descent to minimize this MSE loss gives: $\mathbf{S}_t = \mathbf{S}_{t-1} - \eta_t \nabla_{\mathbf{S}} \mathcal{L}_t(\mathbf{S}_{t-1}) = \mathbf{S}_{t-1} - \eta_t (\mathbf{S}_{t-1} \mathbf{k}_t - \mathbf{v}_t) \mathbf{k}_t^\top$

  • When the learning rate $\eta_t$ is set to $\beta_t$, we recover DeltaNet
  • In contrast, vanilla linear attention employs a linear loss function: $\mathcal{L}_t(\mathbf{S}) = -\langle \mathbf{S} \mathbf{k}_t, \mathbf{v}_t \rangle$, whose gradient step $\mathbf{S}_t = \mathbf{S}_{t-1} + \eta_t \mathbf{v}_t \mathbf{k}_t^\top$ only accumulates new associations and never corrects old ones
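
As a sanity check on this derivation (NumPy only; the dimensions and values are made up), we can verify the analytic gradient of the per-step MSE loss by finite differences and confirm that one gradient step with learning rate $\beta_t$ is exactly the DeltaNet update:

import numpy as np

rng = np.random.default_rng(0)
d, beta = 6, 0.3
S = rng.normal(size=(d, d))
k, v = rng.normal(size=d), rng.normal(size=d)

loss = lambda S_: 0.5 * np.sum((S_ @ k - v) ** 2)     # L_t(S) = 0.5 * ||S k_t - v_t||^2

# Finite-difference gradient of the loss w.r.t. every entry of S
eps, num_grad = 1e-6, np.zeros_like(S)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(S)
        E[i, j] = eps
        num_grad[i, j] = (loss(S + E) - loss(S - E)) / (2 * eps)

analytic_grad = np.outer(S @ k - v, k)                # (S k_t - v_t) k_t^T
assert np.allclose(num_grad, analytic_grad, atol=1e-5)

S_next = S - beta * analytic_grad                     # one gradient step = DeltaNet update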