• We can horizontally partition the computation of a single tensor operation across multiple devices; this is called tensor parallelism (TP).
  • The derivations below split each matrix in two, but in general the matrices are split equally across the GPUs within a given node.

Parallelizing a GEMM (General Matrix multiply)

  • $Y = XA$, where the input $X$ is $n \times d_{model}$ ($n$ being the number of tokens) and $A$ is $d_{model} \times d_{hidden}$

  • First option (parallelize and aggregate; each “thread” computes a matrix of the same dimension as $Y$) (more memory efficient, but requires an all_reduce at the end):

    • Split $A$ along its rows and the input $X$ along its columns:

      • $X = [X_1, X_2]$, $A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$
      • $X_i$ is $n \times d_{model}/2$ and $A_i$ is $d_{model}/2 \times d_{hidden}$
      • Then $Y = X_1 A_1 + X_2 A_2$ (it’s true, I checked)
    • Intuition:

      • A matrix-matrix multiply can be seen as multiple matrix-vector multiplies concatenated (one per column of $A$).
      • In such a matrix-vector multiply, each element of a given column of $A$ is responsible for picking out its corresponding column of $X$, multiplying it by itself (a scalar), and the results are summed to obtain the new column of $Y$.
      • Here, we parallelize the computation over these columns of $X$ (the inner dimension) and aggregate (sum) at the end.

  • Second option (parallelize and concatenate; each “thread” produces a slice of $Y$) (less memory efficient, but requires no synchronization):

    • Split $A$ along its columns: $A = [A_1, A_2]$, then $Y = [XA_1, XA_2]$; note that each GPU needs the full input $X$ here. (Both options are checked numerically in the sketch below.)
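As a sanity check of both partitionings, here is a minimal single-process sketch (assuming PyTorch; the shapes and the two-way split are illustrative, and the cross-GPU communication is simulated in-process):

```python
import torch

# Shapes are illustrative: X is (n, d_model), A is (d_model, d_hidden).
n, d_model, d_hidden = 4, 8, 6
X = torch.randn(n, d_model)
A = torch.randn(d_model, d_hidden)
Y = X @ A  # serial reference

# Option 1 (row-parallel A): split A along its rows, X along its columns.
X1, X2 = X.chunk(2, dim=1)           # each (n, d_model/2)
A1, A2 = A.chunk(2, dim=0)           # each (d_model/2, d_hidden)
Y_row = X1 @ A1 + X2 @ A2            # the "+" stands in for the all_reduce

# Option 2 (column-parallel A): split A along its columns, keep X whole.
Ac1, Ac2 = A.chunk(2, dim=1)         # each (d_model, d_hidden/2)
Y_col = torch.cat([X @ Ac1, X @ Ac2], dim=1)  # concatenate the slices of Y

assert torch.allclose(Y, Y_row, atol=1e-5)
assert torch.allclose(Y, Y_col, atol=1e-5)
```

In a real tensor-parallel setup, the two partial products would live on different GPUs and the sum would be a torch.distributed all_reduce; the concatenation in option 2 only happens if the full $Y$ is ever needed on one device.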

MLP

  • $X$ is $n \times d_{model}$, $A$ is $d_{model} \times d_{hidden}$, $B$ is $d_{hidden} \times d_{model}$

  • The usual two-layer MLP block is:

    • $Y = \mathrm{GeLU}(XA)$, $Z = YB$
    • i.e. one GEMM (general matrix multiply)
    • one GeLU
    • one GEMM
  • Parallelizing the first GEMM ($XA$)

    • First option (parallelize and aggregate; each “thread” computes a matrix of the same dimension as $XA$):

      • GeLU is nonlinear, so $\mathrm{GeLU}(X_1 A_1 + X_2 A_2) \neq \mathrm{GeLU}(X_1 A_1) + \mathrm{GeLU}(X_2 A_2)$
        • Thus we would need to synchronize (all_reduce) before the GeLU function
    • Second option (parallelize and concatenate; each “thread” produces a slice of $Y$):

      • Split $A$ along its columns: $A = [A_1, A_2]$, so each GPU computes $Y_i = \mathrm{GeLU}(XA_i)$
    • This partitioning allows the GeLU nonlinearity to be independently applied to the output of each partitioned GEMM

    • This is advantageous as it removes a synchronization point

  • Parallelizing the second GEMM ($YB$)

    • Given we receive $Y = [Y_1, Y_2] = [\mathrm{GeLU}(XA_1), \mathrm{GeLU}(XA_2)]$, split by columns, we split $B$ by its rows: $B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}$
    • Compute $Z_i = Y_i B_i$ locally on each GPU
    • Synchronization
      • $Z$ = all_reduce($Y_i B_i$), i.e. the partial results are summed
      • Called $g$ in the diagram (the end-to-end parallel MLP is sketched in code after this section)
  • Diagram

    • $g$ is an all-reduce in the forward pass, where the partial matrices are aggregated by summing, and an identity (or splitting) in the backward pass
    • $f$ is an identity (or splitting) in the forward pass, and an all-reduce in the backward pass
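Putting the pieces together, here is a minimal single-process sketch of the tensor-parallel MLP (assuming PyTorch; two ranks are simulated on one device, the variable names are illustrative, and the final sum plays the role of the $g$ all-reduce):

```python
import torch
import torch.nn.functional as F

n, d_model, d_hidden = 4, 8, 16
X = torch.randn(n, d_model)
A = torch.randn(d_model, d_hidden)
B = torch.randn(d_hidden, d_model)

# Serial reference: Z = GeLU(X A) B
Z_ref = F.gelu(X @ A) @ B

# "f": identity in the forward pass -> each simulated rank sees the full X.
A1, A2 = A.chunk(2, dim=1)   # column split of A (so GeLU stays local)
B1, B2 = B.chunk(2, dim=0)   # row split of B

# Local work on each rank, with no communication in between:
Y1, Y2 = F.gelu(X @ A1), F.gelu(X @ A2)   # GeLU applied independently per shard
Z1, Z2 = Y1 @ B1, Y2 @ B2                 # each rank holds a partial Z

# "g": all-reduce (sum) in the forward pass.
Z = Z1 + Z2
assert torch.allclose(Z, Z_ref, atol=1e-4)
```

The only communication in the forward pass is the final sum; in a real implementation it would be a torch.distributed all_reduce across the tensor-parallel group.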

Self-Attention

  • They exploit the inherent parallelism in the multi-head attention operation.

    • Partition the GEMMs associated with the key (K), query (Q), and value (V) projections in a column-parallel fashion, such that the matrix multiply corresponding to each attention head is done locally on one GPU

    • This allows us to split the per-attention-head parameters and workload across the GPUs, and doesn’t require any immediate communication to complete the self-attention.

    • The subsequent GEMM from the output linear layer (after self-attention) is parallelized along its rows (i.e. row-parallel, like $B$ in the MLP), since by design it receives the self-attention output split by columns (requiring no communication)

    • Finally, we apply $g$, the all_reduce, to obtain the result (before dropout); the full head-parallel flow is sketched in code after this section

  • Diagram
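Below is a minimal single-process sketch of the head-parallel attention flow (assuming PyTorch; two ranks are simulated by giving each half of the heads, the `attention` helper and all names are illustrative, and the final sum stands in for the $g$ all-reduce):

```python
import torch
import torch.nn.functional as F

s, d_model, n_heads = 5, 16, 4
X = torch.randn(s, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
Wo = torch.randn(d_model, d_model)   # output projection

def attention(x, wq, wk, wv, heads):
    """Multi-head attention over whichever heads are packed into wq/wk/wv."""
    s_len, dh = x.shape[0], wq.shape[1] // heads
    q = (x @ wq).view(s_len, heads, dh).transpose(0, 1)    # (heads, s, dh)
    k = (x @ wk).view(s_len, heads, dh).transpose(0, 1)
    v = (x @ wv).view(s_len, heads, dh).transpose(0, 1)
    scores = (q @ k.transpose(1, 2)) / dh ** 0.5            # (heads, s, s)
    out = F.softmax(scores, dim=-1) @ v                     # (heads, s, dh)
    return out.transpose(0, 1).reshape(s_len, heads * dh)   # concat heads

# Serial reference: all heads, then the output projection.
Z_ref = attention(X, Wq, Wk, Wv, n_heads) @ Wo

# Column-split the Q/K/V projections (2 heads per rank), row-split Wo.
Wq1, Wq2 = Wq.chunk(2, dim=1)
Wk1, Wk2 = Wk.chunk(2, dim=1)
Wv1, Wv2 = Wv.chunk(2, dim=1)
Wo1, Wo2 = Wo.chunk(2, dim=0)

# Each rank runs its own heads locally and projects with its slice of Wo.
Z1 = attention(X, Wq1, Wk1, Wv1, n_heads // 2) @ Wo1
Z2 = attention(X, Wq2, Wk2, Wv2, n_heads // 2) @ Wo2

# "g": the all-reduce (sum) recovers the serial result.
assert torch.allclose(Z1 + Z2, Z_ref, atol=1e-4)
```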