• We can horizontally partition the computation of a single tensor operation across multiple devices; this is called Tensor Parallelism (TP).

Summary

  • Normal GEMM is (batch x in_features) x (in_features x out_features)

    • output is (batch x out_features)
  • column-linear

    • weight matrix is sharded by the columns (i.e. by the out_features dimension)
    • input is not sharded
    • GEMM on rank i: (batch x in_features) x (in_features x out_features/world_size)
    • output is (batch x out_features/world_size)
    • output is sharded by the out_features dimension but correct
      • no communication needed (except a gather if we're at the end)
  • row-linear

    • weight matrix is sharded by the rows (in_features dimension)
    • input is sharded by its columns (i.e. by the in_features dimension)
    • GEMM on rank i: (batch x in_features/world_size) x (in_features/world_size x out_features)
    • output is (batch x out_features)
    • we need to all reduce to obtain the correct output
      • `output = all_reduce(output, tp_group)`
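  • A minimal PyTorch sketch of the two schemes (illustration only, not any library's exact API; assumes `torch.distributed` is already initialized, `tp_group` is the tensor-parallel process group, and biases are omitted):

```python
import torch
import torch.distributed as dist


class ColumnLinear(torch.nn.Module):
    # Hypothetical column-linear layer: weight sharded along out_features.
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert out_features % world_size == 0
        # Each rank holds an (in_features, out_features / world_size) shard.
        self.weight = torch.nn.Parameter(torch.empty(in_features, out_features // world_size))

    def forward(self, x):
        # x: (batch, in_features), replicated on every rank.
        # Output: (batch, out_features / world_size) -- sharded but correct,
        # so no communication is needed (only a gather if this is the last layer).
        return x @ self.weight


class RowLinear(torch.nn.Module):
    # Hypothetical row-linear layer: weight sharded along in_features.
    def __init__(self, in_features, out_features, world_size, tp_group=None):
        super().__init__()
        assert in_features % world_size == 0
        # Each rank holds an (in_features / world_size, out_features) shard.
        self.weight = torch.nn.Parameter(torch.empty(in_features // world_size, out_features))
        self.tp_group = tp_group

    def forward(self, x_shard):
        # x_shard: (batch, in_features / world_size), input sharded along in_features.
        output = x_shard @ self.weight                 # partial (batch, out_features)
        dist.all_reduce(output, group=self.tp_group)   # sum the partial products across ranks
        return output
```

  • Chaining a column-linear into a row-linear (as in the MLP below) keeps the intermediate sharded, so only one all_reduce is needed per block.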

Derivations

  • We write the derivations by splitting in two; in general, the matrices are split equally across the GPUs in a given node (i.e. into world_size shards)

Parallelizing a GEMM (General Matrix multiply)

  • $Y = XA$; $X$ is $\text{batch} \times d_{model}$, $A$ is $d_{model} \times d_{hidden}$

  • First option (parallelize and aggregate, each "thread" computes a matrix of the same dimension as $Y$) (more memory efficient, but requires an all_reduce at the end):

    • Split $A$ along its rows and the input $X$ along its columns:
      • $X = [X_1, X_2]$, $A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$
      • $X_1$ is $\text{batch} \times d_{model}/2$ and $A_1$ is $d_{model}/2 \times d_{hidden}$
      • Then $Y = XA = X_1A_1 + X_2A_2$ (it's true, I checked)
    • Intuition:
      • A matrix-matrix mul can be seen as multiple matrix-vector muls concatenated (along the columns of $A$).
      • In such a matrix-vector mul, each element of a given column of $A$ is responsible for picking out its corresponding column in $X$, multiplying that column by itself, and the results are summed to obtain the new column in $Y$.
      • Here, we parallelize the computation over these columns (of $X$ / rows of $A$) and aggregate at the end (a numerical check is sketched after this section).

  • Second option (parallelize and concatenate, each "thread" produces a slice of $Y$) (less memory efficient but no synchronization):

    • Split $A$ along its columns: $A = [A_1, A_2]$
    • Each "thread" keeps the full input $X$ (hence less memory efficient) and computes $Y_i = XA_i$
    • $Y = XA = [XA_1, XA_2]$, i.e. the slices are simply concatenated along the columns
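  • A quick single-process check of both options (plain PyTorch; the "ranks" here are just tensor slices, and the sizes are arbitrary):

```python
import torch

batch, d_model, d_hidden = 4, 8, 16
X = torch.randn(batch, d_model)
A = torch.randn(d_model, d_hidden)
Y = X @ A

# Option 1: split A by rows and X by columns, then sum the partial products.
X1, X2 = X[:, : d_model // 2], X[:, d_model // 2 :]
A1, A2 = A[: d_model // 2, :], A[d_model // 2 :, :]
assert torch.allclose(Y, X1 @ A1 + X2 @ A2, atol=1e-5)

# Option 2: split A by columns; each slice of Y is computed independently from the full X.
A1c, A2c = A[:, : d_hidden // 2], A[:, d_hidden // 2 :]
assert torch.allclose(Y, torch.cat([X @ A1c, X @ A2c], dim=1), atol=1e-5)
```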

MLP

  • $X$ is $\text{batch} \times d_{model}$, $A$ is $d_{model} \times d_{hidden}$, $B$ is $d_{hidden} \times d_{model}$

  • Usual two-layer MLP block is

    • $Y = \text{GeLU}(XA)$, $Z = YB$
    • i.e. one GEMM (general matrix multiply)
    • one GeLU
    • one GEMM
  • Parallelizing the first GEMM, $Y = \text{GeLU}(XA)$

    • First option (parallelize and aggregate, each "thread" computes a matrix of the same dimension as $Y$):

      • GeLU is nonlinear, so $\text{GeLU}(X_1A_1 + X_2A_2) \neq \text{GeLU}(X_1A_1) + \text{GeLU}(X_2A_2)$
        • Thus we need to synchronize (sum the partial products) before the GeLU function
    • Second option (parallelize and concatenate, each "thread" produces a slice of $Y$):

      • Split $A$ along its columns: $A = [A_1, A_2]$, so $Y = [Y_1, Y_2] = [\text{GeLU}(XA_1), \text{GeLU}(XA_2)]$
    • This partitioning allows the GeLU nonlinearity to be independently applied to the output of each partitioned GEMM

    • This is advantageous as it removes a synchronization point

  • Parallelizing the second GEMM, $Z = YB$

    • Given we receive $Y = \text{GeLU}(XA) = [Y_1, Y_2]$, split by the columns, we split $B$ by its rows: $B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}$
    • Compute $Y_iB_i$ on each rank
    • Synchronization
      • Z = all_reduce(Y_iB_i), i.e. the partial products are summed across ranks
      • Called $g$ in the diagram
  • Diagram

    • $g$ is an all-reduce in the forward pass, where the matrices are aggregated by summing, and an identity (or splitting) in the backward pass
    • $f$ is an identity (or splitting) in the forward pass, and an all-reduce in the backward pass
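  • A sketch of one rank's MLP forward pass under this scheme (illustration; assumes `torch.distributed` is initialized, `A_i` is this rank's (d_model x d_hidden/world_size) column shard and `B_i` its (d_hidden/world_size x d_model) row shard; in real implementations $f$ and $g$ are custom autograd functions so the backward passes behave as described above):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def tp_mlp_forward(X, A_i, B_i, tp_group=None):
    # f: identity in the forward pass (X is already replicated on every rank);
    #    its backward would all-reduce the input gradients.
    Y_i = F.gelu(X @ A_i)   # column-parallel GEMM; GeLU applied shard-locally
    Z = Y_i @ B_i           # row-parallel GEMM; each rank holds a partial sum of Z
    # g: all-reduce (sum) in the forward pass, identity in the backward pass.
    dist.all_reduce(Z, group=tp_group)
    return Z                # (batch, d_model), identical on every rank (before dropout)
```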

Self-Attention

  • They exploit the inherent parallelism in the multi-head attention operation.

    • partitioning the GEMMs associated with the key (K), query (Q), and value (V) projections in a column-parallel fashion, such that the matrix multiply corresponding to each attention head is done locally on one GPU

    • This allows us to split the per-attention-head parameters and workload across the GPUs, and doesn't require any immediate communication to complete the self-attention.

    • The subsequent GEMM from the output linear layer (after self-attention) is parallelized along its rows (i.e. row-linear, like $B$ in the MLP), given it receives the self-attention output split by columns by design (requiring no communication)

    • Finally, we apply $g$, the all_reduce, to obtain the result (before dropout)

  • Diagram
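  • A sketch of one rank's self-attention forward pass under this partitioning (illustration; assumes `num_heads` is divisible by the TP world size, `W_qkv_i` is this rank's column shard of a fused QKV projection laid out per head as [q | k | v], and `W_o_i` is the matching row shard of the output projection):

```python
import math
import torch
import torch.distributed as dist


def tp_self_attention(X, W_qkv_i, W_o_i, heads_per_rank, d_head, tp_group=None):
    # X: (batch, seq, d_model), replicated on every rank.
    batch, seq, _ = X.shape
    # Column-parallel QKV projection: this rank computes Q, K, V only for its own heads,
    # so the per-head attention below needs no communication.
    qkv = X @ W_qkv_i                                   # (batch, seq, 3 * heads_per_rank * d_head)
    qkv = qkv.view(batch, seq, heads_per_rank, 3 * d_head)
    q, k, v = qkv.split(d_head, dim=-1)
    scores = torch.einsum("bshd,bthd->bhst", q, k) / math.sqrt(d_head)
    attn = torch.softmax(scores, dim=-1)
    ctx = torch.einsum("bhst,bthd->bshd", attn, v)
    ctx = ctx.reshape(batch, seq, heads_per_rank * d_head)
    # Row-parallel output projection, then g: all-reduce (sum) in the forward pass.
    Z = ctx @ W_o_i
    dist.all_reduce(Z, group=tp_group)
    return Z                                            # (batch, seq, d_model), before dropout
```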