 We can horizontally partition the computation of a single tensor operation across multiple devices; this is called tensor parallelism (TP).
 We write the derivations with a two-way split; in general, the matrices are split equally across the GPUs in a given node
Parallelizing a GEMM (General Matrix Multiply)

$X$ is $B \times d_{model}$, $A$ is $d_{model} \times d_{hidden}$

 First option (parallelize and aggregate; each "thread" computes a matrix of the same dimensions as $XA$; more memory efficient, but requires an all_reduce at the end): split $A$ along its rows and the input $X$ along its columns:
 $X=[X_{1},X_{2}]$, $A=[A_{1},A_{2}]^{T}$
 $X_{i}$ is $B \times d_{model}/2$ and $A_{i}$ is $d_{model}/2 \times d_{hidden}$
 Then $XA=X_{1}A_{1}+X_{2}A_{2}$ (it's true, I checked)
 Intuition: a matrix-matrix multiply can be seen as multiple matrix-vector multiplies concatenated (along the columns of $A$). In such a matrix-vector multiply, each element of a given column of $A$ picks out its corresponding column of $X$, multiplies it, and the results are summed to obtain the new column of $XA$. Here, we parallelize the computation over these chunks and aggregate at the end.

 Second option (parallelize and concatenate; each "thread" produces a slice of $XA$; less memory efficient, but no synchronization):
 Split $A$ along its columns $A=[A_{1},A_{2}]$
 $[Y_{1},Y_{2}]=[XA_{1},XA_{2}]$
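A minimal NumPy sketch of both partitionings (shapes and variable names are illustrative, not from the source), verifying that each reconstructs $XA$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_model, d_hidden = 4, 6, 8
X = rng.standard_normal((B, d_model))
A = rng.standard_normal((d_model, d_hidden))

# Option 1: split A by rows and X by columns; partial products are summed
# (the sum plays the role of the all_reduce across devices).
X1, X2 = X[:, : d_model // 2], X[:, d_model // 2 :]
A1, A2 = A[: d_model // 2, :], A[d_model // 2 :, :]
Y_opt1 = X1 @ A1 + X2 @ A2

# Option 2: split A by columns; each device holds a column slice of the output,
# and the full result is just the concatenation (no reduction needed).
A1c, A2c = A[:, : d_hidden // 2], A[:, d_hidden // 2 :]
Y_opt2 = np.concatenate([X @ A1c, X @ A2c], axis=1)

assert np.allclose(Y_opt1, X @ A)
assert np.allclose(Y_opt2, X @ A)
```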
MLP

$X$ is $B \times d_{model}$, $A$ is $d_{model} \times d_{hidden}$, $B$ is $d_{hidden} \times d_{model}$

The usual two-layer MLP block is $f(X)=GeLU(XA)B$
 Y = $GeLU(XA)$
 i.e. one GEMM (general matrix multiply)
 one GeLU
 one GEMM

Parallelizing the $Y=GeLU(XA)$

First option (parallelize and aggregate; each "thread" computes a matrix of the same dimensions as $XA$):

 GeLU is non-linear, so $GeLU(X_{1}A_{1}+X_{2}A_{2}) \neq GeLU(X_{1}A_{1})+GeLU(X_{2}A_{2})$
 Thus we would need to synchronize before the GeLU function
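A quick numeric check of this non-additivity (the tanh approximation of GeLU and the random matrices are illustrative assumptions, standing in for $X_1A_1$ and $X_2A_2$):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (assumed close enough to show non-linearity)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(1)
P1 = rng.standard_normal((2, 3))  # stands in for X1 @ A1
P2 = rng.standard_normal((2, 3))  # stands in for X2 @ A2

lhs = gelu(P1 + P2)          # GeLU applied after the reduction
rhs = gelu(P1) + gelu(P2)    # GeLU applied to each partial product
assert not np.allclose(lhs, rhs)  # GeLU does not distribute over the sum
```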

Second option (parallelize and concatenate; each "thread" produces a slice of $XA$):
 Split $A$ along its columns $A=[A_{1},A_{2}]$

This partitioning allows the GeLU non-linearity to be applied independently to the output of each partitioned GEMM: $[Y_{1},Y_{2}]=[GeLU(XA_{1}),GeLU(XA_{2})]$

This is advantageous as it removes a synchronization point


Parallelizing $Z=YB$
 Given we receive $Y=[Y_{1},Y_{2}]$, split by columns, we split $B=[B_{1},B_{2}]^{T}$ by its rows
 Compute $Z_{i}=Y_{i}B_{i}$
 Synchronization: $Z = \text{all\_reduce}(Y_{i}B_{i})$, i.e. the partial products $Y_{i}B_{i}$ are summed across devices (called $g$ in the diagram)
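Putting the whole MLP block together, here is a NumPy sketch (shapes, names, and the tanh GeLU approximation are illustrative assumptions): a column-parallel first GEMM with a local GeLU, then a row-parallel second GEMM whose partial products are summed, simulating the all_reduce $g$:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (assumption, for illustration)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(2)
batch, d_model, d_hidden = 4, 6, 8
X = rng.standard_normal((batch, d_model))
A = rng.standard_normal((d_model, d_hidden))
Bw = rng.standard_normal((d_hidden, d_model))  # the second weight matrix B

halves = (slice(0, d_hidden // 2), slice(d_hidden // 2, None))
# Column-parallel first GEMM: each "device" applies GeLU locally, no sync needed.
Y_parts = [gelu(X @ A[:, s]) for s in halves]
# Row-parallel second GEMM: each device multiplies its Y slice by its rows of B.
Z_parts = [Y_parts[i] @ Bw[s, :] for i, s in enumerate(halves)]
Z = Z_parts[0] + Z_parts[1]  # the sum simulates the all_reduce g

assert np.allclose(Z, gelu(X @ A) @ Bw)  # matches the unpartitioned MLP
```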

Diagram
 $g$ is an all-reduce in the forward pass, where the partial matrices are summed, and an identity (or split) in the backward pass
 $f$ is an identity (or split) in the forward pass, and an all-reduce in the backward pass
Self-Attention

They exploit the inherent parallelism of the multi-head attention operation.

The GEMMs associated with the key ($K$), query ($Q$), and value ($V$) projections are partitioned in a column-parallel fashion, such that the matrix multiply corresponding to each attention head is done locally on one GPU.

This allows us to split the per-attention-head parameters and workload across the GPUs, and doesn't require any immediate communication to complete the self-attention.

The subsequent GEMM $Z=YB$ from the output linear layer (after self-attention) is parallelized along its rows (i.e. $B=[B_{1},B_{2}]^{T}$): it receives the self-attention output $Y=\text{SelfAttention}(X)$ already split by columns, by design, so it requires no communication.

Finally, we apply $g$, the all_reduce, to obtain the result (before dropout).
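The head-parallel scheme can be sketched in NumPy as follows (a toy single-sequence setting with hypothetical shapes and weight names; 2 heads, one per simulated "device"): each device holds a column slice of $W_Q, W_K, W_V$ and the matching row slice of the output projection, computes its head's attention locally, and the partial outputs are summed to simulate $g$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
T, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

# Each "device" owns one head: column slices of Wq/Wk/Wv, a row slice of Wo.
Z_parts = []
for h in range(n_heads):
    cols = slice(h * d_head, (h + 1) * d_head)
    Q, K, V = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
    Yh = softmax(Q @ K.T / np.sqrt(d_head)) @ V  # local attention, no comms
    Z_parts.append(Yh @ Wo[cols, :])             # row-parallel output GEMM
Z = sum(Z_parts)  # the sum simulates the all_reduce g

# Reference: unpartitioned multi-head attention followed by Wo.
head_slices = [slice(h * d_head, (h + 1) * d_head) for h in range(n_heads)]
Q, K, V = X @ Wq, X @ Wk, X @ Wv
Y_ref = np.concatenate(
    [softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head)) @ V[:, s]
     for s in head_slices], axis=1)
assert np.allclose(Z, Y_ref @ Wo)
```

Because the concatenation of head outputs times $W_O$ equals the sum of each head output times its row slice of $W_O$, the only communication in the whole block is the final all_reduce.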


Diagram