
Tensor parallelism parallelizes the parts of the transformer layer that take the most time during training (the attention and MLP blocks), and as a result it is computationally efficient.

However, it leaves the layernorms, as well as the dropouts after the attention and MLP blocks, intact; as a result, these operations are replicated across the tensor-parallel group.

These elements do not require much compute, but they demand a considerable amount of activation memory. Quantitatively, the $10sbh$ term of the above equation comes from these replicated operations, which is why it is not divided by the tensor-parallel size $t$.
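As a rough illustration of what this replication costs, a minimal sketch; the sizes below are assumptions chosen for illustration, not values from the text:

```python
# Rough illustration of the replicated 10*s*b*h activation term.
# s, b, h, t are assumed example values, not from the source.
s, b, h, t = 2048, 4, 4096, 8   # sequence length, micro-batch size, hidden size, TP degree

replicated = 10 * s * b * h      # held in full on every tensor-parallel rank
sharded = replicated // t        # after partitioning along the sequence dimension

print(f"per-rank: {replicated:,} -> {sharded:,} activation elements")
```

With these example sizes, each rank goes from holding over 335 million replicated activation elements to roughly 42 million once the term is divided by $t$.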

These operations are independent along the sequence dimension. This property allows us to partition these regions along the sequence dimension $s$, which reduces the memory required for their activations.
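To see why sharding the sequence dimension is safe here, a minimal sketch: LayerNorm normalizes each sequence position over the hidden dimension independently, so computing it on sequence shards gives the same result as computing it on the full tensor. The shapes and the simple `layernorm` helper are illustrative assumptions, not the actual implementation:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Normalize each row (one sequence position) over the hidden dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))     # (sequence s, hidden h)

full = layernorm(x)                  # computed on the whole sequence
# Computed independently on two sequence shards, as sequence parallelism would:
shards = [layernorm(chunk) for chunk in np.split(x, 2, axis=0)]

assert np.allclose(full, np.concatenate(shards, axis=0))
```

The same argument applies to dropout, which is pointwise and therefore also independent along $s$.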
This extra level of parallelism introduces new communication collectives that act as converters between the sequence-parallel and tensor-parallel regions. With a careful choice of communication primitives, the communication bandwidth used for tensor parallelism and for tensor together with sequence parallelism is the same.
Combining smartly with Tensor Parallelism

We need to fuse the communication primitives used in tensor parallelism and use $g$ and $\bar{g}$ instead, to avoid extra communication. In an MLP block, the input $X$ is split along the sequence dimension to perform the LayerNorm. The result of the LayerNorm is then gathered and split along the hidden dimension.

$g$ is an allgather operation along the sequence dimension in the forward pass and a reducescatter in the backward pass.

$\bar{g}$ is a reducescatter in the forward pass and an allgather in the backward pass.

By splitting $A$ along its columns ($A_{1}$ and $A_{2}$) and $B$ along its rows ($B_{1}$ and $B_{2}$), we avoid communication (for more details, please see [19]) and arrive at $W_{1}$ and $W_{2}$. These two tensors are no longer parallel and need to be summed as $W = W_{1} + W_{2}$ before they are fed into the dropout layer. However, dropout needs its input to be parallel along the sequence dimension $s$.

Instead of summing and then parallelizing along the sequence dimension, we combine these two operations into a single reducescatter. As a result, $\bar{g}$ can be a single reducescatter operation in the forward pass.
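This fusion can be sketched numerically: a reducescatter over the partial outputs produces exactly the sequence shards one would get by summing first and splitting afterwards. The rank count and tensor shapes below are illustrative assumptions:

```python
import numpy as np

def reduce_scatter(partials, n_ranks):
    # Simulated reducescatter: sum the per-rank partial results (reduce),
    # then hand each rank one chunk along the sequence dimension (scatter).
    total = np.sum(partials, axis=0)
    return np.split(total, n_ranks, axis=0)

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((2, 8, 4))        # two partial results, each (s, h)

shards = reduce_scatter([W1, W2], n_ranks=2)

# Same result as summing first and then splitting along the sequence dimension:
expected = np.split(W1 + W2, 2, axis=0)
assert all(np.allclose(a, b) for a, b in zip(shards, expected))
```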
Communication costs
Tensor parallelism requires four allreduces in a single forward and backward pass, whereas tensor together with sequence parallelism requires four allgathers and four reducescatters in a single forward and backward pass.
At first glance, it seems that tensor with sequence parallelism requires more communication than tensor parallelism.
However, we note that a ring allreduce is composed of two steps: a reducescatter followed by an allgather. As a result, the communication bandwidth used for tensor parallelism and for tensor together with sequence parallelism is the same. Therefore, sequence parallelism does not introduce any communication overhead.
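The decomposition above can be checked with a small simulation: an allreduce produces the same final tensor as a reducescatter followed by an allgather. The rank count and per-rank tensors are illustrative assumptions:

```python
import numpy as np

n_ranks = 4
rng = np.random.default_rng(2)
per_rank = [rng.standard_normal(8) for _ in range(n_ranks)]

# Allreduce: every rank ends up with the full elementwise sum.
allreduce_out = sum(per_rank)

# Reducescatter: each rank holds one reduced shard of the sum...
scattered = np.split(sum(per_rank), n_ranks)
# ...followed by an allgather: the shards are reassembled on every rank.
allgather_out = np.concatenate(scattered)

assert np.allclose(allreduce_out, allgather_out)
```

Since each half of the decomposition moves the same data volume as half an allreduce, four allgathers plus four reducescatters cost the same bandwidth as four allreduces.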