input is the tensor to be reduced and scattered; its size should be the output tensor size times the group size, and it may be either a concatenation or a stack of the per-rank output tensors.
dist.gather(tensor, gather_list, dst, group): Copies tensor from all processes into gather_list on the dst process.
dist.all_gather(tensor_list, tensor, group): Copies tensor from all processes to tensor_list, on all processes.
The equivalent single-tensor variant is dist.all_gather_into_tensor(output_tensor, input_tensor, group), where output_tensor must be correctly sized, i.e. either a concatenation of all the input tensors along the primary dimension, or a stack of all the input tensors along the primary dimension.
dist.barrier(group): Blocks all processes in group until each one has entered this function.
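To make the collectives above concrete, here is a runnable sketch of dist.all_gather and dist.barrier using the "gloo" (CPU) backend. The init_process and main launch helpers, the port numbers, and the "fork" start method are assumptions made to keep the whole example in one file; they are not part of the torch.distributed API being demonstrated.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    tensor = torch.tensor([float(rank)])
    # all_gather: every rank ends up with every rank's tensor.
    gathered = [torch.zeros(1) for _ in range(size)]
    dist.all_gather(gathered, tensor)
    assert [t.item() for t in gathered] == [float(r) for r in range(size)]
    # The single-tensor variant would instead fill one pre-allocated
    # output of size (input size x world size) — the concatenation:
    concatenated = torch.cat(gathered)
    assert concatenated.shape[0] == size * tensor.shape[0]
    # barrier: no rank proceeds past this line until all have reached it.
    dist.barrier()

def init_process(rank, size, fn, port):
    # Helper (an assumption of this sketch): set up the rendezvous
    # point, initialise the process group, then run the workload.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", rank=rank, world_size=size)
    fn(rank, size)
    dist.destroy_process_group()

def main(world_size=2, port=29500):
    # "fork" keeps the example self-contained in a single file; a
    # standalone script would typically use mp.spawn instead.
    mp.start_processes(init_process, args=(world_size, run, port),
                       nprocs=world_size, join=True, start_method="fork")

if __name__ == "__main__":
    main()
```

If any rank's assertions fail, the corresponding worker process raises and the launcher reports the failure, so running the file silently is the success case.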
Point-to-Point Communication
A transfer of data from one process to another is called a point-to-point communication. These are achieved through the send and recv functions or their immediate counterparts, isend and irecv.
Example
Two processes communicate a tensor
Both processes start with a zero tensor, then process 0 increments the tensor and sends it to process 1 so that they both end up with 1.0. Notice that process 1 needs to allocate memory in order to store the data it will receive.
Blocking point-to-point communication
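A runnable version of the blocking example, using the "gloo" (CPU) backend. The init_process and main launch helpers, the port, and the "fork" start method are scaffolding assumed here to keep everything in one file.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        # Blocks until the send to process 1 has completed.
        dist.send(tensor=tensor, dst=1)
    else:
        # Blocks until the data from process 0 has arrived.
        dist.recv(tensor=tensor, src=0)
    # Both ranks now hold the value 1.0.
    assert tensor[0] == 1
    print("Rank", rank, "has data", tensor[0])

def init_process(rank, size, fn, port):
    # Helper (an assumption of this sketch) to set up the process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", rank=rank, world_size=size)
    fn(rank, size)
    dist.destroy_process_group()

def main(world_size=2, port=29501):
    # "fork" keeps the example in one file; a standalone script would
    # typically use mp.spawn instead.
    mp.start_processes(init_process, args=(world_size, run, port),
                       nprocs=world_size, join=True, start_method="fork")

if __name__ == "__main__":
    main()
```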
Non-Blocking point-to-point communication
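The same exchange with the immediate operations isend and irecv, again as a single-file sketch on the "gloo" backend; the launch scaffolding (init_process, main, port, "fork" start method) is assumed, not part of the API being shown. Both calls return a request object immediately, and the tensors must be left alone until req.wait() returns.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        req = dist.isend(tensor=tensor, dst=1)  # returns immediately
    else:
        req = dist.irecv(tensor=tensor, src=0)  # returns immediately
    # The transfer may still be in flight here: do not write to the
    # sent tensor or read the received one before wait() returns.
    req.wait()
    # After wait(), the communication is guaranteed to have happened.
    assert tensor[0] == 1
    print("Rank", rank, "has data", tensor[0])

def init_process(rank, size, fn, port):
    # Helper (an assumption of this sketch) to set up the process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", rank=rank, world_size=size)
    fn(rank, size)
    dist.destroy_process_group()

def main(world_size=2, port=29502):
    # "fork" keeps the example in one file; a standalone script would
    # typically use mp.spawn instead.
    mp.start_processes(init_process, args=(world_size, run, port),
                       nprocs=world_size, join=True, start_method="fork")

if __name__ == "__main__":
    main()
```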
When using immediate operations we have to be careful about how we use the sent and received tensors. Since we do not know when the data will be communicated to the other process, we should neither modify the sent tensor nor access the received tensor before req.wait() has returned.
In other words:
writing to tensor after dist.isend() will result in undefined behaviour.
reading from tensor after dist.irecv() will result in undefined behaviour.
However, after req.wait() has been executed we are guaranteed that the communication took place, and that the value stored in tensor[0] is 1.0.