Backprop
- If an element is replicated in the forward pass (e.g. a normalization vector that is broadcast), you take the gradient at each replica and then sum all of them.
- In general, if there is a sum along an axis in the forward pass, then there is a broadcast along that same axis in the backward pass, and vice versa: a broadcast in the forward pass becomes a sum in the backward pass (see the sketch below).
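A minimal NumPy sketch of this duality (the names `x`, `b`, and the squared-sum loss are just illustrative, not from these notes): a vector `b` is broadcast across the batch axis in the forward pass, so in the backward pass its gradient is the upstream gradient summed over that same axis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # a batch of 4 activation rows
b = rng.normal(size=(3,))          # replicated (broadcast) across axis 0 in the forward pass

def loss(x, b):
    y = x + b                      # broadcast: b is added to every row of x
    return (y ** 2).sum()

dy = 2 * (x + b)                   # upstream gradient dL/dy, shape (4, 3)
db = dy.sum(axis=0)                # backward of the broadcast: sum over the broadcast axis

# Finite-difference check for one element of b.
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
db0_fd = (loss(x, b + e0) - loss(x, b - e0)) / (2 * eps)
print(np.allclose(db[0], db0_fd))  # True
```

The converse holds too: if the forward pass computed `x.sum(axis=0)`, the backward pass would broadcast that gradient back to `x`'s shape.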
Chain rule
- The downstream gradient is a function of the upstream gradient and the local gradient: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$, where $\frac{\partial L}{\partial z}$ is the upstream gradient, $\frac{\partial z}{\partial x}$ is the local gradient, and $\frac{\partial L}{\partial x}$ is the downstream gradient.
- Due to the “bottleneck”-y nature of backprop, for the gradient of a weight matrix in an MLP you only need the gradient w.r.t. that layer’s activation vector, combined with the previous layer’s activations via an outer product (see the sketch after this list).
- If $y = Wx$, then to compute $\frac{\partial L}{\partial W}$ you only need $\frac{\partial L}{\partial y}$ and $x$: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top}$ (and $\frac{\partial L}{\partial x} = W^{\top} \frac{\partial L}{\partial y}$).
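A small sketch of the last two bullets, assuming a single linear layer $y = Wx$ with a stand-in squared-sum loss (names and the loss are illustrative): the weight gradient is the outer product of the upstream gradient $\frac{\partial L}{\partial y}$ with the previous activations $x$, and the input gradient is $W^{\top}$ applied to the upstream gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))        # layer weights
x = rng.normal(size=(3,))          # previous layer's activations

def loss(W, x):
    y = W @ x                      # forward: linear layer
    return (y ** 2).sum()          # stand-in scalar loss

y = W @ x
dy = 2 * y                         # upstream gradient dL/dy (for this loss, dL/dy = 2y)
dW = np.outer(dy, x)               # dL/dW = (dL/dy) x^T -- only dy and x are needed
dx = W.T @ dy                      # dL/dx = W^T (dL/dy) -- upstream grad through local grad W

# Finite-difference check of one weight entry.
eps = 1e-6
E = np.zeros_like(W)
E[0, 1] = eps
dW01_fd = (loss(W + E, x) - loss(W - E, x)) / (2 * eps)
print(np.allclose(dW[0, 1], dW01_fd))  # True
```

This is the “bottleneck”: `dW` depends only on the vector `dy` and the cached `x`, since each weight $W_{ij}$ affects the loss only through $y_i$, scaled by $x_j$.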