Backprop
- If an element is replicated in the forward pass (e.g. a normalization vector that is broadcast), you take the gradient at each replica and then sum all of them.
- In general, if there is a sum along an axis in the forward pass, then there is a broadcast along that same axis in the backward pass, and vice versa: a broadcast in the forward pass becomes a sum in the backward pass (see the sketch below).
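A minimal NumPy sketch of this duality (the names `x`, `b`, and the squared-sum loss are just illustrative, not from these notes): a vector `b` is broadcast across the batch axis in the forward pass, so in the backward pass its gradient is the upstream gradient summed over that same axis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # a batch of 4 activation rows
b = rng.normal(size=(3,))          # replicated (broadcast) across axis 0 in the forward pass

def loss(x, b):
    y = x + b                      # broadcast: b is added to every row of x
    return (y ** 2).sum()

dy = 2 * (x + b)                   # upstream gradient dL/dy, shape (4, 3)
db = dy.sum(axis=0)                # backward of the broadcast: sum over the broadcast axis

# Finite-difference check for one element of b.
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
db0_fd = (loss(x, b + e0) - loss(x, b - e0)) / (2 * eps)
print(np.allclose(db[0], db0_fd))  # True
```

The converse holds too: if the forward pass computed `x.sum(axis=0)`, the backward pass would broadcast that gradient back to `x`'s shape.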
Chain rule
- The downstream gradient is a function of the upstream gradient and the local gradient: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$, where $\frac{\partial L}{\partial z}$ is the upstream gradient, $\frac{\partial z}{\partial x}$ is the local gradient, and $\frac{\partial L}{\partial x}$ is the downstream gradient.
- Due to the “bottleneck”-y nature of backprop, for the gradient of a weight matrix in an MLP you only need the gradient w.r.t. that layer’s activation vector, combined with the previous layer’s activations via an outer product (see the sketch after this list).
- If $y = Wx$, then to compute $\frac{\partial L}{\partial W}$ you only need $\frac{\partial L}{\partial y}$ and $x$: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top}$ (and $\frac{\partial L}{\partial x} = W^{\top} \frac{\partial L}{\partial y}$).
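A small sketch of the last two bullets, assuming a single linear layer $y = Wx$ with a stand-in squared-sum loss (names and the loss are illustrative): the weight gradient is the outer product of the upstream gradient $\frac{\partial L}{\partial y}$ with the previous activations $x$, and the input gradient is $W^{\top}$ applied to the upstream gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))        # layer weights
x = rng.normal(size=(3,))          # previous layer's activations

def loss(W, x):
    y = W @ x                      # forward: linear layer
    return (y ** 2).sum()          # stand-in scalar loss

y = W @ x
dy = 2 * y                         # upstream gradient dL/dy (for this loss, dL/dy = 2y)
dW = np.outer(dy, x)               # dL/dW = (dL/dy) x^T -- only dy and x are needed
dx = W.T @ dy                      # dL/dx = W^T (dL/dy) -- upstream grad through local grad W

# Finite-difference check of one weight entry.
eps = 1e-6
E = np.zeros_like(W)
E[0, 1] = eps
dW01_fd = (loss(W + E, x) - loss(W - E, x)) / (2 * eps)
print(np.allclose(dW[0, 1], dW01_fd))  # True
```

This is the “bottleneck”: `dW` depends only on the vector `dy` and the cached `x`, since each weight $W_{ij}$ affects the loss only through $y_i$, scaled by $x_j$.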