Backprop

  • If an element is replicated (e.g. a normalization vector that is broadcast), you take the gradient of each replica and then sum all of them.
  • In general, if there is a sum along an axis in the forward pass, then the gradient is broadcast along that same axis in the backward pass, and vice versa for a broadcast in the forward pass (see the sketch below).
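
A minimal NumPy sketch of this duality (the shapes, variable names, and toy loss below are illustrative assumptions, not from the notes): a vector b that is broadcast in the forward pass receives the upstream gradient summed over the broadcast axis, while a forward-pass sum sends its gradient back by broadcasting.

```python
import numpy as np

# Illustrative setup: b (shape (d,)) is broadcast over the batch axis of x (shape (n, d)).
rng = np.random.default_rng(0)
n, d = 4, 3
x = rng.normal(size=(n, d))
b = rng.normal(size=d)
c = rng.normal(size=(n, d))          # fixed weights so the loss is a scalar

def loss(b):
    y = x + b                        # forward: b is replicated n times by broadcasting
    return float((c * y).sum())

# Backward: dL/dy = c; each replica of b receives one row of dL/dy,
# so dL/db is the sum of those per-replica gradients over the broadcast axis.
dL_dy = c
dL_db = dL_dy.sum(axis=0)

# Finite-difference check of the "sum over replicas" rule.
eps = 1e-6
num = np.array([(loss(b + eps * np.eye(d)[i]) - loss(b - eps * np.eye(d)[i])) / (2 * eps)
                for i in range(d)])
print(np.allclose(dL_db, num, atol=1e-4))   # True

# The converse: a sum in the forward pass becomes a broadcast in the backward pass.
s = x.sum(axis=1)                                  # forward: sum along axis 1, shape (n,)
dL_ds = rng.normal(size=n)
dL_dx = np.broadcast_to(dL_ds[:, None], x.shape)   # gradient broadcast back along axis 1
```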

Chain rule

  • $L$ is a function of $z$, and $z$ is a function of $w$: $\partial L/\partial w = (\partial L/\partial z)(\partial z/\partial w)$, where $\partial L/\partial w$ is the gradient we want, $\partial L/\partial z$ is the upstream gradient, and $\partial z/\partial w$ is the local gradient (a scalar sketch follows this list).
  • Due to the “bottleneck”-y nature of backprop, for the gradient of a weight in an MLP you only need the upstream gradient of that layer’s output activations and the previous layer’s activations; as a matrix, the weight gradient is their outer product.
  • If $y = Wx$, then to compute $\partial L/\partial W$ you only need $\partial L/\partial y$ and $x$: $\partial L/\partial W = (\partial L/\partial y)\,x^\top$ (see the linear-layer sketch below).
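
A tiny sketch of the chain-rule decomposition above, using a made-up scalar computation (z = w*x with a squared-error loss; all names and values are illustrative assumptions): the parameter gradient is the upstream gradient of the intermediate times its local gradient.

```python
import numpy as np

# Toy computation (illustrative only): z = w * x, L = (z - t)**2.
x, w, t = 3.0, 0.5, 2.0

# Forward pass
z = w * x
L = (z - t) ** 2

# Backward pass: one chain-rule factor per step.
dL_dz = 2 * (z - t)      # upstream gradient: dL/dz
dz_dw = x                # local gradient: dz/dw
dL_dw = dL_dz * dz_dw    # chain rule: upstream * local

# Finite-difference check of dL/dw.
eps = 1e-6
num = (((w + eps) * x - t) ** 2 - ((w - eps) * x - t) ** 2) / (2 * eps)
print(np.allclose(dL_dw, num))   # True
```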
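
A sketch of the linear-layer case, assuming a layer of the form y = W x and a scalar loss (the shapes, names, and the dummy loss c @ y are assumptions for illustration): the weight gradient is the outer product of the upstream gradient with the previous layer's activations, and the gradient sent back to the previous layer is W^T times the upstream gradient, so no full Jacobian of y with respect to W is ever formed.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 4, 3
x = rng.normal(size=d_in)            # previous layer's activations
W = rng.normal(size=(d_out, d_in))   # weights of this layer
c = rng.normal(size=d_out)           # fixed vector so that L = c @ y is a scalar

def loss(W):
    y = W @ x                        # forward: linear layer
    return float(c @ y)

# Backward: dL/dW needs only the upstream gradient dL/dy and the inputs x.
dL_dy = c                            # upstream gradient (trivially c for this loss)
dL_dW = np.outer(dL_dy, x)           # weight gradient: outer product, shape (d_out, d_in)
dL_dx = W.T @ dL_dy                  # gradient passed to the previous layer

# Finite-difference check of a single weight entry.
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
num = (loss(W + E) - loss(W - E)) / (2 * eps)
print(np.allclose(dL_dW[1, 2], num))   # True
```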