• You should not use a bias in the linear layer that precedes a BN layer, because the normalization subtracts the per-feature mean and removes any constant offset anyway (see the first sketch after this list).
  • BN ensures that (at init time at least) all the features are unit gaussian, no matter how weird your NN is architecture-wise. The learnable gain and bias then allow the NN to scale and shift that gaussian distribution however it prefers.
  • Typically placed after any linear, convolutional, or attention layer.
  • Introduces a dependency on the running mean and std statistics at test time.
  • No one likes this, but the next effect is what makes BatchNorm hard to remove from NN training.
    • Also introduces a dependency between the examples in a batch: an outlier shifts the batch mean and thus the activations of every individual example. This noise is actually helpful most of the time, as it acts as a regularizer (see the second sketch below).
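
A minimal PyTorch sketch of the placement and bias points above: the linear layers drop their bias because BatchNorm cancels any constant offset and re-adds its own learnable bias anyway. The layer sizes here are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Linear -> BatchNorm -> nonlinearity: the Linear bias is disabled because
# BatchNorm subtracts the per-feature mean (cancelling any constant offset)
# and then adds its own learnable bias (beta).
model = nn.Sequential(
    nn.Linear(100, 200, bias=False),
    nn.BatchNorm1d(200),
    nn.Tanh(),
    nn.Linear(200, 10, bias=False),
    nn.BatchNorm1d(10),
)

x = torch.randn(32, 100)   # batch of 32 examples
out = model(x)             # shape (32, 10)
```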
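And a hedged, from-scratch sketch of what the layer computes (not any library's exact internals; names like `bngain`, `bnbias`, and `momentum` are illustrative): at training time the batch statistics couple the examples, the gain/bias let the NN rescale the unit-gaussian features, and the running statistics are what the layer depends on at test time.

```python
import torch

n_hidden = 200
bngain = torch.ones(1, n_hidden)            # learnable gain (gamma)
bnbias = torch.zeros(1, n_hidden)           # learnable bias (beta)
bnmean_running = torch.zeros(1, n_hidden)   # estimated for use at test time
bnstd_running = torch.ones(1, n_hidden)
momentum = 0.001
eps = 1e-5

hpreact = torch.randn(32, n_hidden)  # pre-activations from the (bias-free) linear layer

# Training: normalize with *batch* statistics, so every example's activation
# depends on every other example in the batch (the coupling / regularizing noise).
bnmeani = hpreact.mean(0, keepdim=True)
bnstdi = hpreact.std(0, keepdim=True)
hnorm = (hpreact - bnmeani) / (bnstdi + eps)  # roughly unit gaussian at init
h = bngain * hnorm + bnbias                   # shifted/scaled as the NN prefers

# Maintain running estimates for test time (outside the autograd graph).
with torch.no_grad():
    bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmeani
    bnstd_running = (1 - momentum) * bnstd_running + momentum * bnstdi

# Test time: fixed running statistics, no cross-example dependency.
h_test = bngain * (hpreact - bnmean_running) / (bnstd_running + eps) + bnbias
```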