• loss.grad is initialized to 1 when loss.backward() is called. Makes sense because the gradient of the loss with respect to itself is 1 (d loss / d loss = 1); that value is the seed that the backward pass propagates through the rest of the graph (see the first sketch below).
  • Technically, these variables that we call grad_ are not really gradients; they are Jacobians left-multiplied by a vector, i.e. vector-Jacobian products v^T J (see the second sketch below).
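
A minimal sketch of the first point, assuming a toy loss = sum(x**2) (the tensor values and variable names here are illustrative, not from the original notes). Because loss is a non-leaf tensor, we call retain_grad() so its .grad field is kept and we can see the seed value of 1:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()   # scalar loss, a non-leaf tensor
loss.retain_grad()      # keep .grad on a non-leaf so we can inspect it

loss.backward()         # same as loss.backward(gradient=torch.tensor(1.0))

print(loss.grad)        # tensor(1.)  -- d(loss)/d(loss) = 1, the seed of the backward pass
print(x.grad)           # tensor([2., 4., 6.])  -- 2x, propagated from that seed
```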
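And a sketch of the second point, again with made-up values: for a non-scalar output y, backward() requires an explicit `gradient` vector v, and what lands in x.grad is the vector-Jacobian product v^T J rather than a full Jacobian:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                            # non-scalar output; Jacobian J = diag(2x)

v = torch.tensor([0.1, 1.0, 10.0])    # the vector that left-multiplies the Jacobian
y.backward(gradient=v)                # computes v^T J, never materializing J

print(x.grad)                         # tensor([ 0.2,  4.0, 60.0]) == v * 2x, i.e. v^T J for a diagonal J
```

For a scalar loss, PyTorch simply defaults v to 1, which is why loss.grad shows up as 1 in the first sketch.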