loss.grad is initialized to 1 when loss.backward() is called. This makes sense because dL/dL = 1. Technically, the variables we call grad_ are not really gradients; they are Jacobians left-multiplied by a vector (vector–Jacobian products).
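A minimal sketch of this, assuming PyTorch: for a scalar loss, backward() implicitly seeds the backward pass with 1, while for a non-scalar output you have to supply the vector v in the vector–Jacobian product yourself. The tensor values here are just illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()

# For a scalar loss this is equivalent to loss.backward(torch.tensor(1.0)),
# i.e. the seed gradient is 1 because dL/dL = 1.
loss.backward()
print(x.grad)  # tensor([2., 4., 6.]) -- dL/dx = 2x

# For a non-scalar output, .grad really holds a vector-Jacobian product:
# we must pass the vector v, and autograd computes v^T @ J.
x.grad = None
y = x ** 2                          # non-scalar output
v = torch.tensor([1.0, 1.0, 1.0])   # the "vector" in the VJP
y.backward(v)
print(x.grad)  # tensor([2., 4., 6.]) -- same as summing first, since v is all ones
```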