- full-parameter finetuning changes every parameter
- LoRA generalizes finetuning by introducing two knobs:
  - how many parameters (specifically, which matrices) we update
  - how much we update those matrices (by varying the rank r)
- So, during finetuning the original weights stay frozen, and for each matrix we want to update we introduce a residual path carrying a rank-r update (written as W* = AB, where A is d×r and B is r×d), which we optimize by gradient descent (see the sketch after this list)
- At inference time, we can just add W* into the original matrix (so no extra inference latency), and we can switch between different LoRAs depending on the task just by subtracting W* back out (the merge/unmerge steps in the sketch below)
- One can stack multiple LoRAs together (the updates are additive), and it tends to work in practice (second sketch below)
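
A minimal sketch of the residual path and the merge/unmerge trick, assuming PyTorch; the `LoRALinear` name, the `alpha / r` scaling, and the init scheme are illustrative choices, not something fixed by the notes above:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank residual path W* = A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # only A and B get gradients

        d_out, d_in = base.weight.shape                    # nn.Linear stores weight as (d_out, d_in)
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01) # d x r
        self.B = nn.Parameter(torch.zeros(r, d_out))       # r x d, zero init => W* = 0 at the start
        self.scale = alpha / r
        self.merged = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        if not self.merged:
            out = out + (x @ self.A @ self.B) * self.scale # the residual low-rank path
        return out

    @torch.no_grad()
    def merge(self):
        # fold W* into the base weight: a single matmul at inference, no extra latency
        if not self.merged:
            self.base.weight += (self.A @ self.B).T * self.scale
            self.merged = True

    @torch.no_grad()
    def unmerge(self):
        # subtract W* back out, e.g. to swap in a different task's LoRA
        if self.merged:
            self.base.weight -= (self.A @ self.B).T * self.scale
            self.merged = False

# usage: wrap a layer, train only A and B, then merge for inference
layer = LoRALinear(nn.Linear(768, 768), r=8)
nn.init.normal_(layer.B, std=0.02)                 # stand-in for a trained adapter
x = torch.randn(2, 768)
y = layer(x)
layer.merge()
assert torch.allclose(y, layer(x), atol=1e-4)      # same outputs after merging
layer.unmerge()                                    # back to the pristine base weights
```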
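
And since each W* is just an additive low-rank matrix, stacking amounts to merging several of them into the same base weight. A sketch with random stand-in adapters (not trained ones):

```python
import torch
import torch.nn as nn

d, r = 768, 8
base = nn.Linear(d, d)

# two stand-in adapters, e.g. trained on different tasks (random here for illustration)
adapters = [
    (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
    (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
]

with torch.no_grad():
    for A, B in adapters:
        base.weight += (A @ B).T   # each W* adds on top of the last; subtract one to remove it
```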