Control Vectors

A control vector is a vector (technically a list of vectors, one per layer) that you can apply to a model's activations during inference to control its behavior without additional prompting. The idea is quite similar to LoRA. To compute one:
1. Build a dataset of contrasting prompt pairs, for example ("[INST] Act extremely happy. [/INST] I am", "[INST] Act extremely sad. [/INST] I am"), where the part after [/INST] is drawn from a diverse set of short suffixes for the model to complete.
2. Run the target model forward over that dataset, collecting each layer's hidden state at the last token, where the model predicts a continuation of the suffix under the given persona.
3. Take the difference between the positive and negative hidden states of each pair to get a set of relative hidden states.
4. Use single-component PCA on those relative hidden states to get a control vector for each layer.
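
The last two steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation. It uses synthetic arrays in place of real hidden states, so no model is loaded. The names control_vector_for_layer and apply_control, the array shapes, the sign convention, and the additive way of applying the vector are all assumptions made for this example.

```python
import numpy as np


def control_vector_for_layer(pos_hidden, neg_hidden):
    """Return a unit-norm direction for one layer.

    pos_hidden, neg_hidden: arrays of shape (n_pairs, hidden_dim) holding the
    last-token hidden states for the positive and negative prompts.
    """
    # Relative hidden states: one row per contrasting pair.
    relative = pos_hidden - neg_hidden
    # Single-component PCA: top right-singular vector of the mean-centered
    # data, i.e. the direction along which the differences vary the most.
    centered = relative - relative.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign arbitrary. Assumption for this sketch: orient the
    # vector so that positive examples project higher than negative ones.
    if np.mean(relative @ direction) < 0:
        direction = -direction
    return direction


def apply_control(hidden, direction, strength):
    """Shift hidden states along the direction (additive steering is assumed)."""
    return hidden + strength * direction


# Synthetic stand-in for collected hidden states (hypothetical sizes).
rng = np.random.default_rng(0)
n_layers, n_pairs, hidden_dim = 3, 64, 16

# A hidden "persona" direction per layer that the procedure should recover.
true_dirs = rng.normal(size=(n_layers, hidden_dim))
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)

# Shared content per pair, plus a persona offset whose size varies by suffix.
base = rng.normal(size=(n_layers, n_pairs, hidden_dim))
offset = rng.uniform(0.5, 3.0, size=(n_layers, n_pairs, 1)) * true_dirs[:, None, :]
pos = base + offset + 0.05 * rng.normal(size=base.shape)
neg = base - offset + 0.05 * rng.normal(size=base.shape)

# One control vector per layer.
control_vectors = np.stack(
    [control_vector_for_layer(pos[layer], neg[layer]) for layer in range(n_layers)]
)

# Cosine similarity between each recovered vector and the planted direction.
cosines = np.sum(control_vectors * true_dirs, axis=1)

# Push the "negative" states of layer 0 toward the positive persona.
steered = apply_control(neg[0], control_vectors[0], strength=2.0)
```

With real data, pos and neg would instead hold the per-layer last-token hidden states collected in step 2. Standard PCA mean-centers its input, so it finds the direction along which the pair differences vary rather than their average. That matches the planted direction here because the offset size differs from pair to pair.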

🤖 Harold's Notes