- A control vector is a vector (technically a list of vectors, one per layer) that you can apply to a model’s activations during inference to steer its behavior without additional prompting. In purpose this is quite similar to LoRA, although a control vector acts on activations at inference time rather than on the weights.
- Build a dataset of contrasting prompt pairs, for example `("[INST] Act extremely happy. [/INST] I am", "[INST] Act extremely sad. [/INST] I am")`, where the part after `[/INST]` is drawn from a diverse set of short suffixes for the model to complete.
- Run the target model forward over that dataset, collecting each layer’s hidden state at the last token position, which is where the model predicts a continuation of the suffix under the given persona.
- For each pair, subtract the negative example’s hidden states from the positive example’s to get a set of relative hidden states.
- Run single-component PCA on those relative hidden states, layer by layer; the first principal component is the control vector for that layer.
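
The steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation of any particular library. All helper names (`make_pairs`, `fake_hidden_states`, `first_principal_component`, `control_vectors`, `apply_control`) are hypothetical. The hidden states are synthetic stand-ins rather than activations collected from a real model, the first principal component is found by power iteration instead of a library PCA routine, and both the sign fix and the additive `apply_control` step are assumptions about how the vector would be oriented and used.

```python
import math
import random


def make_pairs(suffixes, pos="happy", neg="sad"):
    """Contrasting prompt pairs that differ only in the persona word."""
    template = "[INST] Act extremely {persona}. [/INST] {suffix}"
    return [(template.format(persona=pos, suffix=s),
             template.format(persona=neg, suffix=s)) for s in suffixes]


def fake_hidden_states(n_pairs, dim, direction, rng):
    """ASSUMPTION: synthetic stand-in for one layer's last-token hidden states."""
    pos, neg = [], []
    for _ in range(n_pairs):
        base = [rng.gauss(0, 1) for _ in range(dim)]  # content shared by the pair
        strength = rng.uniform(0.5, 2.0)              # how strongly the persona shows
        pos.append([b + strength * d + rng.gauss(0, 0.05) for b, d in zip(base, direction)])
        neg.append([b - strength * d + rng.gauss(0, 0.05) for b, d in zip(base, direction)])
    return pos, neg


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_principal_component(rows, iters=200, seed=0):
    """Top PCA direction of `rows`, via power iteration on the centered data."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(iters):
        # Multiply v by X^T X without building the covariance matrix.
        proj = [dot(c, v) for c in centered]
        v = [sum(p * c[j] for p, c in zip(proj, centered)) for j in range(dim)]
        norm = math.sqrt(dot(v, v))
        if norm == 0:
            raise ValueError("relative hidden states have no variance")
        v = [x / norm for x in v]
    return v


def control_vectors(pos_by_layer, neg_by_layer):
    """One unit-length control vector per layer."""
    vectors = {}
    for layer in pos_by_layer:
        # Relative hidden states: positive minus negative, pair by pair.
        diffs = [[p - n for p, n in zip(ps, ns)]
                 for ps, ns in zip(pos_by_layer[layer], neg_by_layer[layer])]
        v = first_principal_component(diffs)
        # PCA leaves the sign undetermined; orient v toward the positive examples.
        if sum(dot(d, v) for d in diffs) < 0:
            v = [-x for x in v]
        vectors[layer] = v
    return vectors


def apply_control(hidden, vector, coeff):
    """ASSUMPTION: steering adds a scaled control vector to a hidden state."""
    return [h + coeff * v for h, v in zip(hidden, vector)]


if __name__ == "__main__":
    print(make_pairs(["I am", "The weather is"])[0])
    rng = random.Random(0)
    pos, neg = fake_hidden_states(50, 8, [0.6, 0.8, 0, 0, 0, 0, 0, 0], rng)
    print(control_vectors({0: pos}, {0: neg})[0])
```

With a real model, `fake_hidden_states` would be replaced by a forward pass that records each layer's last-token hidden state for every prompt in the pairs. Note that PCA centers the data, so the recovered direction depends on the differences varying in strength from pair to pair; the synthetic data is built that way on purpose.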