Two perspectives on SFT:
- Pre-training perspective
  - You can teach the model things it does not know
  - Higher tolerance for noise (e.g., low-quality samples)
  - How much synthesized data do we want to mix in?
  - Packing (see the packing sketch after this list)
- Alignment perspective
  - Quality is important; we are teaching models appropriate behaviors (LIMA paper)
  - The model should not answer what it does not know (unlike in pre-training)
  - Mask out the instruction so the loss covers only the response (see the loss-masking sketch below)
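A minimal sketch of what packing means in practice, assuming examples are already tokenized to lists of token ids; `block_size` and `eos_id` are illustrative defaults, not tied to any particular model.

```python
from typing import Iterable, Iterator

def pack_sequences(
    examples: Iterable[list[int]],
    block_size: int = 2048,
    eos_id: int = 2,
) -> Iterator[list[int]]:
    """Greedily concatenate tokenized examples (separated by EOS)
    into fixed-length blocks, so short samples waste no padding."""
    buffer: list[int] = []
    for ids in examples:
        buffer.extend(ids)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer now holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    if buffer:
        yield buffer  # final partial block; pad downstream
```

Real collators also adjust attention masks / position ids so packed samples do not attend across example boundaries; that detail is omitted here.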
-
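A minimal sketch of instruction loss masking, assuming Hugging Face-style labels where -100 is ignored by the cross-entropy loss; the function name is illustrative.

```python
IGNORE_INDEX = -100  # ignore_index convention used by HF / PyTorch CrossEntropyLoss

def build_labels(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Train only on the response: instruction tokens are masked out of the loss."""
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(response_ids),
    }
```

With `transformers`, passing these `labels` to a causal LM makes the loss skip the masked positions, so the model is never penalized for "predicting" the instruction.
-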
AllenAI to the rescue
- Tulu2 already aggregates a fairly high-quality mixture of FLAN, CoT, synthetic, and alignment data (SuperNI, CoT, FlanV2, Alpaca, Code-Alpaca, …); a toy sketch of weighted mixture sampling follows.
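A toy sketch of weighted mixture sampling across SFT datasets; the dataset names, contents, and weights below are made up for illustration, not Tulu2's actual recipe.

```python
import random

# Illustrative mini-datasets; real ones would be lists of (instruction, output) pairs.
datasets = {
    "flan_v2": ["flan example 1", "flan example 2"],
    "cot": ["cot example 1"],
    "code_alpaca": ["code example 1"],
}
weights = {"flan_v2": 0.5, "cot": 0.3, "code_alpaca": 0.2}

def sample_mixture(n: int, seed: int = 0) -> list[str]:
    """Draw n training examples, choosing the source dataset by weight."""
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    return [
        rng.choice(datasets[rng.choices(names, weights=probs, k=1)[0]])
        for _ in range(n)
    ]
```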
-
Interesting new work was recently released: MUFFIN (Curating Multi-Faceted Instructions for Improving Instruction-Following)
- Given any piece of text (e.g., Swiss legal data), it can create high-quality instruction data from it
- It shows very strong results.
- In their case, they sample from Dolma
- Caveat: requires a competent LLM to “create” instructions for a given input, and again to generate the output for each (instruction, input) pair (see the sketch after this list).
- 68K (instruction, input, output) instances in MUFFIN
- Potential usage: custom IFT data for Swiss matters.
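A hedged sketch of the two-step recipe described above: (1) ask a strong LLM to propose instructions grounded in a raw text snippet, (2) ask it to answer each (instruction, input) pair. The prompts, the `gpt-4o` model name, and the `ask`/`make_ift_examples` helpers are illustrative assumptions, not MUFFIN's actual prompts or code; this uses the official OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for any competent instruction-following LLM
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def make_ift_examples(text: str, n: int = 3) -> list[dict]:
    # Step 1: have the LLM propose n tasks grounded in the raw text.
    instructions = ask(
        f"Propose {n} diverse tasks, one per line, that use the "
        f"following text as their input:\n\n{text}"
    ).splitlines()
    # Step 2: generate an output for each (instruction, input) pair.
    return [
        {"instruction": inst.strip(), "input": text,
         "output": ask(f"{inst}\n\nInput:\n{text}")}
        for inst in instructions[:n] if inst.strip()
    ]
```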
-
Training tricks exist for SFT
- NEFTune is a regularization technique: add noise to the token embeddings during SFT to avoid overfitting (sketch below).
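A minimal PyTorch sketch of the idea: the uniform noise scaled by alpha / sqrt(seq_len * hidden_dim) follows the NEFTune paper, but the wiring into a model (typically a forward hook on the embedding layer) is omitted.

```python
import torch

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0,
                  training: bool = True) -> torch.Tensor:
    """embeds: (batch, seq_len, hidden_dim) output of the embedding layer."""
    if not training:
        return embeds  # noise is applied only during SFT, never at inference
    _, seq_len, hidden_dim = embeds.shape
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeds).uniform_(-1.0, 1.0) * scale
    return embeds + noise
```

Recent `transformers`/TRL trainers expose this directly via a `neftune_noise_alpha` argument, so you rarely need to implement it yourself.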
-
Whether at large scale (multi-task, FLAN) or small scale (LIMA), instruction formatting and content are crucial, e.g., chain of thought, few-shot examples, acknowledging the question before answering.
- Tokens can have very different information density, so give language models time to think. (Jason Wei)
-
Things to keep in mind for instruction data:
- diversity / mixture
- complexity
- quality
- format (ChatML-style; see the example below)
-
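To make the format bullet concrete, here is one illustrative training example serialized ChatML-style; the `<|im_start|>`/`<|im_end|>` tokens follow OpenAI's ChatML convention, and the content is invented. The assistant turn deliberately acknowledges the question and reasons before answering, in the spirit of the formatting advice above.

```python
# One made-up SFT example in ChatML-style serialization.
example = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Is 1013 prime?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Good question. It is enough to test primes up to sqrt(1013) ≈ 31.8: "
    "none of 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 divides 1013, "
    "so 1013 is prime.<|im_end|>\n"
)
```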