• Two perspectives on SFT:

    • Pre-training perspective
      • You can teach the model what it does not know
      • Higher tolerance for noise (e.g., low-quality samples)?
      • How much synthesized data do we want to mix in?
      • Packing (concatenating multiple short examples into one sequence; see the sketch after this list)
    • Alignment perspective
      • Quality is important; we are teaching models appropriate behaviors (LIMA paper)
      • The model should not answer what it does not know (different from pre-training)
      • Mask out the instruction tokens in the loss (see the sketch after this list)
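
A minimal sketch of packing, assuming already-tokenized examples and an `eos_token_id` separator; the function and argument names are illustrative. Real implementations usually also build attention masks / position ids so packed examples cannot attend to each other.

```python
def pack_sequences(examples, max_len, eos_token_id):
    """Greedily concatenate tokenized examples into sequences of at most
    max_len tokens, with EOS separating the original examples."""
    packed, buffer = [], []
    for tokens in examples:
        candidate = (tokens + [eos_token_id])[:max_len]  # truncate overly long examples
        if len(buffer) + len(candidate) > max_len:
            packed.append(buffer)  # flush the current sequence
            buffer = []
        buffer += candidate
    if buffer:
        packed.append(buffer)
    return packed
```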
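
And a sketch of masking out the instruction so the loss is computed only on response tokens; -100 is the index PyTorch's cross-entropy loss ignores by default (Hugging Face trainers use the same convention). The helper name is illustrative.

```python
IGNORE_INDEX = -100  # skipped by torch.nn.CrossEntropyLoss by default

def mask_instruction_labels(input_ids, instruction_len):
    """Labels are a copy of input_ids, except that instruction tokens are
    set to IGNORE_INDEX so gradients come only from the response."""
    labels = list(input_ids)
    for i in range(min(instruction_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```
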
  • AllenAI to the rescue

    • Tulu2 already aggregates a fairly high-quality mixture of FLAN, CoT, synthetic, and alignment data (SuperNI, CoT, FlanV2, Alpaca, Code-Alpaca, …)
  • Some interesting new work was recently released (MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following)

    • Given any piece of text (e.g., Swiss legal data), it can create high-quality instruction data from it
    • It shows very strong results.
    • In their case, they sample from Dolma
    • Caveat: requires a competent LLM to “create” the instructions given the input, and again to generate the output given the (instruction, input) pair (see the sketch after this list).
    • 68K (instruction, input, output) instances in MUFFIN
    • Potential usage: custom IFT data for Swiss matters.
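
A rough sketch of the two LLM calls such a pipeline needs; `llm` stands for whatever competent model is available, and the prompts are invented for illustration, not MUFFIN's actual prompts.

```python
def curate_instruction_data(text, llm):
    """MUFFIN-style curation: raw text -> instruction -> output."""
    # Step 1: have the LLM invent an instruction that fits the text.
    instruction = llm(
        "Write a task instruction for which the following text "
        f"could serve as the input:\n\n{text}"
    )
    # Step 2: have the LLM answer the (instruction, input) pair.
    output = llm(f"Instruction: {instruction}\n\nInput: {text}\n\nResponse:")
    return {"instruction": instruction, "input": text, "output": output}
```
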
  • Training tricks exist for SFT

    • NEFTune is a regularization technique: add noise to the embeddings during SFT to avoid overfitting (see the sketch below).
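
A sketch of the NEFTune noise rule from the paper: uniform noise scaled by alpha / sqrt(seq_len * dim), added to the input embeddings during training only (the paper tries alpha in {5, 10, 15}). The standalone function here is illustrative; in practice this is hooked into the model's embedding layer.

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to a (batch, seq_len, dim) embedding tensor."""
    seq_len, dim = embeddings.shape[-2], embeddings.shape[-1]
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1)  # Uniform(-1, 1)
    return embeddings + scale * noise  # apply only in training mode
```
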
  • Whether at large scale (multi-task, FLAN) or small scale (LIMA), instruction formatting & content are crucial, e.g. chain of thought, few-shot examples, acknowledging the question before answering (see the example after this list).

    • Tokens can have very different information density, so give language models time to think. (Jason Wei)
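
For instance, a single SFT instance can bake these habits directly into the target text (contents invented for illustration):

```python
example = {
    "instruction": "Is 1,234 divisible by 3?",
    "output": (
        "Let's check whether 1,234 is divisible by 3. "      # acknowledge the question
        "A number is divisible by 3 iff its digit sum is: "  # chain of thought
        "1 + 2 + 3 + 4 = 10, and 10 is not divisible by 3. "
        "So 1,234 is not divisible by 3."                    # answer after the reasoning
    ),
}
```
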
  • Things to keep in mind for instruction data

    • diversity / mixture
    • complexity
    • quality
    • format (ChatML-style; see the sketch below)
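
On the format point, a sketch of ChatML-style rendering; the `<|im_start|>` / `<|im_end|>` markers are the ones ChatML uses, while the helper itself is illustrative.

```python
def to_chatml(messages):
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML-style string."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is SFT?"},
]))
```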