• Two perspectives on SFT:

    • Pre-training perspective
      • You can teach the model what it does not know
      • Higher tolerance for noise (e.g., low-quality samples)?
      • How much synthesized data do we want to mix in?
      • Packing (concatenating multiple short examples into one sequence; see the sketch after this list)
    • Alignment perspective
      • Quality is important; we are teaching models appropriate behaviors (LIMA paper)
      • The model should not answer what it does not know (different from pre-training)
      • Mask out the instruction tokens in the loss (see the sketch after this list)
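
A minimal sketch of packing, assuming already-tokenized examples and an `eos_token_id` separator; the function and argument names are illustrative. Real implementations usually also build attention masks / position ids so packed examples cannot attend to each other.

```python
def pack_sequences(examples, max_len, eos_token_id):
    """Greedily concatenate tokenized examples into sequences of at most
    max_len tokens, with EOS separating the original examples."""
    packed, buffer = [], []
    for tokens in examples:
        candidate = (tokens + [eos_token_id])[:max_len]  # truncate overly long examples
        if len(buffer) + len(candidate) > max_len:
            packed.append(buffer)  # flush the current sequence
            buffer = []
        buffer += candidate
    if buffer:
        packed.append(buffer)
    return packed
```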
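
And a sketch of masking out the instruction so the loss is computed only on response tokens; -100 is the index PyTorch's cross-entropy loss ignores by default (Hugging Face trainers use the same convention). The helper name is illustrative.

```python
IGNORE_INDEX = -100  # skipped by torch.nn.CrossEntropyLoss by default

def mask_instruction_labels(input_ids, instruction_len):
    """Labels are a copy of input_ids, except that instruction tokens are
    set to IGNORE_INDEX so gradients come only from the response."""
    labels = list(input_ids)
    for i in range(min(instruction_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```
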
  • AllenAI to the rescue

    • Tulu2 already aggregates a fairly high-quality mixture of FLAN, CoT, synthetic, and alignment data (SuperNI, CoT, FlanV2, Alpaca, Code-Alpaca, …)
  • Some interesting new work was recently released (MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following)

    • Given any piece of text (e.g., Swiss legal data), it can create high-quality instruction data from it
    • It shows very strong results.
    • In their case, they sample from Dolma
    • Caveat: requires a competent LLM to “create” the instructions given the input, and again to generate the output given the (instruction, input) pair (see the sketch after this list).
    • 68K (instruction, input, output) instances in MUFFIN
    • Potential usage: custom IFT data for Swiss matters.
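
A rough sketch of the two LLM calls such a pipeline needs; `llm` stands for whatever competent model is available, and the prompts are invented for illustration, not MUFFIN's actual prompts.

```python
def curate_instruction_data(text, llm):
    """MUFFIN-style curation: raw text -> instruction -> output."""
    # Step 1: have the LLM invent an instruction that fits the text.
    instruction = llm(
        "Write a task instruction for which the following text "
        f"could serve as the input:\n\n{text}"
    )
    # Step 2: have the LLM answer the (instruction, input) pair.
    output = llm(f"Instruction: {instruction}\n\nInput: {text}\n\nResponse:")
    return {"instruction": instruction, "input": text, "output": output}
```
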
  • Training tricks exist for SFT

    • NEFTune is a regularization technique: add noise to the embeddings during SFT to avoid overfitting (see the sketch below).
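
A sketch of the NEFTune noise rule from the paper: uniform noise scaled by alpha / sqrt(seq_len * dim), added to the input embeddings during training only (the paper tries alpha in {5, 10, 15}). The standalone function here is illustrative; in practice this is hooked into the model's embedding layer.

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to a (batch, seq_len, dim) embedding tensor."""
    seq_len, dim = embeddings.shape[-2], embeddings.shape[-1]
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1)  # Uniform(-1, 1)
    return embeddings + scale * noise  # apply only in training mode
```
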
  • Whether at large scale (multi-task, FLAN) or small scale (LIMA), instruction formatting & content are crucial, e.g. chain of thought, few-shot examples, acknowledging the question before answering (see the example after this list).

    • Tokens can have very different information density, so give language models time to think. (Jason Wei)
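
For instance, a single SFT instance can bake these habits directly into the target text (contents invented for illustration):

```python
example = {
    "instruction": "Is 1,234 divisible by 3?",
    "output": (
        "Let's check whether 1,234 is divisible by 3. "      # acknowledge the question
        "A number is divisible by 3 iff its digit sum is: "  # chain of thought
        "1 + 2 + 3 + 4 = 10, and 10 is not divisible by 3. "
        "So 1,234 is not divisible by 3."                    # answer after the reasoning
    ),
}
```
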
  • Things to keep in mind for instruction data

    • diversity / mixture
    • complexity
    • quality
    • format (ChatML-style; see the sketch below)
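
On the format point, a sketch of ChatML-style rendering; the `<|im_start|>` / `<|im_end|>` markers are the ones ChatML uses, while the helper itself is illustrative.

```python
def to_chatml(messages):
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML-style string."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is SFT?"},
]))
```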