• Camels in a Changing Climate: Enhancing LM Adaptation with TÜLU 2

  • 326,154 samples (https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture); a minimal loading sketch is given after the mixture list below.

  • Data mixture

  • FLAN [Chung et al., 2022]: We use 50,000 examples sampled from FLAN v2.

  • CoT: To emphasize chain-of-thought (CoT) reasoning, we sample another 50,000 examples from the CoT subset of the FLAN v2 mixture.

  • Open Assistant 1 [Köpf et al., 2023]: We isolate the highest-scoring paths in each conversation tree and use these samples, resulting in 7,708 examples. Scores are taken from the quality labels provided by the original annotators of Open Assistant 1.

  • ShareGPT: We use all 114,046 examples from our processed ShareGPT dataset, as we found that including ShareGPT resulted in strong performance in prior work.

  • GPT4-Alpaca [Peng et al., 2023]: We sample 20,000 examples from GPT-4 Alpaca to further include distilled GPT-4 data.

  • Code-Alpaca [Chaudhary, 2023]: We use all 20,022 examples from Code Alpaca, following our prior V1 mixture, in order to improve model coding abilities.

  • LIMA [Zhou et al., 2023]: We use 1,030 examples from LIMA as a source of carefully curated data.

  • WizardLM Evol-Instruct V2 [Xu et al., 2023]: We sample 30,000 examples from WizardLM, which contains distilled data of increasing diversity and complexity.

  • Open-Orca [Lian et al., 2023]: We sample 30,000 examples generated by GPT-4 from OpenOrca, a reproduction of Orca [Mukherjee et al., 2023], which augments FLAN data with additional model-generated explanations.

  • Science literature: We include 7,544 examples from a mixture of scientific document understanding tasks, including question answering, fact-checking, summarization, and information extraction. A breakdown of tasks is given in Appendix C.
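
  For reference, below is a minimal sketch of how the released mixture could be loaded and its per-source composition inspected. It uses the Hugging Face `datasets` library; the column names `dataset` (source subset) and `messages` (conversation turns) are assumptions about the released dataset's schema and are not stated above.

  ```python
  # Sketch (not the official TÜLU 2 pipeline): load the released SFT mixture
  # and check how many examples come from each source subset.
  from collections import Counter

  from datasets import load_dataset

  # Load the released mixture (326,154 samples) from the Hugging Face Hub.
  mixture = load_dataset("allenai/tulu-v2-sft-mixture", split="train")

  # Assumed schema: a "dataset" column naming the source subset
  # (FLAN, CoT, Open Assistant 1, ShareGPT, ...).
  counts = Counter(mixture["dataset"])
  for source, n in counts.most_common():
      print(f"{source}: {n}")

  # Assumed schema: each example stores a multi-turn conversation as a list of
  # {"role": ..., "content": ...} messages.
  print(mixture[0]["messages"])
  ```

  Inspecting the per-source counts this way is a quick check that a local copy matches the subset sizes listed above (e.g., 50,000 FLAN v2 examples, 7,708 Open Assistant 1 examples).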