- Camels in a Changing Climate: Enhancing LM Adaptation with TÜLU 2
- 326,154 samples (https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture)
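The mixture is released at the URL above. As a quick sketch (assuming the Hugging Face `datasets` library and the repository's standard single `train` split), it can be loaded as follows; field names should be checked against the dataset card rather than taken from this snippet.

```python
# Minimal sketch (not from the paper): load the TÜLU 2 SFT mixture from the Hub.
# Assumes `pip install datasets` and that the repo exposes a single "train" split.
from datasets import load_dataset

tulu_v2 = load_dataset("allenai/tulu-v2-sft-mixture", split="train")

print(len(tulu_v2))          # should be on the order of the 326,154 samples noted above
print(tulu_v2.column_names)  # inspect the schema instead of assuming field names
```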
- Data mixture
- FLAN [Chung et al., 2022]: We use 50,000 examples sampled from FLAN v2.
- CoT: To emphasize chain-of-thought (CoT) reasoning, we sample another 50,000 examples from the CoT subset of the FLAN v2 mixture.
- Open Assistant 1 [Köpf et al., 2023]: We isolate the highest-scoring paths in each conversation tree and use these samples, resulting in 7,708 examples. Scores are taken from the quality labels provided by the original annotators of Open Assistant 1 (see the path-selection sketch after this list).
- ShareGPT: We use all 114,046 examples from our processed ShareGPT dataset, as including ShareGPT resulted in strong performance in prior work.
- GPT4-Alpaca [Peng et al., 2023]: We use 20,000 examples sampled from GPT-4 Alpaca to further include distilled GPT-4 data.
- Code-Alpaca [Chaudhary, 2023]: We use all 20,022 examples from Code Alpaca, following our prior V1 mixture, in order to improve model coding abilities.
- LIMA [Zhou et al., 2023]: We use 1,030 examples from LIMA as a source of carefully curated data.
- WizardLM Evol-Instruct V2 [Xu et al., 2023]: We sample 30,000 examples from WizardLM, which contains distilled data of increasing diversity and complexity.
- Open-Orca [Lian et al., 2023]: We sample 30,000 examples generated by GPT-4 from OpenOrca, a reproduction of Orca [Mukherjee et al., 2023], which augments FLAN data with additional model-generated explanations.
- Science literature: We include 7,544 examples from a mixture of scientific document understanding tasks, including question answering, fact-checking, summarization, and information extraction. A breakdown of tasks is given in Appendix C.
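On the Open Assistant 1 selection referenced above: one simple interpretation of "isolate the highest-scoring path in each conversation tree" is a greedy walk that follows the best-rated reply at every level. The sketch below is illustrative only; the node schema (`parent_id`, `quality`) and the greedy rule are assumptions, not the released preprocessing code.

```python
# Illustrative sketch (assumed node schema, not the official TÜLU preprocessing):
# each message is a dict {"id", "parent_id", "text", "quality"}, where "quality"
# stands in for the Open Assistant annotator labels and parent_id is None at the root.
from typing import Optional


def highest_scoring_path(messages: list[dict]) -> list[dict]:
    """Greedily follow the best-rated reply at each level of one conversation tree."""
    children: dict[str, list[dict]] = {}
    root = None
    for msg in messages:
        if msg["parent_id"] is None:
            root = msg  # the initial prompt
        else:
            children.setdefault(msg["parent_id"], []).append(msg)

    path = []
    node: Optional[dict] = root
    while node is not None:
        path.append(node)
        replies = children.get(node["id"], [])
        # Follow the reply with the highest quality label; stop when the branch ends.
        node = max(replies, key=lambda m: m["quality"], default=None)
    return path  # one high-quality user/assistant thread from this tree
```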
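Taken together, the mixture is conceptually "take a fixed quota from each source and concatenate." The sketch below records that bookkeeping with the quotas listed above; the `load_source` helper and the seed are hypothetical, and the quotas sum to slightly more than the 326,154 released examples, so this is only an approximation of the released mixture.

```python
# Illustrative sketch of assembling an SFT mixture from per-source quotas.
# `load_source(name)` is a hypothetical helper returning a list of already-formatted
# examples for each source; the quotas mirror the counts listed above.
import random

QUOTAS = {
    "flan_v2": 50_000,
    "flan_cot": 50_000,
    "oasst1": 7_708,          # highest-scoring paths, see the sketch above
    "sharegpt": 114_046,      # used in full
    "gpt4_alpaca": 20_000,
    "code_alpaca": 20_022,    # used in full
    "lima": 1_030,
    "wizardlm": 30_000,
    "open_orca": 30_000,
    "science": 7_544,
}


def build_mixture(load_source, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    mixture = []
    for name, quota in QUOTAS.items():
        examples = load_source(name)
        if len(examples) > quota:
            examples = rng.sample(examples, quota)  # subsample the larger sources
        mixture.extend(examples)
    rng.shuffle(mixture)
    return mixture
```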